


Open models tailored for software development give developers the power and flexibility to run the next generation of coding agents on their own terms. OpenHands LM 32B is one such model: built on Qwen 2.5 Coder 32B Instruct and refined with data from the OpenHands agent, this 32B parameter model offers a 128k token context window suited to complex coding scenarios.
However, accessing a capable open model is just the starting point. How do you integrate it reliably into your agent’s workflow within your own infrastructure? Running these models efficiently requires a dedicated serving solution. That’s where the combination of Google Kubernetes Engine (GKE) for orchestration and the vLLM framework for optimized inference comes into play.
This guide dives into the practical steps of deploying the OpenHands LM 32B model onto GKE using vLLM. We’ll show you how to build a robust, scalable, and self-hosted inference endpoint to power your OpenHands agent, giving you full control over performance, privacy, and cost.
Serving the OpenHands LM model yourself using GKE and vLLM offers several advantages for your agent workflows:
- Control: Keep your code and agent interactions within your infrastructure. Running the LLM locally gives you full authority over its deployment, updates, and overall lifecycle.
- Performance: vLLM is a highly optimized serving framework designed for LLMs. It leverages techniques like PagedAttention and continuous batching to maximize throughput on GPU hardware, providing faster responses for your agent.
- Scalability & Management: GKE provides a robust, managed Kubernetes environment. It simplifies deploying, scaling, and managing containerized applications like vLLM, complete with GPU support (including powerful options like NVIDIA A100s) and autoscaling capabilities.
- Cost-Effectiveness: GKE allows granular control over resources, and combined with vLLM’s efficiency, can potentially reduce inference costs compared to proprietary API calls for your agent.
- Open Standards: Utilize open-source tools (OpenHands agent, OpenHands LM model, vLLM, Kubernetes) for a flexible and open stack. vLLM provides an OpenAI-compatible API endpoint, making integration with agents like OpenHands seamless.
Let’s get hands-on with deploying the model.
Prerequisites:
- A Google Cloud project with billing enabled.
- The gcloud CLI and kubectl (both are preinstalled in Google Cloud Shell).
- Sufficient GPU quota (H100 80GB in this guide) in your chosen region.
- A Hugging Face access token so vLLM can download the model.
1. Set up your GKE Cluster
To get started, open Google Cloud Shell and authenticate:
gcloud auth login
You’ll need a GKE cluster with GPU nodes. You can use either Autopilot (for a more managed experience) or Standard. Ensure you select a region/zone that supports the GPUs you intend to use (e.g., nvidia-h100-80gb). Consider using Spot VMs or Preemptible VMs in your node pools for potential cost savings, as the deployment manifest includes tolerations for them.
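Before creating the cluster, you can confirm which zones in your candidate region actually offer the accelerator. A quick check with the gcloud CLI (the filter value is just an example; adjust it for other GPU types):

# List zones that offer the H100 80GB accelerator
gcloud compute accelerator-types list \
    --filter="name=nvidia-h100-80gb" \
    --format="table(name, zone)"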
A basic Autopilot cluster creation might look like this (adjust region and channel as needed):
export PROJECT_ID="your-project-id"
export REGION="us-central1" # Choose a region with H100 80GB GPUs
export CLUSTER_NAME="openhands-cluster"

gcloud config set project $PROJECT_ID
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --region=${REGION} \
    --release-channel=rapid
gcloud container clusters get-credentials ${CLUSTER_NAME} --region=${REGION}
(Ensure your cluster configuration allows for the creation of nodes with the required H100 GPUs, potentially using Spot/Preemptible options if desired).
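Once the cluster is up and credentials are fetched, a quick sanity check confirms kubectl is pointing at it. With Autopilot, GPU nodes are provisioned on demand when a workload requests them, so the node list may be empty or GPU-free at this stage:

# Confirm kubectl is talking to the new cluster
kubectl config current-context
kubectl cluster-info

# Nodes are provisioned on demand; GPU nodes appear once the deployment is applied
kubectl get nodes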
2. Create a Kubernetes Secret for Hugging Face
Store your Hugging Face token securely in a Kubernetes secret so vLLM can download the model:
export HF_TOKEN="your-hugging-face-token"

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN}
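You can verify the secret was created, and which keys it holds, without printing the token itself:

# Shows the secret's keys and data sizes, but not the decoded values
kubectl describe secret hf-secret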
3. Define the vLLM Deployment
Now, create a Kubernetes deployment manifest (openhands-deployment.yaml) to run the vLLM server with the OpenHands LM model. This configuration uses an H100 GPU and includes several performance tuning parameters:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openhands-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openhands-server
  template:
    metadata:
      labels:
        app: openhands-server # Label applied to pods
        ai.gke.io/model: openhands-lm-32b-v0.1
        ai.gke.io/inference-server: vllm
    spec:
      # Tolerations allow scheduling on Spot/Preemptible VMs
      tolerations:
      - key: "cloud.google.com/gke-preemptible"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "cloud.google.com/gke-spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: inference-server
        image: docker.io/vllm/vllm-openai:latest
        resources:
          requests:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "10"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu: "1"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)            # Model is specified via env var
        - --tensor-parallel-size=1       # Single-GPU tensor parallelism
        - --host=0.0.0.0                 # Listen on all interfaces
        - --port=8000                    # Port for the API server
        - --swap-space=16                # CPU swap space in GiB (per GPU)
        - --gpu-memory-utilization=0.95  # Target GPU memory usage
        - --max-model-len=16384          # Max model sequence length (tuned value)
        - --max-num-seqs=256             # Max concurrent sequences (tuned for throughput)
        - --dtype=bfloat16               # Use bfloat16 precision (good for H100)
        env:
        - name: MODEL_ID                 # Environment variable for the model name
          value: all-hands/openhands-lm-32b-v0.1
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm            # Shared memory for inter-process communication
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory                 # Use RAM for shared memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: openhands-server # Must match the Deployment's pod label
  type: ClusterIP # Internal service, accessible within the cluster
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
Key points in this manifest:
- tolerations: Allows pods to be scheduled onto GKE nodes marked as preemptible or Spot VMs, which can offer significant cost savings but may be interrupted.
- image: Uses the standard vllm/vllm-openai:latest image directly from Docker Hub.
- args:
  - --max-model-len=16384: The maximum sequence length is set lower than the model’s maximum (128k), a tuning choice for memory or performance trade-offs.
  - --max-num-seqs=256: The maximum number of sequences (which generally correspond to individual user requests or prompts) that the vLLM engine can process concurrently in a single batch.
  - --dtype=bfloat16: Specifies the use of bfloat16 precision, which often improves performance on compatible hardware like H100 GPUs with minimal impact on accuracy.
- env: Includes the MODEL_ID environment variable used in the arguments.
- nodeSelector: Specifically targets nvidia-h100-80gb GPUs.
4. Deploy to GKE
Apply the manifest to your cluster:
kubectl apply -f openhands-deployment.yaml
Monitor the pod status:
kubectl get pods -l app=openhands-server -w
Wait until the pod is in the Running state. It might take some time initially as the container image is pulled and the large model files are downloaded from Hugging Face. You can check the logs for progress:
kubectl logs -f -l app=openhands-server -c inference-server
Look for messages indicating the server has started successfully, such as:
INFO: Application startup complete.
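It is also worth confirming that the Service has picked up the pod, since the Service’s selector must match the pod label:

# The ENDPOINTS column should show the pod's IP once the pod is Ready
kubectl get service llm-service
kubectl get endpoints llm-service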
5. Test your model
Once the vLLM server has started successfully, you can test the endpoint. Consider opening another Cloud Shell tab for this step.
First, use port forwarding to create a secure tunnel from your local machine to the service inside GKE:
kubectl port-forward service/llm-service 8000:8000
Next, in another tab, use curl or another tool to interact with the API. Send a test prompt:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "all-hands/openhands-lm-32b-v0.1",
"prompt": "Implement a Python function called find_longest_word that takes a single argument: a list of strings. The function should return the longest word from the list. If there are multiple words with the same maximum length, return the first one encountered in the list. If the input list is empty, return None.",
"max_tokens": 8192
}'
We get a JSON response back. Formatting the code portion of the response, we see:
def find_longest_word(words):
    """
    Find the longest word in a list of strings, ignoring non-string elements.

    Args:
        words (list): A list of elements which may include strings.

    Returns:
        str or None: The longest string in the list, or None if no strings are present.
    """
    # Initialize variables to track the longest word and its length
    longest = None
    max_length = 0

    # Iterate over each element in the list
    for word in words:
        # Only consider string elements
        if isinstance(word, str):
            # Update the longest word only when a strictly longer one is found,
            # so the first word of the maximum length wins ties
            if len(word) > max_length:
                longest = word
                max_length = len(word)

    # Return the longest word found, or None if no strings were present
    return longest

# Test cases
print(find_longest_word(["apple", "banana", "cherry"]))    # Output: "banana" (first of the two 6-letter words)
print(find_longest_word(["dog", 123, "elephant", "cat"]))  # Output: "elephant"
print(find_longest_word([]))                               # Output: None
print(find_longest_word([1, 2, 3]))                        # Output: None
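Because vLLM exposes an OpenAI-compatible API, you can also call the same endpoint from the official openai Python client, which is how many tools and agent frameworks integrate with self-hosted models. A minimal sketch, assuming the openai package is installed locally and the port-forward from the previous step is still running:

from openai import OpenAI

# Point the client at the port-forwarded vLLM service.
# vLLM does not enforce an API key unless started with --api-key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.completions.create(
    model="all-hands/openhands-lm-32b-v0.1",
    prompt="Write a Python function that reverses a string.",
    max_tokens=256,
)

print(response.choices[0].text)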
Beyond the initial deployment, keep a few operational considerations in mind:
- Performance Tuning: Experiment with the vLLM arguments used in the manifest, such as --gpu-memory-utilization=0.95, --max-num-seqs=256, --max-model-len=16384, and --dtype=bfloat16. These values are starting points and may need adjustment based on your specific workload, desired latency/throughput trade-off, and exact hardware (H100 80GB in this example).
- Scaling: For production loads, use a GKE HorizontalPodAutoscaler (HPA) targeting CPU or GPU utilization, or custom vLLM metrics (like vllm:num_requests_waiting via Prometheus) to automatically scale your deployment replicas; see the HPA sketch after this list.
- Cost Optimization: The inclusion of tolerations for Spot/Preemptible VMs allows scheduling on lower-cost, interruptible nodes. Ensure your application or workflow can handle potential node interruptions if you utilize this.
- Model Loading Time: For large models like this 32B parameter one, initial download and loading onto new pods during scaling can be slow. For faster scaling and pod startup times, consider pre-downloading the model onto a persistent volume, potentially using GKE Image Streaming or Hyperdisk ML for optimized read performance, as detailed in the GKE documentation.
- Model Limitations: As a research preview, evaluate OpenHands LM carefully. It’s tuned for GitHub issues, may occasionally repeat steps, and is sensitive to quantization. Expect improvements in later releases and a future 7B parameter model version.
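As a starting point for the scaling approach mentioned above, here is a minimal HorizontalPodAutoscaler sketch that scales on CPU utilization; targeting GPU utilization or vLLM metrics such as vllm:num_requests_waiting additionally requires a custom or external metrics adapter (for example, the Prometheus adapter). The HPA name below is illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-openhands-hpa  # Illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-openhands-deployment
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Remember that each additional replica requests its own H100 GPU, so keep the replica bounds aligned with your accelerator quota and budget.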
The release of OpenHands LM 32B is a significant stride towards powerful, open language models purpose-built for coding tasks. By deploying it on Google Kubernetes Engine using the high-performance vLLM framework, you have a specialized coding “brain” to support your coding agents. This setup empowers you to leverage cutting-edge AI for software development within your own environment.
Explore the OpenHands LM model, experiment with this GKE deployment strategy, and try out a coding agent! What are your thoughts on the future of coding agents? Share your perspective by continuing the discussion with me on LinkedIn, X, or BlueSky.