
Qwen3, what is that? Qwen is a family of LLMs from Alibaba, released under open licenses and available on Hugging Face. In this blog demo you'll run the Qwen/Qwen3-235B-A22B-Thinking-2507 model (impressive number) on state-of-the-art NVIDIA B200 GPUs, available through the Google Cloud A4 VM family.
We’ll use a GKE Autopilot cluster and vLLM for inference.
You can also jump right to the Google Cloud documentation and deploy the faster Qwen3-235B-A22B-Instruct-2507 model (without thinking) with the tutorial here:
✅Use vLLM on GKE to run inference with Qwen3
Let’s get started!
Here’s what we’ll accomplish:
- Select the Qwen 3 model on Hugging Face.
- Deploy a GKE Autopilot cluster with a powerful A4 node pool.
- Use vLLM to serve our model efficiently.
- Configure a PodMonitoring resource to collect metrics from our vLLM server.
- Expose the model internally with a ClusterIP service.
- Run a quick inference test to see it all in action.
You'll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide. To get a future reservation for the A3 Ultra, A4, and A4X VM families, you may need to contact your TAM or sales team.
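If you already have a reservation, you can look up its name and zone with gcloud before proceeding (a quick sanity check, assuming the reservation lives in the same project):
gcloud compute reservations list --project=PROJECT_ID   # lists reservation names, zones, and counts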
Select or create a project to use for your resources and billing, then enable the GKE API:
gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles:
roles/container.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
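For example, granting the Kubernetes Engine Admin role might look like this (the project ID and email below are placeholders, not real values):
# Placeholder project and user — substitute your own.
gcloud projects add-iam-policy-binding my-gke-project \
  --member="user:alex@example.com" \
  --role="roles/container.admin"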
Configure variables
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
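For reference, a filled-in version could look like the following (every value here is an illustrative placeholder; use your own project, reservation, region, and network):
gcloud config set project my-gke-project
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=projects/my-gke-project/reservations/my-a4-reservation
export REGION=us-central1
export CLUSTER_NAME=qwen3-demo-cluster
export HUGGING_FACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
export NETWORK=default
export SUBNETWORK=default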
- Sign in to your Hugging Face account: https://huggingface.co/login
- Navigate to the Qwen3 model (Qwen3-235B-A22B-Thinking-2507).
- Accept the model's license agreement if one is required to get access.
- Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard for later use.
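Optionally, you can verify the token before using it in the cluster; the Hugging Face Hub exposes a whoami endpoint that returns your account details for a valid token:
# Returns your Hugging Face username if the token is valid.
curl -s -H "Authorization: Bearer ${HUGGING_FACE_TOKEN}" https://huggingface.co/api/whoami-v2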
1. To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete.
2. Connect to the cluster:
gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
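To confirm kubectl is now pointed at the new cluster:
kubectl config current-context   # should include your cluster name
kubectl get nodes                # Autopilot provisions nodes on demand, so the list may be short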
3. Configure a Kubernetes secret for your Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
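You can confirm the secret was created correctly (this decodes the token locally, so avoid running it on a shared screen):
kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode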
Now you can deploy the workloads: a Deployment whose pods run vLLM and serve the model, a ClusterIP Service to expose the workload, and a PodMonitoring definition to collect metrics from the vLLM container.
1. Create a deployment manifest called qwen3thinkdeploy.yaml with the following content.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-server
  template:
    metadata:
      labels:
        app: qwen3-server
        ai.gke.io/model: Qwen3-235B-A22B-Thinking-2507
        ai.gke.io/inference-server: vllm
    spec:
      containers:
      - name: qwen-inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "500Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "500Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        - --dtype=bfloat16
        env:
        - name: MODEL_ID
          value: "Qwen/Qwen3-235B-A22B-Thinking-2507"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: YOUR_RESERVATION_NAME
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-service
spec:
  selector:
    app: qwen3-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: qwen3-monitoring
spec:
  selector:
    matchLabels:
      app: qwen3-server
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
P.S. Ensure you replace YOUR_RESERVATION_NAME with the name of your reservation for this to work.
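Before applying it, you can validate the manifest client-side; this catches YAML and schema errors without touching the cluster:
kubectl apply --dry-run=client -f qwen3thinkdeploy.yaml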
2. Run the deployment
kubectl apply -f qwen3thinkdeploy.yaml
3. You can monitor the deployment using various commands, for example:
kubectl get deployment #shows all deployments
kubectl get pods #shows all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME #shows you the details
kubectl describe pod ADD_POD_NAME #shows you the details of pod
kubectl logs ADD_POD_NAME #shows you the pod logs
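Since the image and the 235B model weights are large, startup can take a long time (note the probes above allow up to 1800 seconds). Watching pod status and streaming the server logs is the easiest way to track progress:
kubectl get pods -w                                 # watch until the pod reports Ready
kubectl logs -f deployment/vllm-qwen3-deployment    # follow the vLLM download and startup logs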
You can make a call to the LLM with a simple test.
- Set up port forwarding to Qwen3:
kubectl port-forward service/qwen3-service 8000:8000
- Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Thinking-2507",
    "messages": [
      {
        "role": "user",
        "content": "Describe the Qwen model in one short sentence."
      }
    ]
  }'
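Because vLLM exposes an OpenAI-compatible API, you can also hit the models endpoint as a quick sanity check that the server is up (piping through jq is optional and assumes you have it installed):
curl -s http://127.0.0.1:8000/v1/models | jq .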
This model will do some thinking, which may take a little time depending on the complexity of the question. For faster responses, you can also check out the non-thinking models. You should get a creative reply straight from the AI endpoint once it has finished thinking!
From here, you could expose this service publicly with a load balancer or Gateway and build a Streamlit app to interact with it.
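As a rough sketch of the load balancer option: on GKE, patching the Service type to LoadBalancer provisions an external load balancer. Note this would expose the model to the internet without authentication, so treat it strictly as a demo step:
# Demo only — this exposes the endpoint publicly with no auth.
kubectl patch service qwen3-service -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get service qwen3-service --watch   # wait for an external IP to appear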
1. Delete the deployment and secret
kubectl delete -f qwen3thinkdeploy.yaml
kubectl delete secret hf-secret
2. Delete the cluster
gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
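To confirm the cleanup worked:
gcloud container clusters list   # the deleted cluster should no longer appear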
You can check out the Qwen3-235B-A22B-Instruct-2507 (without thinking) tutorial in the Google Cloud documentation here:
✅Use vLLM on GKE to run inference with Qwen3
Other tutorials on A4 VMs (Gemma & Llama 4):
✅Use vLLM on GKE to run inference with Llama 4
✅Deploy and serve Gemma 3 27B inference with vLLM on GKE
To connect or ask a question, please check me out on LinkedIn. I'll be in touch!