
Qwen3, what is that? Qwen is a family of LLMs from Alibaba, released under open licenses and available on Hugging Face. In this blog demo you'll run the Qwen/Qwen3-235B-A22B-Thinking-2507 model (impressive number) on state-of-the-art NVIDIA B200 GPUs, available through the Google Cloud A4 VM family.
We’ll use a GKE Autopilot cluster and vLLM for inference.
You can also jump right to the Google Cloud documentation and deploy the faster Qwen3-235B-A22B-Instruct-2507 model (without thinking) with the tutorial here:
✅Use vLLM on GKE to run inference with Qwen3
Let’s get started!
Here’s what we’ll accomplish:
- Select the Qwen 3 model on Hugging Face.
- Deploy a GKE Autopilot cluster with a powerful A4 node pool.
- Use vLLM to serve our model efficiently.
- Configure a PodMonitoring resource to collect metrics from our vLLM server.
- Expose the model internally with a ClusterIP service.
- Run a quick inference test to see it all in action.
You'll need a Google Cloud project with billing enabled and a reservation for the A4 machine type to follow this guide. To get a future reservation for the A3 Ultra, A4, and A4X VM families, you may need to contact your TAM or sales team.
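If you already have a reservation, you can look up its name and zone with gcloud before proceeding (a quick sanity check, assuming the reservation lives in the same project):
gcloud compute reservations list --project=PROJECT_ID   # lists reservation names, zones, and counts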
Select or create a project to use for your resources and billing, then enable the GKE API:
gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles:
roles/container.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
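For example, granting the Kubernetes Engine Admin role might look like this (the project ID and email below are placeholders, not real values):
# Placeholder project and user — substitute your own.
gcloud projects add-iam-policy-binding my-gke-project \
  --member="user:alex@example.com" \
  --role="roles/container.admin"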
Configure variables
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=RESERVATION_URL
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN
export NETWORK=NETWORK_NAME
export SUBNETWORK=SUBNETWORK_NAME
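For reference, a filled-in version could look like the following (every value here is an illustrative placeholder; use your own project, reservation, region, and network):
gcloud config set project my-gke-project
export PROJECT_ID=$(gcloud config get project)
export RESERVATION_URL=projects/my-gke-project/reservations/my-a4-reservation
export REGION=us-central1
export CLUSTER_NAME=qwen3-demo-cluster
export HUGGING_FACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
export NETWORK=default
export SUBNETWORK=default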
- Sign in to your Hugging Face account: https://huggingface.co/login
- Navigate to the Qwen3 model (Qwen3-235B-A22B-Thinking-2507).
- Accept the model's license agreement if one is required to get access.
- Next, create a token: click Your Profile > Settings > Access Tokens > + Create new token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard for later use.
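Optionally, you can verify the token before using it in the cluster; the Hugging Face Hub exposes a whoami endpoint that returns your account details for a valid token:
# Returns your Hugging Face username if the token is valid.
curl -s -H "Authorization: Bearer ${HUGGING_FACE_TOKEN}" https://huggingface.co/api/whoami-v2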
1. To create a GKE cluster in Autopilot mode, run the following command:
gcloud container clusters create-auto $CLUSTER_NAME \
--project=$PROJECT_ID \
--region=$REGION \
--release-channel=rapid \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
Creating the GKE cluster might take some time to complete.
2. Connect to the cluster:
gcloud container clusters get-credentials $CLUSTER_NAME \
--location=$REGION
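To confirm kubectl is now pointed at the new cluster:
kubectl config current-context   # should include your cluster name
kubectl get nodes                # Autopilot provisions nodes on demand, so the list may be short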
3. Configure a Kubernetes secret for your Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
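You can confirm the secret was created correctly (this decodes the token locally, so avoid running it on a shared screen):
kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode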
Now you can deploy the workloads: a Deployment whose pods run vLLM and serve the model, a ClusterIP Service to expose the workload, and a PodMonitoring definition to collect metrics from the vLLM container.
1. Create a deployment manifest called qwen3thinkdeploy.yaml with the following content.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-server
  template:
    metadata:
      labels:
        app: qwen3-server
        ai.gke.io/model: Qwen3-235B-A22B-Thinking-2507
        ai.gke.io/inference-server: vllm
    spec:
      containers:
      - name: qwen-inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250801_0916_RC01
        resources:
          requests:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "500Gi"
            nvidia.com/gpu: "8"
          limits:
            cpu: "10"
            memory: "1000Gi"
            ephemeral-storage: "500Gi"
            nvidia.com/gpu: "8"
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=8
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=8192
        - --max-num-seqs=4
        - --dtype=bfloat16
        env:
        - name: MODEL_ID
          value: "Qwen/Qwen3-235B-A22B-Thinking-2507"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 1800
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-b200
        cloud.google.com/reservation-name: YOUR_RESERVATION_NAME
        cloud.google.com/reservation-affinity: "specific"
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-service
spec:
  selector:
    app: qwen3-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: qwen3-monitoring
spec:
  selector:
    matchLabels:
      app: qwen3-server
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
P.S. Ensure you replace YOUR_RESERVATION_NAME with the name of your reservation for this to work.
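Before applying it, you can validate the manifest client-side; this catches YAML and schema errors without touching the cluster:
kubectl apply --dry-run=client -f qwen3thinkdeploy.yaml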
2. Run the deployment
kubectl apply -f qwen3thinkdeploy.yaml
3. You can monitor the deployment using various commands, for example:
kubectl get deployment #shows all deployments
kubectl get pods #shows all pods
kubectl describe deployment ADD_DEPLOYMENT_NAME #shows you the details
kubectl describe pod ADD_POD_NAME #shows you the details of pod
kubectl logs ADD_POD_NAME #shows you the pod logs
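Since the image and the 235B model weights are large, startup can take a long time (note the probes above allow up to 1800 seconds). Watching pod status and streaming the server logs is the easiest way to track progress:
kubectl get pods -w                                 # watch until the pod reports Ready
kubectl logs -f deployment/vllm-qwen3-deployment    # follow the vLLM download and startup logs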
You can make a call to the LLM with a simple test.
- Set up port forwarding to Qwen3:
kubectl port-forward service/qwen3-service 8000:8000
- Open a new terminal window. You can then chat with your model by using curl:
curl http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Thinking-2507",
    "messages": [
      {
        "role": "user",
        "content": "Describe the Qwen model in one short sentence."
      }
    ]
  }'
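Because vLLM exposes an OpenAI-compatible API, you can also hit the models endpoint as a quick sanity check that the server is up (piping through jq is optional and assumes you have it installed):
curl -s http://127.0.0.1:8000/v1/models | jq .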
This model will do some thinking, which may take a little time depending on the complexity of the question. For faster responses, you can also check out the non-thinking models. You should get a creative reply straight from the AI endpoint once it has finished thinking!
From here, you could expose this service publicly with a load balancer or Gateway and build a Streamlit app to interact with it.
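As a rough sketch of the load balancer option: on GKE, patching the Service type to LoadBalancer provisions an external load balancer. Note this would expose the model to the internet without authentication, so treat it strictly as a demo step:
# Demo only — this exposes the endpoint publicly with no auth.
kubectl patch service qwen3-service -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get service qwen3-service --watch   # wait for an external IP to appear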
1. Delete the deployment and secret
kubectl delete -f qwen3thinkdeploy.yaml
kubectl delete secret hf-secret
2. Delete the cluster
gcloud container clusters delete $CLUSTER_NAME \
--region=$REGION
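To confirm the cleanup worked:
gcloud container clusters list   # the deleted cluster should no longer appear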
You can check out the Qwen3-235B-A22B-Instruct-2507 (without thinking) tutorial in the Google Cloud documentation here:
✅Use vLLM on GKE to run inference with Qwen3
Other tutorials on A4 VMs (Gemma & Llama 4):
✅Use vLLM on GKE to run inference with Llama 4
✅Deploy and serve Gemma 3 27B inference with vLLM on GKE
To connect or ask a question, please check me out on LinkedIn. I'll be in touch!