
The funny thing is, enabling such features on GKE Autopilot is just two arguments away:
gcloud container clusters update vllm-cluster \
--region us-central1 \
--monitoring=SYSTEM,DCGM \
--auto-monitoring-scope=ALL
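If you want to confirm that the managed DCGM metrics are actually enabled, a quick look at the cluster's monitoring configuration is enough (the exact output layout may vary by gcloud version):

# Inspect the monitoring configuration of the cluster
gcloud container clusters describe vllm-cluster \
--region us-central1 \
--format="value(monitoringConfig)"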
Selecting a GPU capable of handling Gemma 3 27B was a critical consideration, driven by its substantial memory footprint. The 27-billion-parameter model requires a minimum of 46.4 GB of VRAM, not accounting for the KV cache, CUDA kernels, and other runtime overhead. For this initial investigation, I opted to benchmark the model at full bf16 precision to ensure reproducibility. While quantization techniques, readily accessible via tools like LLM Compressor, offer significant memory reduction, applying them involves algorithm selection and rigorous performance evaluation, a topic deserving of its own dedicated exploration. Anyone interested in vLLM’s memory optimization capabilities can find comprehensive information in the official Google Cloud documentation:
Alternatively, for those keen to explore the intricacies of quantization and compression techniques, I highly recommend Red Hat’s in-depth paper on the subject:
Finally, Google recently released an illustrative animation visualizing GPU memory utilization during LLM inference. You can find it here:
To ensure complete reproducibility of the platform, you can follow these steps. First, provision a Google Kubernetes Engine (GKE) Autopilot cluster, specifying your desired network and subnet. The gcloud command for this step is provided below:
gcloud container clusters create-auto vllm-cluster \
--region "us-central1" \
--release-channel "regular" \
--tier "standard" \
--enable-dns-access \
--enable-ip-access \
--enable-master-global-access \
--no-enable-google-cloud-access \
--network "YOUR_NETWORK" --subnetwork "YOUR_SUBNETWORK" \
--cluster-ipv4-cidr "/17" \
--monitoring=SYSTEM,DCGM \
--auto-monitoring-scope=ALL \
--binauthz-evaluation-mode=DISABLED \
--scopes=https://www.googleapis.com/auth/cloud-platform
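Once the cluster is up, fetch credentials and confirm that kubectl can reach the control plane. These are standard commands; adjust the cluster name and region if you changed them above:

# Configure kubectl against the new Autopilot cluster
gcloud container clusters get-credentials vllm-cluster --region us-central1

# Autopilot provisions nodes on demand, so an empty or minimal node list here is expected
kubectl get nodes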
Next, apply the following Kubernetes manifest. This manifest configures several essential components:
- A StorageClass for Hyperdisk ML persistent volumes.
- The vLLM namespace.
- A ClusterIP service for the vLLM deployment.
- A PersistentVolumeClaim to store the model data.
- A Secret containing your Hugging Face access token (replace the placeholder with your actual token).
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-ml
parameters:
  type: hyperdisk-ml
  provisioned-throughput-on-create: "2000Mi"
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: false
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - read_ahead_kb=4096
---
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
---
apiVersion: v1
kind: Service
metadata:
  namespace: vllm
  name: llm-service
spec:
  selector:
    app: vllm-openai-api
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: vllm
  name: hdml-static-pvc
spec:
  storageClassName: "hyperdisk-ml"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: vllm
type: Opaque
stringData:
  token: ""
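Assuming you saved the manifest above as infra.yaml (the filename is arbitrary), apply it and populate the Hugging Face token. You can either edit the stringData field directly or inject the token from the command line, for example:

# Create the StorageClass, namespace, Service, PVC and Secret
kubectl apply -f infra.yaml

# Alternatively, set the Hugging Face token without editing the YAML
kubectl -n vllm create secret generic hf-secret \
--from-literal=token="YOUR_HF_TOKEN" \
--dry-run=client -o yaml | kubectl apply -f -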
With the infrastructure in place, deploying vLLM is straightforward. The following manifest deploys vLLM version 0.8.3 on a single NVIDIA H100 GPU.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: vllm
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai-api
  template:
    metadata:
      labels:
        app: vllm-openai-api
        ai.gke.io/model: gemma-3-27b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.3
          resources:
            requests:
              cpu: "8"
              memory: "128Gi"
              ephemeral-storage: "120Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "128Gi"
              ephemeral-storage: "120Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - --model=$(MODEL_ID)
            - --tensor-parallel-size=1
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.90
            - --trust-remote-code
            - --enable-chunked-prefill
            - --dtype=bfloat16
            - --swap-space=16
          env:
            - name: MODEL_ID
              value: google/gemma-3-27b-it
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - name: hdml-static
              mountPath: /root/.cache/huggingface
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: hdml-static
          persistentVolumeClaim:
            claimName: hdml-static-pvc
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
        cloud.google.com/gke-gpu-driver-version: latest
        topology.kubernetes.io/zone: "us-central1-a"
      terminationGracePeriodSeconds: 15
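Assuming the deployment manifest is saved as vllm-deployment.yaml, a minimal smoke test after the rollout could look like the sketch below. The first start is slow because the model weights are downloaded into the Hyperdisk ML volume:

kubectl apply -f vllm-deployment.yaml

# Wait for the pod to become Ready (model download and load can take a while)
kubectl -n vllm rollout status deployment/vllm-gemma-deployment --timeout=30m

# Port-forward the ClusterIP service and query the OpenAI-compatible API
kubectl -n vllm port-forward service/llm-service 8000:8000 &
sleep 3
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-27b-it", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'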
To further optimize costs, consider deploying vLLM on a Spot VM. Achieve this by adding the following label to the nodeSelector section of your vLLM deployment: cloud.google.com/gke-spot: "true".
Having worked extensively with Kubernetes for the past seven years, I continue to be amazed by the simplicity and reliability of GKE Autopilot: it works like a charm and consistently delivers a smooth, efficient operational experience.
To evaluate performance, the vLLM benchmark was employed. This benchmark focuses on key metrics:
- Throughput: Measured in tokens generated per second.
- Latency: Encompassing Time to First Token (TTFT), Time per Output Token (TPOT), and Inter-token Latency.
- Customizable Dataset Support: Enabling performance assessment across various conversational scenarios.
For this analysis, the ShareGPT dataset was utilized, a collection of real-world conversations between users and LLMs.
Sourced from users voluntarily sharing their chat logs via the ShareGPT browser extension, it offers several benefits:
- Realistic Dialogues: Captures diverse user queries, prompts, and model responses, reflecting practical LLM usage.
- Broad Topic Coverage: Spans numerous subjects, including coding, writing, creative brainstorming, and general knowledge.
- Real-world Interactions: Provides insights into how individuals are actually interacting with LLMs.
The vLLM benchmark can be run through a Kubernetes manifest we wrote for this work:
apiVersion: v1
kind: Pod
metadata:
  name: vllm-benchmark
  namespace: vllm
spec:
  containers:
    - name: vllm-benchmark
      image: python:3.11-bookworm
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"
          ephemeral-storage: "10Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
          ephemeral-storage: "10Gi"
      command: ["/bin/bash", "-c"]
      args:
        - |
          cd /data
          mkdir -p /data/tmp
          export TMPDIR=/data/tmp
          [ ! -f "/data/${DATASET_FILE}" ] && curl -L -o /data/"${DATASET_FILE}" "${DATASET_URL}"
          python3 -m venv vllm-benchmark
          source vllm-benchmark/bin/activate
          git clone https://github.com/vllm-project/vllm.git || true
          cd vllm && git checkout v${VLLM_VERSION}
          pip install --no-cache-dir -r requirements/test.txt
          pip install --no-cache-dir vllm==${VLLM_VERSION}
          echo "Init Completed"
          echo "kubectl -n vllm exec -it vllm-benchmark -- bash -c \" \\"
          echo " source /data/vllm-benchmark/bin/activate "
          echo " python3 /data/vllm/benchmarks/benchmark_serving.py \\"
          echo " --host=llm-service --port=8000 \\"
          echo " --dataset-name=sharegpt \\"
          echo " --dataset-path=/data/${DATASET_FILE} \\"
          echo " --tokenizer=google/gemma-3-27b-it \\"
          echo " --request-rate=500 --backend=openai \\"
          echo " --num-prompts=1000 --sharegpt-output-len 256 \\"
          echo " --save-result --model=google/gemma-3-27b-it\""
          sleep infinity
      env:
        - name: DATASET_FILE
          value: ShareGPT_V3_unfiltered_cleaned_split.json
        - name: DATASET_URL
          value: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
        - name: VLLM_VERSION
          value: 0.8.2
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: vllm-benchmark-pvc
  nodeSelector:
    topology.kubernetes.io/zone: "us-central1-a"
  terminationGracePeriodSeconds: 5
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-benchmark-pvc
  namespace: vllm
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "premium-rwo"
  resources:
    requests:
      storage: 20Gi
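Assuming the manifest above is saved as vllm-benchmark.yaml, the pod prepares its environment on startup and then idles, so the benchmark itself is launched with kubectl exec. The invocation below simply mirrors the command the init script prints:

kubectl apply -f vllm-benchmark.yaml
kubectl -n vllm logs -f vllm-benchmark        # wait for "Init Completed"

# Run one benchmark pass against the llm-service endpoint
kubectl -n vllm exec -it vllm-benchmark -- bash -c " \
source /data/vllm-benchmark/bin/activate && \
python3 /data/vllm/benchmarks/benchmark_serving.py \
--host=llm-service --port=8000 --backend=openai \
--dataset-name=sharegpt \
--dataset-path=/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--tokenizer=google/gemma-3-27b-it --model=google/gemma-3-27b-it \
--request-rate=500 --num-prompts=1000 \
--sharegpt-output-len 256 --save-result"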
After deployment, the benchmark is controlled via command-line arguments. The --request-rate parameter sets the request arrival rate in requests per second, which determines how much concurrency the server sees, while --num-prompts specifies the total number of prompts to process. For example, --request-rate=1 --num-prompts=10 sends 10 prompts at roughly one per second, so they are processed almost sequentially.
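To see how behavior changes as concurrency grows, one option is to sweep --request-rate while keeping --num-prompts fixed. A simple loop such as the sketch below, run from inside the benchmark pod with the virtual environment already activated, is enough (the rate values and prompt count here are arbitrary examples):

# Hypothetical sweep: repeat the benchmark at increasing request rates
for rate in 1 5 10 50 100; do
  python3 /data/vllm/benchmarks/benchmark_serving.py \
    --host=llm-service --port=8000 --backend=openai \
    --dataset-name=sharegpt \
    --dataset-path=/data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --tokenizer=google/gemma-3-27b-it --model=google/gemma-3-27b-it \
    --request-rate=${rate} --num-prompts=200 \
    --sharegpt-output-len 256 --save-result
done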
Fun fact: I deeply love specialty coffee, and one of my favorite roasteries in Europe is A Matter Of Concrete, based in Rotterdam.
This brings us to a crucial question: While GPUs excel at handling individual inference requests with minimal overhead, how do they perform under concurrent load?
Source Credit: https://medium.com/google-cloud/optimize-gemma-3-inference-vllm-on-gke-c071a08f7c78