Before we begin, here are a few things that Cloud Run requires when you enable NVIDIA L4 GPUs (24 GB):
- Minimum configuration is 4 vCPUs and 16 GiB of memory
- Instance-based billing is required (instead of request-based billing)
Gemma3 comes in four sizes (1B, 4B, 12B, 27B). We will be using the 4B model, but you should be able to deploy the 12B model as well.
Here’s what we’ll do at a high level:
- Use Ollama to serve the Gemma3:4B model on Cloud Run
- Store the model on a GCS Bucket and mount it as a Volume to Cloud Run
- Use OpenWebUI to interact with the model and provide a UI for users
- Deploy and host it as a 100% serverless capability on Cloud Run
Enable the required APIs:
gcloud services enable run.googleapis.com \
cloudbuild.googleapis.com \
storage.googleapis.com \
artifactregistry.googleapis.com
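The commands that follow use placeholders like [your gcp project id]. If you haven't already, you can set your default project and Cloud Run region so you don't have to repeat them everywhere:
gcloud config set project [your gcp project id]
gcloud config set run/region us-central1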
Download the Gemma3 4B model
Create a directory to work in:
mkdir serverless-gemma-demo
cd serverless-gemma-demo
Download Ollama from their website, or install it from the command line:
curl -fsSL https://ollama.com/install.sh | sh
Pull the Gemma3 4B Model & test it (ollama serve blocks, so run it in a separate terminal or background it):
ollama serve &
ollama pull gemma3:4b
ollama run gemma3:4b
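If you'd rather test over HTTP, you can also hit Ollama's local API directly; the prompt below is just an illustrative example:
curl http://localhost:11434/api/generate -d '{"model": "gemma3:4b", "prompt": "Why is the sky blue?", "stream": false}'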
Push the model to a GCS Bucket
You can do this via the console or CLI:
gcloud storage buckets create gs://[your gcs bucket]
gsutil cp -r ~/.ollama/models gs://[your gcs bucket]
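Optionally, verify the upload; you should see the manifests and blobs directories from ~/.ollama/models in the bucket:
gcloud storage ls --recursive gs://[your gcs bucket]/models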
Create the LLM Image for Cloud Run (Ollama + Gemma3)
Create a file called Dockerfile. Its contents should be:
FROM ollama/ollama

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434
# Store model weight files in /models
ENV OLLAMA_MODELS /models
# Reduce logging verbosity
ENV OLLAMA_DEBUG false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1
# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL
# Start Ollama
ENTRYPOINT ["ollama", "serve"]
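Optionally, before pushing to Artifact Registry, you can build and run the image locally as a smoke test (the local tag is just an example; stop the earlier ollama serve first so port 11434 is free, and note that without a GPU the model will run on CPU and be slow):
docker build -t ollama-gemma-4b-test .
docker run --rm -p 11434:11434 ollama-gemma-4b-test
Then point the same curl test from earlier at http://localhost:11434.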
Build & Push the Image to Artifact Registry:
gcloud artifacts repositories create ollama-sidecar-gemma-repo --repository-format=docker \
--location=us-central1 --description="Serverless Ollama + OpenWebUI Gemma demo" \
--project=[your gcp project id]
gcloud builds submit \
--tag us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/ollama-gemma-4b \
--machine-type e2-highcpu-32
Create the OpenWebUI Image for Cloud Run
Pull the image from the GitHub Container Registry (ghcr.io). Make sure you specify the platform as linux/amd64, since Cloud Run requires an x86-based image.
docker pull --platform linux/amd64 ghcr.io/open-webui/open-webui:main
Configure Docker to use your GCP credentials so you can push the image to Artifact Registry:
gcloud auth configure-docker us-central1-docker.pkg.dev
Tag and push the image to Artifact Registry:
docker tag ghcr.io/open-webui/open-webui:main us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
docker push us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
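You can confirm both images landed in the repository:
gcloud artifacts docker images list us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo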
Deploy the Cloud Run Service
Create a service.yaml file to define the service. We'll deploy two containers on the same Cloud Run service: one for OpenWebUI and another for the Ollama / Gemma3 LLM. Review the service.yaml file here. It's currently configured to disable GPU zonal redundancy and allow a maximum of 5 concurrent requests per Cloud Run instance.
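If you want a sense of what that file contains before opening it, here is a minimal sketch of a two-container Cloud Run service with an L4 GPU and a GCS volume. Treat the names, image paths, and resource sizes as illustrative assumptions and use the linked service.yaml as the source of truth:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-gemma-demo
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        # Cap instances (GPU quota is limited)
        autoscaling.knative.dev/maxScale: '1'
        # Instance-based billing (CPU always allocated)
        run.googleapis.com/cpu-throttling: 'false'
        # Disable GPU zonal redundancy
        run.googleapis.com/gpu-zonal-redundancy-disabled: 'true'
        # Start the Ollama sidecar before OpenWebUI
        run.googleapis.com/container-dependencies: '{"openwebui": ["ollama-gemma-4b"]}'
    spec:
      # Max of 5 concurrent requests per instance
      containerConcurrency: 5
      containers:
      # Ingress container: OpenWebUI serves the chat UI on port 8080
      - name: openwebui
        image: us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
        ports:
        - containerPort: 8080
        env:
        # Point OpenWebUI at the Ollama sidecar over localhost
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        resources:
          limits:
            cpu: '2'
            memory: 4Gi
      # Sidecar: Ollama serving Gemma3 4B on the L4 GPU
      - name: ollama-gemma-4b
        image: us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/ollama-gemma-4b
        resources:
          limits:
            cpu: '8'
            memory: 32Gi
            nvidia.com/gpu: '1'
        volumeMounts:
        # Mount the GCS bucket with the model weights at /models
        - name: gemma-models
          mountPath: /models
      volumes:
      - name: gemma-models
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: [your gcs bucket]
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4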
gcloud run services replace service.yaml
Deployment will take 5–8 minutes and result in a service URL you can use to test it out.
https://ollama-sidecar-gemma-demo-[project id].us-central1.run.app
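You can also fetch the URL from the CLI:
gcloud run services describe ollama-sidecar-gemma-demo --region us-central1 --format 'value(status.url)'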
Open the service URL in your browser and you should see the OpenWebUI chat interface. Try asking it a few different prompts. Start with a text-based prompt like:
Explain a SQL Database. Respond as if you were talking to a child. Keep it concise.
Source Credit: https://medium.com/google-cloud/cloud-run-gpu-serverless-llms-d569ffc4af50
