Before we begin, here are a few things that Cloud Run requires when you enable NVIDIA L4 GPUs (24 GB):
- Minimum configuration is 4 vCPUs and 16 GiB of memory
- Instance-based billing is required (instead of request-based billing)
Gemma3 comes in four sizes (1B, 4B, 12B, 27B). We will be using the 4B model, but you should be able to deploy the 12B model as well.
Here’s what we’ll do at a high level:
- Use Ollama to serve the Gemma3:4B model on Cloud Run
- Store the model on a GCS Bucket and mount it as a Volume to Cloud Run
- Use OpenWebUI to interact with the model and provide a UI for users
- Deploy and host it as a 100% serverless capability on Cloud Run
Enable the required APIs:
gcloud services enable run.googleapis.com \
cloudbuild.googleapis.com \
storage.googleapis.com \
artifactregistry.googleapis.com
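The commands that follow use placeholders like [your gcp project id]. If you haven't already, you can set your default project and Cloud Run region so you don't have to repeat them everywhere:
gcloud config set project [your gcp project id]
gcloud config set run/region us-central1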
Download the Gemma3 4B model
Create a directory to work in:
mkdir serverless-gemma-demo
cd serverless-gemma-demo
Download Ollama from their website, or install it from the command line:
curl -fsSL https://ollama.com/install.sh | sh
Pull the Gemma3 4B Model & test it (ollama serve blocks, so run it in a separate terminal or background it):
ollama serve &
ollama pull gemma3:4b
ollama run gemma3:4b
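If you'd rather test over HTTP, you can also hit Ollama's local API directly; the prompt below is just an illustrative example:
curl http://localhost:11434/api/generate -d '{"model": "gemma3:4b", "prompt": "Why is the sky blue?", "stream": false}'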
Push the model to a GCS Bucket
You can do this via the console or CLI:
gcloud storage buckets create gs://[your gcs bucket]
gsutil cp -r ~/.ollama/models gs://[your gcs bucket]
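Optionally, verify the upload; you should see the manifests and blobs directories from ~/.ollama/models in the bucket:
gcloud storage ls --recursive gs://[your gcs bucket]/models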
Create the LLM Image for Cloud Run (Ollama + Gemma3)
Create a file called Dockerfile. Its contents should be:
FROM ollama/ollama

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434
# Store model weight files in /models
ENV OLLAMA_MODELS /models
# Reduce logging verbosity
ENV OLLAMA_DEBUG false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1
# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL
# Start Ollama
ENTRYPOINT ["ollama", "serve"]
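Optionally, before pushing to Artifact Registry, you can build and run the image locally as a smoke test (the local tag is just an example; stop the earlier ollama serve first so port 11434 is free, and note that without a GPU the model will run on CPU and be slow):
docker build -t ollama-gemma-4b-test .
docker run --rm -p 11434:11434 ollama-gemma-4b-test
Then point the same curl test from earlier at http://localhost:11434.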
Build & Push the Image to Artifact Registry:
gcloud artifacts repositories create ollama-sidecar-gemma-repo --repository-format=docker \
--location=us-central1 --description="Serverless Ollama + OpenWebUI Gemma demo" \
--project=[your gcp project id]
gcloud builds submit \
--tag us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/ollama-gemma-4b \
--machine-type e2-highcpu-32
Create the OpenWebUI Image for Cloud Run
Pull the image from the GitHub Container Registry (ghcr.io). Make sure you specify the platform as linux/amd64, since Cloud Run requires an x86-based image.
docker pull --platform linux/amd64 ghcr.io/open-webui/open-webui:main
Configure Docker to use your GCP credentials so you can push the image to Artifact Registry:
gcloud auth configure-docker us-central1-docker.pkg.dev
Tag and push the image to Artifact Registry:
docker tag ghcr.io/open-webui/open-webui:main us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
docker push us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
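You can confirm both images landed in the repository:
gcloud artifacts docker images list us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo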
Deploy the Cloud Run Service
Create a service.yaml file to define the service. We'll deploy two containers on the same Cloud Run service: one for OpenWebUI and another for the Ollama / Gemma3 LLM. Review the service.yaml file here. It's currently configured to disable GPU zonal redundancy and allow a maximum of 5 concurrent requests per Cloud Run instance.
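If you want a sense of what that file contains before opening it, here is a minimal sketch of a two-container Cloud Run service with an L4 GPU and a GCS volume. Treat the names, image paths, and resource sizes as illustrative assumptions and use the linked service.yaml as the source of truth:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-gemma-demo
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        # Cap instances (GPU quota is limited)
        autoscaling.knative.dev/maxScale: '1'
        # Instance-based billing (CPU always allocated)
        run.googleapis.com/cpu-throttling: 'false'
        # Disable GPU zonal redundancy
        run.googleapis.com/gpu-zonal-redundancy-disabled: 'true'
        # Start the Ollama sidecar before OpenWebUI
        run.googleapis.com/container-dependencies: '{"openwebui": ["ollama-gemma-4b"]}'
    spec:
      # Max of 5 concurrent requests per instance
      containerConcurrency: 5
      containers:
      # Ingress container: OpenWebUI serves the chat UI on port 8080
      - name: openwebui
        image: us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/openwebui
        ports:
        - containerPort: 8080
        env:
        # Point OpenWebUI at the Ollama sidecar over localhost
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        resources:
          limits:
            cpu: '2'
            memory: 4Gi
      # Sidecar: Ollama serving Gemma3 4B on the L4 GPU
      - name: ollama-gemma-4b
        image: us-central1-docker.pkg.dev/[your gcp project id]/ollama-sidecar-gemma-repo/ollama-gemma-4b
        resources:
          limits:
            cpu: '8'
            memory: 32Gi
            nvidia.com/gpu: '1'
        volumeMounts:
        # Mount the GCS bucket with the model weights at /models
        - name: gemma-models
          mountPath: /models
      volumes:
      - name: gemma-models
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: [your gcs bucket]
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4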
gcloud run services replace service.yaml
Deployment will take 5–8 minutes and result in a service URL you can use to test it out.
https://ollama-sidecar-gemma-demo-[project id].us-central1.run.app
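You can also fetch the URL from the CLI:
gcloud run services describe ollama-sidecar-gemma-demo --region us-central1 --format 'value(status.url)'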
Open the service URL in your browser and you should see the OpenWebUI chat interface. Try asking it a few different prompts. Start with a text-based prompt like:
Explain a SQL Database. Respond as if you were talking to a child. Keep it concise.
Source Credit: https://medium.com/google-cloud/cloud-run-gpu-serverless-llms-d569ffc4af50
