Today, we are releasing Gemma 4 on Google Cloud.
What’s new: It is, byte for byte, the most capable family of open models. Built from the same research as Gemini 3 and released under a commercially permissive Apache 2.0 license, these models move beyond chat. With context windows up to 256K, native vision and audio processing, and fluency in over 140 languages, they excel at complex logic, offline code generation, and agentic workflows. Learn more about the model here.
Why it matters for your business: Enterprise AI requires models that execute complex logic while keeping data within secure boundaries. Gemma 4 gives you this balance. Organizations can deploy these models across Google Cloud, including our Sovereign Cloud solutions, to meet strict compliance requirements. This provides a foundation for digital sovereignty, granting teams complete control over their data, infrastructure, and models.
Where you can get started with Gemma 4
Vertex AI
Deploy Gemma 4 to your own Vertex AI endpoints. Select the model from Model Garden and provision the specific compute resources your application requires. This self-deployment model gives you direct control over your serving infrastructure and costs while keeping your data within your Google Cloud environment.
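Once a Gemma 4 endpoint is deployed from Model Garden, you call it over REST. The exact request shape depends on the serving container you select; the sketch below assumes the common vLLM container's OpenAI-compatible chat route, and the project, region, and endpoint ID are placeholders to substitute with your own.

```python
import json

# Placeholder values -- substitute your own project, region, and the
# endpoint ID you receive after deploying Gemma 4 from Model Garden.
PROJECT = "my-project"
REGION = "us-central1"
ENDPOINT_ID = "1234567890"

def build_chat_request(prompt: str, max_tokens: int = 256):
    """Builds the URL and OpenAI-style JSON body used by vLLM-backed
    Vertex AI endpoints (a common serving container for open models)."""
    url = (
        f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
        f"/locations/{REGION}/endpoints/{ENDPOINT_ID}/chat/completions"
    )
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })
    return url, body

url, body = build_chat_request("Summarize our Q3 incident report.")
print(url)
print(body)
```

Send the body with any HTTP client, attaching an OAuth bearer token from your Google Cloud credentials.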
You can also fine-tune Gemma 4 using Vertex AI Training Clusters (VTC), which offer optimized SFT recipes and high-scale resiliency through NVIDIA NeMo Megatron. This ensures you can efficiently adapt any variant, from the effective 2B (E2B) model for edge tasks to the 31B dense model for complex enterprise orchestration. Here’s an end-to-end guide for efficient fine-tuning and serving of the Gemma 4 31B model on Vertex AI.
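Whatever recipe you use, the first step of supervised fine-tuning is data preparation. A minimal sketch: most SFT pipelines, including NeMo-style ones, consume JSONL prompt/response records, though the exact field names vary by recipe; the "input"/"output" keys below are placeholders, not a documented schema.

```python
import json

def to_sft_record(prompt: str, response: str) -> str:
    """Serialize one supervised example as a JSONL line.
    Field names are illustrative -- match them to your SFT recipe."""
    return json.dumps({"input": prompt, "output": response})

examples = [
    ("Classify the ticket: 'refund not received'", "billing_issue"),
    ("Classify the ticket: 'app crashes on login'", "technical_issue"),
]
jsonl = "\n".join(to_sft_record(p, r) for p, r in examples)
print(jsonl)
```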
Additionally, we’re committed to empowering customer choice and innovation through our curated collection of first-party, open, and third-party models available on Vertex AI. That’s why we’re thrilled to announce that the Gemma 4 26B MoE model will be available as a fully managed, serverless offering on Model Garden over the coming days.
Agent Development Kit (ADK)
ADK is a flexible, modular open-source framework for developing and deploying AI agents. Gemma 4 offers advanced agentic capabilities, including reasoning, function calling, code generation, and structured output, and ADK helps you turn them into fully functional agents. Start building AI agents with Gemma 4 and Google ADK today.
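In ADK, a tool is just a Python function whose signature and docstring become the function declaration the model sees. The function below is a stub for illustration; the agent registration is sketched in comments with names you should verify against the current google-adk release.

```python
# A plain Python function is all you need to define a tool for an agent.
def get_order_status(order_id: str) -> dict:
    """Look up the fulfillment status of an order by its ID."""
    fake_db = {"A-1001": "shipped", "A-1002": "processing"}  # stand-in for a real backend
    return {"order_id": order_id, "status": fake_db.get(order_id, "unknown")}

# Registering it with an ADK agent (hedged sketch -- check the current
# google-adk documentation for exact class and parameter names):
#
# from google.adk.agents import Agent
# agent = Agent(
#     model="gemma-4-31b-it",          # e.g. served from your Vertex AI endpoint
#     name="order_agent",
#     instruction="Answer order questions using the tool.",
#     tools=[get_order_status],
# )

print(get_order_status("A-1001"))
```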
Cloud Run
You can now run demanding Gemma 4 inference workloads efficiently on Cloud Run, leveraging the power of NVIDIA RTX PRO 6000 (Blackwell) GPUs. With 96GB of vGPU memory, you can easily deploy models like Gemma-4-31B-it on serverless GPUs.
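A quick back-of-envelope check shows why the 96GB card comfortably fits the 31B model; this is a rough weights-only estimate that ignores quantization options and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Back-of-envelope weight footprint: parameter count times bytes
    per parameter (2 for bfloat16, 1 for int8, 0.5 for 4-bit)."""
    return params_billions * bytes_per_param

GPU_MEMORY_GB = 96  # NVIDIA RTX PRO 6000 (Blackwell) on Cloud Run

weights = weight_memory_gb(31, 2.0)   # Gemma-4-31B-it in bf16: ~62 GB
headroom = GPU_MEMORY_GB - weights    # ~34 GB left for KV cache and runtime
print(f"bf16 weights: ~{weights:.0f} GB, headroom: ~{headroom:.0f} GB")
```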
Cloud Run handles the underlying infrastructure, allowing you to focus on your applications. Your models scale to zero when inactive and dynamically adjust with demand, ensuring optimized costs as you only pay for what you use. Plus, you have the flexibility to tailor CPU and memory configurations for each inference workload. Try it out now, on demand with no reservations, in us-central1 or europe-west4.
Google Kubernetes Engine (GKE)
GKE provides a highly scalable and customizable environment for deploying Gemma 4, perfect for teams that require fine-grained control over their AI infrastructure. By managing your own infrastructure on GKE, you gain the flexibility to tailor compute resources, select specific GPU or TPU accelerators, and implement custom autoscaling metrics that match your exact traffic patterns. This level of control also ensures your AI workloads can seamlessly integrate with your existing microservices while adhering to your organization’s strict security and data compliance requirements.
Starting today, you can efficiently serve Gemma 4 models on GKE using vLLM, a high-throughput and memory-efficient LLM serving engine. By leveraging GKE, you can seamlessly scale your inference workloads from zero to peak demand while optimizing your resource utilization and costs. To help you get started, check out our newly updated tutorial on how to serve Gemma 4 on GKE.
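Once the tutorial's vLLM deployment is running, in-cluster clients reach it over the Service's DNS name. The hostname, port, and model ID below are assumptions to align with your own manifests; the request body targets vLLM's OpenAI-compatible REST API.

```python
import json

# Assumed Service name/namespace -- match the manifests you deploy.
VLLM_URL = "http://vllm-gemma.default.svc.cluster.local:8000/v1/completions"

def build_completion_request(prompt: str, temperature: float = 0.2) -> str:
    """Builds the JSON body for vLLM's OpenAI-compatible
    /v1/completions route."""
    return json.dumps({
        "model": "google/gemma-4-31b-it",  # as the model is registered with vLLM
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": temperature,
    })

print(VLLM_URL)
print(build_completion_request("def fibonacci(n):"))
```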
Looking ahead, Gemma 4 is uniquely positioned to power the next generation of agentic applications on Google Cloud. By pairing Gemma 4’s multi-step planning capabilities with the new GKE Agent Sandbox, developers can safely execute LLM-generated code and tool calls within highly isolated, Kubernetes-native environments that offer sub-second cold starts and support up to 300 sandboxes per second. Furthermore, by leveraging the GKE Inference Gateway and advanced distributed inference features in llm-d, such as predicted-latency-based scheduling, these complex workflows benefit from intelligent routing that dynamically balances cache reuse and server load. GKE Inference Gateway with Predictive Latency Boost can cut time-to-first-token (TTFT) latency by up to 70% by replacing heuristic guesswork with real-time, capacity-aware routing, with no manual tuning required.
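To make the routing idea concrete, here is a toy sketch of predicted-latency-based scheduling. The cost model is invented purely for illustration; llm-d's actual scheduler uses far richer signals.

```python
def pick_replica(replicas):
    """Choose the replica with the lowest predicted time-to-first-token,
    trading off prefix-cache reuse against current load. The cost model
    below is a deliberately tiny stand-in, not llm-d's real logic."""
    def predicted_ttft_ms(r):
        cache_discount = 0.5 if r["prefix_cached"] else 1.0  # cache hits skip prefill work
        return r["queue_depth"] * r["avg_step_ms"] * cache_discount
    return min(replicas, key=predicted_ttft_ms)

replicas = [
    {"name": "a", "queue_depth": 4, "avg_step_ms": 30, "prefix_cached": False},  # 120 ms
    {"name": "b", "queue_depth": 6, "avg_step_ms": 30, "prefix_cached": True},   # 90 ms
]
print(pick_replica(replicas)["name"])  # -> b: cache reuse outweighs the deeper queue
```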
Google Cloud TPUs
Gemma 4 will be available on TPUs across Google Cloud through GKE, GCE, and Vertex AI. Starting today, you can use a number of popular open-source TPU projects to serve, pretrain, and post-train Gemma-4-31B dense and Gemma-4-26B-A4B MoE.
- For pretraining and post-training experimentation, you can leverage MaxText to customize the models for text analysis and generation, reasoning, and image analysis use cases.
- For online serving and batch inference, you’ll be able to use vLLM TPU for your production workloads, using our prebuilt Docker containers and quickstart vision and text demo tutorials.
Stay tuned for community-contributed SGLang-JAX tutorials.
Sovereign Cloud
Gemma 4 will be available across all our Sovereign Cloud offerings, including public cloud with Data Boundary, Google Cloud Dedicated (such as S3NS in France), and Google Distributed Cloud for air-gapped and on-premises deployments. This expansion reinforces our commitment to an open, sovereign digital world where organizations maintain total control over their data, encryption, and operational environment.
By providing open weights, Gemma 4 empowers developers to build specialized solutions for highly sensitive environments. Enterprise and government agencies can now deploy localized services that respect regional nuances and domain expertise while meeting strict data residency and sovereignty rules. This approach ensures that organizations can innovate rapidly with AI while remaining fully compliant with national and industry requirements.
Get started today
From Vertex AI to Sovereign Cloud, you can start building with Gemma 4 today. By choosing Gemma 4 on Google Cloud, enterprises and sovereign organizations gain a trusted, transparent foundation that delivers state-of-the-art capabilities while meeting the highest standards for security and reliability.
Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/gemma-4-available-on-google-cloud/
