Real-Time AI Speech with Serverless GPUs
TL;DR: Learn how to deploy Mistral AI’s Voxtral Mini model on Google Cloud Run with NVIDIA L4 GPUs for real-time speech and text inference that scales to zero when inactive.
Introduction
The AI landscape is evolving rapidly, and one of the most exciting developments is the emergence of multimodal models that can process both text and audio in real time. Enter Voxtral Mini Realtime, Mistral AI’s latest innovation that brings real-time speech capabilities to a compact 4-billion-parameter model.
But here’s the challenge: how do you deploy such a model cost-effectively while maintaining the performance needed for real-time interactions? The answer lies in serverless GPUs on Google Cloud Run.
In this guide, I’ll show you how to deploy Voxtral Mini on Cloud Run with full GPU acceleration, achieving a serverless architecture that:
- Scales to zero when not in use (no idle costs)
- Spins up in seconds when requests arrive
- Supports WebSocket for real-time audio streaming
- Exposes OpenAI-compatible API for easy integration
What is Voxtral Mini?
Voxtral Mini is Mistral AI’s first real-time speech model, released in February 2026. Here’s what makes it special:
Key Features
- Real-Time Speech Understanding: Native audio processing without transcription bottlenecks
- Bidirectional Streaming: Simultaneous audio input and output via WebSocket
- Compact Size: 4B parameters — small enough for efficient inference
- Multimodal: Handles text, audio, or both simultaneously
- Low Latency: Optimized for real-time conversational AI
Technical Specifications
- Architecture: Transformer-based with audio tokenization
- Context Length: Up to 45,000 tokens
- Audio Support: 16 kHz PCM, streamed in chunks (see the conversion example after this list)
- Inference Engine: vLLM for optimized GPU utilization
- API: OpenAI-compatible endpoints + custom /v1/realtime WebSocket
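If your source audio isn’t already in that format, a generic conversion with ffmpeg looks like this (mono, 16-bit signed little-endian samples are my assumptions here; the authoritative chunk format is whatever the repo’s examples/ produce):
# Convert arbitrary audio to raw 16 kHz, mono, 16-bit little-endian PCM
ffmpeg -i input.wav -ar 16000 -ac 1 -f s16le -acodec pcm_s16le chunk.pcm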
Why Cloud Run with GPUs?
Google Cloud Run recently added GPU support (preview), making it possible to run ML models serverless-style with GPU acceleration. This is a game-changer for AI deployments.
NVIDIA L4 GPU on Cloud Run
The NVIDIA L4 is the GPU type Cloud Run currently supports:
- VRAM: 24 GB GDDR6
- Architecture: Ada Lovelace (Ampere successor)
- Performance: ~120 TFLOPs FP16
- Power: 72W (energy efficient)
- Perfect for inference workloads like Voxtral Mini!
When to Use Cloud Run GPU
✅ Great for:
- Intermittent inference (APIs, batch jobs)
- Development and testing
- Proof-of-concepts
- Cost-sensitive deployments
- Traffic with natural spikes/valleys
❌ Not ideal for:
- 24/7 always-on services (better served by GKE Autopilot)
- Ultra-low-latency requirements (<10ms)
- Custom GPU configurations
- Multi-GPU setups (Cloud Run supports 1 GPU per instance)
Deployment Guide
Let’s deploy Voxtral Mini step by step.
Prerequisites
Before starting, ensure you have:
- Google Cloud account with billing enabled
- gcloud CLI installed: curl https://sdk.cloud.google.com | bash
- Hugging Face token with access to the model:
  - Get a token at: https://huggingface.co/settings/tokens
  - Request access: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
Step 1: Clone the Repository
git clone https://github.com/miticojo/voxtral-cloudrun.git
cd voxtral-cloudrun
The project includes:
- Dockerfile — vLLM container with Voxtral Mini
- cloudbuild.yaml — Cloud Build configuration
- scripts/deploy.sh — Automated deployment script
- examples/ — Usage examples (curl, Python, WebSocket)
Step 2: Configure Google Cloud
Authenticate and set your project:
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
Replace YOUR_PROJECT_ID with your GCP project ID.
Store your HuggingFace token in Secret Manager (secure credential storage):
echo -n "YOUR_HF_TOKEN" | gcloud secrets create hf-token --data-file=-
Step 3: Deploy with One Command
Run the deployment script:
./scripts/deploy.sh
The script automatically:
- Enables required GCP APIs (Cloud Run, Artifact Registry, Secret Manager)
- Creates Docker repository in Artifact Registry
- Builds container with Cloud Build (~10–15 minutes)
- Pre-downloads Voxtral Mini model at build time
- Deploys to Cloud Run with L4 GPU (roughly the command sketched below)
- Configures health checks and autoscaling
Build time explained: The long build time is due to downloading the 4B parameter model and its dependencies. This happens once per build, and the model is cached in the container image.
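For reference, the deploy step in the list above boils down to a single gcloud run deploy call with GPU flags. A rough sketch, not the script’s exact invocation (the Artifact Registry repository name, resource sizes, and the HF_TOKEN variable name are illustrative; the authoritative flags live in scripts/deploy.sh):
# GPU on Cloud Run was in preview at the time, hence the beta track
gcloud beta run deploy voxtral-mini \
  --image=europe-west1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/voxtral-mini:latest \
  --region=europe-west1 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=8 --memory=32Gi \
  --no-cpu-throttling \
  --min-instances=0 --max-instances=1 \
  --concurrency=4 \
  --timeout=3600 \
  --set-secrets=HF_TOKEN=hf-token:latest \
  --allow-unauthenticated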
Step 4: Verify Deployment
Once deployed, get your service URL:
SERVICE_URL=$(gcloud run services describe voxtral-mini \
--region=europe-west1 \
--format='value(status.url)')
echo "Service URL: $SERVICE_URL"
Test the health endpoint:
curl $SERVICE_URL/health
# Expected: {"status": "ok"}
List available models:
curl $SERVICE_URL/v1/models
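The response uses the OpenAI list format, so the served model name can be pulled out with jq (assuming jq is installed):
curl -s $SERVICE_URL/v1/models | jq -r '.data[].id'
# Should print something like: mistralai/Voxtral-Mini-4B-Realtime-2602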
Using the API
Voxtral Mini exposes an OpenAI-compatible API, making it easy to integrate with existing tools.
Text Completion
curl https://YOUR_SERVICE_URL/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
"prompt": "Explain serverless computing in simple terms:",
"max_tokens": 100,
"temperature": 0.7
}'
Chat Completion (Streaming)
curl https://YOUR_SERVICE_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the benefits of Cloud Run?"}
],
"stream": true
}'
Python Client
from openai import OpenAI
client = OpenAI(
base_url="https://YOUR_SERVICE_URL/v1",
api_key="not-needed" # vLLM doesn't require API key
)
response = client.chat.completions.create(
model="mistralai/Voxtral-Mini-4B-Realtime-2602",
messages=[
{"role": "user", "content": "Write a haiku about cloud computing"}
]
)
print(response.choices[0].message.content)
Real-Time Audio (WebSocket)
The real power of Voxtral Mini is the /v1/realtime WebSocket endpoint:
const ws = new WebSocket('wss://YOUR_SERVICE_URL/v1/realtime');
ws.onopen = () => {
console.log('Connected!');
// Send audio chunks as base64-encoded PCM
ws.send(JSON.stringify({
type: 'audio',
data: audioBase64
}));
};
ws.onmessage = (event) => {
const response = JSON.parse(event.data);
console.log('Received:', response);
};
Check examples/test-realtime.html in the repository for a complete browser-based demo.
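To produce the audioBase64 payload from the shell, base64-encode a raw PCM chunk (prepared as in the ffmpeg example earlier). The exact message schema is defined by the server in this repo, so treat this as a sketch; websocat is an optional third-party CLI, handy for a quick connectivity check:
# Base64-encode a 16 kHz PCM chunk for the "data" field (GNU coreutils; on macOS: base64 -i chunk.pcm)
AUDIO_BASE64=$(base64 -w0 chunk.pcm)

# Optional: verify the realtime endpoint accepts WebSocket connections
websocat "wss://YOUR_SERVICE_URL/v1/realtime"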
Cost Optimization Tips
- Set min-instances: 0 for true serverless (default in our deployment)
- Use max-instances to cap costs: --max-instances 2
- Adjust concurrency to handle more requests per instance: --concurrency 4 (see the example after this list)
- Monitor cold starts and balance cost vs. latency
- Use Cloud Monitoring to track GPU utilization
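A minimal example applying those scaling knobs to the deployed service:
gcloud run services update voxtral-mini \
  --region=europe-west1 \
  --min-instances=0 \
  --max-instances=2 \
  --concurrency=4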
Production Best Practices
1. Enable IAM Authentication
For production, disable public access:
gcloud run services update voxtral-mini \
--no-allow-unauthenticated \
--region=europe-west1
Call with identity token:
TOKEN=$(gcloud auth print-identity-token)
curl -H "Authorization: Bearer $TOKEN" \
https://YOUR_SERVICE_URL/v1/models
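Then grant roles/run.invoker to the identities that should be allowed to call the service; the service account below is just a placeholder:
gcloud run services add-iam-policy-binding voxtral-mini \
  --region=europe-west1 \
  --member="serviceAccount:caller@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker"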
2. Set Up Monitoring
Enable structured logging:
import json, logging

logging.basicConfig(level=logging.INFO, format='%(message)s')
# Cloud Logging parses JSON lines on stdout/stderr as structured entries
logging.info(json.dumps({"message": "Request processed", "duration_ms": 123}))
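Those JSON entries can then be queried from the CLI, for example:
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="voxtral-mini"' \
  --limit=20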
Create alerts in Cloud Console:
- High error rate (>5%)
- P95 latency >2s
- GPU utilization <20% (underutilized)
3. Optimize Cold Starts
Reduce cold start time with:
- CPU boost: --cpu-boost (already enabled; see the command below)
- Larger machine: More CPU speeds up container start
- Minimal dependencies: Slim Docker images
- Health checks: Proper startup probes prevent premature traffic
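If you ever need to toggle the boost on an already-deployed service, the same flag is available on gcloud run services update:
gcloud run services update voxtral-mini \
  --region=europe-west1 \
  --cpu-boost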
4. Implement Rate Limiting
Add rate limiting middleware:
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")  # per client IP, via get_remote_address
async def chat_completion(request: Request):
    ...  # your completion logic here
Troubleshooting
Issue: CUDA out of memory
Symptom: Runtime error about GPU memory
Solution: Reduce --gpu-memory-utilization in the Dockerfile:
# Default is 0.90
CMD ["--gpu-memory-utilization", "0.85"]
Issue: Cold start timeout
Symptom: First request after idle fails
Solution: Increase startup probe timeout:
--startup-probe="tcpSocket.port=8000,failureThreshold=60"
Issue: GPU quota exceeded
Symptom: Deployment fails with quota error
Solution: Request L4 GPU quota increase: https://console.cloud.google.com/iam-admin/quotas
Filter for NVIDIA_L4_GPUS and request increase.
Conclusion
Deploying Voxtral Mini on Google Cloud Run with GPU acceleration combines the best of both worlds:
- Serverless convenience: No infrastructure management
- GPU performance: Real-time inference with NVIDIA L4
- Cost efficiency: Pay only for actual usage
- Production-ready: IAM, monitoring, CI/CD included
The full code is available on GitHub: https://github.com/miticojo/voxtral-cloudrun
Next Steps
- Try the deployment with your own project
- Experiment with the WebSocket API for real-time audio
- Integrate with your application using the OpenAI-compatible SDK
- Monitor costs and performance in Cloud Console
- Star the repo if you found this helpful! ⭐
Resources
- Voxtral Mini Model Card: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
- vLLM Documentation: https://docs.vllm.ai
- Cloud Run GPU Docs: https://cloud.google.com/run/docs/configuring/services/gpu
- GitHub Repository: https://github.com/miticojo/voxtral-cloudrun
Questions or feedback? Drop a comment below or open an issue on GitHub!
Found this useful? Follow me for more AI deployment guides and cloud architecture content.
Happy STT! 🚀
