Real-Time AI Speech with Serverless GPUs
TL;DR: Learn how to deploy Mistral AI’s Voxtral Mini model on Google Cloud Run with NVIDIA L4 GPUs for real-time speech and text inference that scales to zero when inactive.
Introduction
The AI landscape is evolving rapidly, and one of the most exciting developments is the emergence of multimodal models that can process both text and audio in real time. Enter Voxtral Mini Realtime, Mistral AI’s latest innovation that brings real-time speech capabilities to a compact 4-billion-parameter model.
But here’s the challenge: how do you deploy such a model cost-effectively while maintaining the performance needed for real-time interactions? The answer lies in serverless GPUs on Google Cloud Run.
In this guide, I’ll show you how to deploy Voxtral Mini on Cloud Run with full GPU acceleration, achieving a serverless architecture that:
- Scales to zero when not in use (no idle costs)
- Spins up in seconds when requests arrive
- Supports WebSocket for real-time audio streaming
- Exposes OpenAI-compatible API for easy integration
What is Voxtral Mini?
Voxtral Mini is Mistral AI’s first real-time speech model, released in February 2026. Here’s what makes it special:
Key Features
- Real-Time Speech Understanding: Native audio processing without transcription bottlenecks
- Bidirectional Streaming: Simultaneous audio input and output via WebSocket
- Compact Size: 4B parameters — small enough for efficient inference
- Multimodal: Handles text, audio, or both simultaneously
- Low Latency: Optimized for real-time conversational AI
Technical Specifications
- Architecture: Transformer-based with audio tokenization
- Context Length: Up to 45,000 tokens
- Audio Support: 16 kHz PCM, streamed in chunks (see the conversion example after this list)
- Inference Engine: vLLM for optimized GPU utilization
- API: OpenAI-compatible endpoints + custom /v1/realtime WebSocket
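If your source audio isn’t already in that format, a generic conversion with ffmpeg looks like this (mono, 16-bit signed little-endian samples are my assumptions here; the authoritative chunk format is whatever the repo’s examples/ produce):
# Convert arbitrary audio to raw 16 kHz, mono, 16-bit little-endian PCM
ffmpeg -i input.wav -ar 16000 -ac 1 -f s16le -acodec pcm_s16le chunk.pcm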
Why Cloud Run with GPUs?
Google Cloud Run recently added GPU support (preview), making it possible to run ML models serverless-style with GPU acceleration. This is a game-changer for AI deployments.
NVIDIA L4 GPU on Cloud Run
The NVIDIA L4 is the GPU type Cloud Run currently supports:
- VRAM: 24 GB GDDR6
- Architecture: Ada Lovelace (Ampere successor)
- Performance: ~120 TFLOPs FP16
- Power: 72W (energy efficient)
- Perfect for inference workloads like Voxtral Mini!
When to Use Cloud Run GPU
✅ Great for:
- Intermittent inference (APIs, batch jobs)
- Development and testing
- Proof-of-concepts
- Cost-sensitive deployments
- Traffic with natural spikes/valleys
❌ Not ideal for:
- 24/7 always-on services (better served by GKE Autopilot)
- Ultra-low-latency requirements (<10ms)
- Custom GPU configurations
- Multi-GPU setups (Cloud Run supports 1 GPU per instance)
Deployment Guide
Let’s deploy Voxtral Mini step by step.
Prerequisites
Before starting, ensure you have:
- Google Cloud account with billing enabled
- gcloud CLI installed: curl https://sdk.cloud.google.com | bash
- Hugging Face token with access to the model:
  - Get a token at: https://huggingface.co/settings/tokens
  - Request access: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
Step 1: Clone the Repository
git clone https://github.com/miticojo/voxtral-cloudrun.git
cd voxtral-cloudrun
The project includes:
- Dockerfile — vLLM container with Voxtral Mini
- cloudbuild.yaml — Cloud Build configuration
- scripts/deploy.sh — Automated deployment script
- examples/ — Usage examples (curl, Python, WebSocket)
Step 2: Configure Google Cloud
Authenticate and set your project:
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
Replace YOUR_PROJECT_ID with your GCP project ID.
Store your HuggingFace token in Secret Manager (secure credential storage):
echo -n "YOUR_HF_TOKEN" | gcloud secrets create hf-token --data-file=-
Step 3: Deploy with One Command
Run the deployment script:
./scripts/deploy.sh
The script automatically:
- Enables required GCP APIs (Cloud Run, Artifact Registry, Secret Manager)
- Creates Docker repository in Artifact Registry
- Builds container with Cloud Build (~10–15 minutes)
- Pre-downloads Voxtral Mini model at build time
- Deploys to Cloud Run with L4 GPU (roughly the command sketched below)
- Configures health checks and autoscaling
Build time explained: The long build time is due to downloading the 4B parameter model and its dependencies. This happens once per build, and the model is cached in the container image.
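For reference, the deploy step in the list above boils down to a single gcloud run deploy call with GPU flags. A rough sketch, not the script’s exact invocation (the Artifact Registry repository name, resource sizes, and the HF_TOKEN variable name are illustrative; the authoritative flags live in scripts/deploy.sh):
# GPU on Cloud Run was in preview at the time, hence the beta track
gcloud beta run deploy voxtral-mini \
  --image=europe-west1-docker.pkg.dev/YOUR_PROJECT_ID/REPO_NAME/voxtral-mini:latest \
  --region=europe-west1 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=8 --memory=32Gi \
  --no-cpu-throttling \
  --min-instances=0 --max-instances=1 \
  --concurrency=4 \
  --timeout=3600 \
  --set-secrets=HF_TOKEN=hf-token:latest \
  --allow-unauthenticated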
Step 4: Verify Deployment
Once deployed, get your service URL:
SERVICE_URL=$(gcloud run services describe voxtral-mini \
--region=europe-west1 \
--format='value(status.url)')
echo "Service URL: $SERVICE_URL"
Test the health endpoint:
curl $SERVICE_URL/health
# Expected: {"status": "ok"}
List available models:
curl $SERVICE_URL/v1/models
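The response uses the OpenAI list format, so the served model name can be pulled out with jq (assuming jq is installed):
curl -s $SERVICE_URL/v1/models | jq -r '.data[].id'
# Should print something like: mistralai/Voxtral-Mini-4B-Realtime-2602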
Using the API
Voxtral Mini exposes an OpenAI-compatible API, making it easy to integrate with existing tools.
Text Completion
curl https://YOUR_SERVICE_URL/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
"prompt": "Explain serverless computing in simple terms:",
"max_tokens": 100,
"temperature": 0.7
}'
Chat Completion (Streaming)
curl https://YOUR_SERVICE_URL/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the benefits of Cloud Run?"}
],
"stream": true
}'
Python Client
from openai import OpenAI
client = OpenAI(
base_url="https://YOUR_SERVICE_URL/v1",
api_key="not-needed" # vLLM doesn't require API key
)
response = client.chat.completions.create(
model="mistralai/Voxtral-Mini-4B-Realtime-2602",
messages=[
{"role": "user", "content": "Write a haiku about cloud computing"}
]
)
print(response.choices[0].message.content)
Real-Time Audio (WebSocket)
The real power of Voxtral Mini is the /v1/realtime WebSocket endpoint:
const ws = new WebSocket('wss://YOUR_SERVICE_URL/v1/realtime');
ws.onopen = () => {
console.log('Connected!');
// Send audio chunks as base64-encoded PCM
ws.send(JSON.stringify({
type: 'audio',
data: audioBase64
}));
};
ws.onmessage = (event) => {
const response = JSON.parse(event.data);
console.log('Received:', response);
};
Check examples/test-realtime.html in the repository for a complete browser-based demo.
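To produce the audioBase64 payload from the shell, base64-encode a raw PCM chunk (prepared as in the ffmpeg example earlier). The exact message schema is defined by the server in this repo, so treat this as a sketch; websocat is an optional third-party CLI, handy for a quick connectivity check:
# Base64-encode a 16 kHz PCM chunk for the "data" field (GNU coreutils; on macOS: base64 -i chunk.pcm)
AUDIO_BASE64=$(base64 -w0 chunk.pcm)

# Optional: verify the realtime endpoint accepts WebSocket connections
websocat "wss://YOUR_SERVICE_URL/v1/realtime"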
Cost Optimization Tips
- Set min-instances: 0 for true serverless (default in our deployment)
- Use max-instances to cap costs: --max-instances 2
- Adjust concurrency to handle more requests per instance: --concurrency 4 (see the example after this list)
- Monitor cold starts and balance cost vs. latency
- Use Cloud Monitoring to track GPU utilization
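A minimal example applying those scaling knobs to the deployed service:
gcloud run services update voxtral-mini \
  --region=europe-west1 \
  --min-instances=0 \
  --max-instances=2 \
  --concurrency=4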
Production Best Practices
1. Enable IAM Authentication
For production, disable public access:
gcloud run services update voxtral-mini \
--no-allow-unauthenticated \
--region=europe-west1
Call with identity token:
TOKEN=$(gcloud auth print-identity-token)
curl -H "Authorization: Bearer $TOKEN" \
https://YOUR_SERVICE_URL/v1/models
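Then grant roles/run.invoker to the identities that should be allowed to call the service; the service account below is just a placeholder:
gcloud run services add-iam-policy-binding voxtral-mini \
  --region=europe-west1 \
  --member="serviceAccount:caller@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/run.invoker"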
2. Set Up Monitoring
Enable structured logging:
import json, logging

logging.basicConfig(level=logging.INFO, format='%(message)s')
# Cloud Logging parses JSON lines on stdout/stderr as structured entries
logging.info(json.dumps({"message": "Request processed", "duration_ms": 123}))
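Those JSON entries can then be queried from the CLI, for example:
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="voxtral-mini"' \
  --limit=20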
Create alerts in Cloud Console:
- High error rate (>5%)
- P95 latency >2s
- GPU utilization <20% (underutilized)
3. Optimize Cold Starts
Reduce cold start time with:
- CPU boost: --cpu-boost (already enabled; see the command below)
- Larger machine: More CPU speeds up container start
- Minimal dependencies: Slim Docker images
- Health checks: Proper startup probes prevent premature traffic
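If you ever need to toggle the boost on an already-deployed service, the same flag is available on gcloud run services update:
gcloud run services update voxtral-mini \
  --region=europe-west1 \
  --cpu-boost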
4. Implement Rate Limiting
Add rate limiting middleware:
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")  # per client IP, via get_remote_address
async def chat_completion(request: Request):
    ...  # your completion logic here
Troubleshooting
Issue: CUDA out of memory
Symptom: Runtime error about GPU memory
Solution: Reduce --gpu-memory-utilization in the Dockerfile:
# Default is 0.90
CMD ["--gpu-memory-utilization", "0.85"]
Issue: Cold start timeout
Symptom: First request after idle fails
Solution: Increase startup probe timeout:
--startup-probe="tcpSocket.port=8000,failureThreshold=60"
Issue: GPU quota exceeded
Symptom: Deployment fails with quota error
Solution: Request L4 GPU quota increase: https://console.cloud.google.com/iam-admin/quotas
Filter for NVIDIA_L4_GPUS and request increase.
Conclusion
Deploying Voxtral Mini on Google Cloud Run with GPU acceleration combines the best of both worlds:
- Serverless convenience: No infrastructure management
- GPU performance: Real-time inference with NVIDIA L4
- Cost efficiency: Pay only for actual usage
- Production-ready: IAM, monitoring, CI/CD included
The full code is available on GitHub: https://github.com/miticojo/voxtral-cloudrun
Next Steps
- Try the deployment with your own project
- Experiment with the WebSocket API for real-time audio
- Integrate with your application using the OpenAI-compatible SDK
- Monitor costs and performance in Cloud Console
- Star the repo if you found this helpful! ⭐
Resources
- Voxtral Mini Model Card: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
- vLLM Documentation: https://docs.vllm.ai
- Cloud Run GPU Docs: https://cloud.google.com/run/docs/configuring/services/gpu
- GitHub Repository: https://github.com/miticojo/voxtral-cloudrun
Questions or feedback? Drop a comment below or open an issue on GitHub!
Found this useful? Follow me for more AI deployment guides and cloud architecture content.
Happy STT! 🚀
