
You’ve built a brilliant LLM-powered application. It works perfectly on your laptop. Users love the demo. Then you deploy to production, and reality hits hard:
- API calls fail silently during high load
- Regional outages take down your entire service
- Rate limits are exceeded without warning
- Errors cascade through your system
- You have no visibility into what’s happening
Sound familiar? You’re not alone. The gap between a working prototype and a production-ready LLM application is vast, and it’s littered with the remains of promising projects that couldn’t handle real-world conditions.
The good news? We can bridge this gap by applying Site Reliability Engineering (SRE) principles that have kept massive distributed systems running for decades.
This article is not about promoting a specific library. Instead, I want to share the lessons I learned while building production-ready LLM applications and show you concrete patterns you can implement in your own codebase. I built a reference implementation (gemini-sre-client) to demonstrate these patterns in action, but the principles apply regardless of your tech stack or which LLM provider you use.
Feel free to use my code as inspiration, copy what works for you, or build something entirely different. The goal is to help you think about reliability from day one, not to convince you to use a particular library.
What You’ll Learn:
- The 5 critical SRE patterns for LLM applications
- Why each pattern matters (with real numbers)
- How to implement them (multiple approaches)
- When to use each pattern (and when not to)
- A pragmatic roadmap for adding reliability incrementally
Who This Is For:
- Developers moving LLM prototypes to production
- Engineering teams building LLM-powered services
- SREs supporting AI/ML workloads
- Technical leaders defining reliability standards
Let’s dive in.
The Problem: LLM Applications Are Distributed Systems in Disguise
When you call an LLM API, you’re not just making a simple HTTP request. You’re:
- Hitting an external dependency that you don’t control
- Depending on regional infrastructure that can fail
- Consuming a quota-limited resource that can throttle
- Waiting for variable latencies from milliseconds to minutes
- Parsing complex responses that can be malformed
Traditional error handling isn’t enough. You need the same reliability patterns that power Netflix, Google, and AWS.
The challenge is that most LLM SDKs give you the basics — authentication, request formatting, response parsing — but leave reliability as “an exercise for the reader.” This is fine for prototypes, but dangerous for production.
The Solution: Build Your Own Reliability Layer
Let’s explore five critical SRE patterns every production LLM application needs. I’ll show you the theory behind each pattern, why it matters, and concrete implementation approaches.
Throughout this article, I’ll reference my own implementation (gemini-sre-client) as a working example, but these patterns are universal — you can implement them in any language, with any LLM provider, in whatever way fits your architecture best.
1. Automatic Retry with Exponential Backoff
The Problem: Network hiccups and transient API errors cause requests to fail unnecessarily.
The Pattern: Automatically retry failed requests with increasing delays between attempts.
Why It Matters: In practice, the large majority of transient failures (commonly on the order of 70–80%) resolve within seconds. Without retries, these become user-facing errors. But naive retries can overwhelm failing services — exponential backoff gives systems time to recover.
Key Design Decisions:
When implementing retry logic, you need to decide:
- How many times to retry? (typically 3–5 attempts)
- Initial delay? (usually 1–2 seconds)
- Maximum delay? (cap at 16–32 seconds to avoid hanging)
- Backoff multiplier? (2x is standard: 1s, 2s, 4s, 8s, 16s)
- Which errors to retry? (only transient: 429, 503, timeouts, not 400/401)
- How to handle different error types? (a small classification helper follows this list)
- 429 (Rate Limit): Don’t retry the same region — switch regions immediately if available
- 503 (Service Unavailable): Retry with backoff; the service may recover quickly
- Timeout: Retry with backoff, but consider more aggressive timeout values
- 400/401/403: Don’t retry; these are client errors that won’t resolve
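To make the retry-eligibility decision concrete, here’s a small helper (an illustration, not part of any SDK) that classifies status codes and computes a jittered backoff delay:

import random

RETRYABLE_STATUS = {429, 503, 504}      # transient: rate limit, unavailable, gateway timeout
NON_RETRYABLE_STATUS = {400, 401, 403}  # client errors that a retry will never fix

def is_retryable(status_code: int) -> bool:
    """Only retry errors that are likely to succeed on a later attempt."""
    # Note: for 429 you may prefer to switch regions instead of retrying in place.
    return status_code in RETRYABLE_STATUS

def backoff_delay(attempt: int, initial: float = 1.0, multiplier: float = 2.0, max_delay: float = 16.0) -> float:
    """Exponential backoff (1s, 2s, 4s, 8s, 16s) with full jitter to avoid retry storms."""
    delay = min(initial * (multiplier ** attempt), max_delay)
    return random.uniform(0, delay)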
One Possible Implementation:
Here’s how I approached this in my reference implementation:
from gemini_sre import GeminiSREClient
from gemini_sre.core import RetryConfig

client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1"],
    retry_config=RetryConfig(
        max_attempts=5,      # Try up to 5 times
        initial_delay=1.0,   # Wait 1s after first failure
        max_delay=16.0,      # Cap at 16s between attempts
        multiplier=2.0,      # Double delay each time (1s, 2s, 4s, 8s, 16s)
    ),
)

# This request automatically retries on transient failures
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain quantum computing",
    request_id="quantum-001",  # For tracing
)
Alternative Approaches:
You could also:
- Use a library like tenacity or backoff in Python (a tenacity sketch follows below)
- Implement retry logic in middleware/decorators
- Use your API gateway’s built-in retry policies (e.g., Envoy, Kong)
- Build custom retry logic specific to your error handling strategy
The key is having some retry mechanism, not necessarily this exact implementation.
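For example, a decorator-based approach with tenacity might look roughly like this (a sketch; the exception class is a placeholder for whatever your client actually raises on 429/503/timeouts):

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class TransientAPIError(Exception):
    """Placeholder for your client's transient errors (429, 503, timeouts)."""

@retry(
    retry=retry_if_exception_type((TransientAPIError, TimeoutError)),
    stop=stop_after_attempt(5),                          # give up after 5 tries
    wait=wait_exponential(multiplier=1, min=1, max=16),  # exponential backoff, capped at 16s
    reraise=True,                                        # surface the final error to the caller
)
def generate_with_retry(client, prompt: str):
    # Replace with your provider's actual call; raise TransientAPIError on retryable failures.
    return client.generate(prompt)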
2. Multi-Region Failover
The Problem: Regional outages happen. AWS regions go down. Google Cloud regions experience degraded performance. Your single-region application becomes unavailable.
But there’s another critical issue: quota limits. Even when a region is perfectly healthy, you might hit your rate limit (e.g., 60 requests per minute). If all your traffic goes to one region, you’re artificially capping your throughput at that region’s quota — even though you have unused capacity in other regions.
The Pattern: Configure multiple regions and intelligently distribute load across them. Fail over when one becomes unavailable OR when you hit quota limits.
Why It Matters:
For availability: Most cloud providers publish regional availability around 99.5% per region. Using two independent regions, your theoretical availability jumps to 99.9975%. That’s the difference between 43 hours of downtime per year versus 13 minutes.
For capacity: If you have a quota of 60 requests/minute per region, using 3 regions gives you 180 requests/minute total capacity. Without multi-region support, you’re leaving 120 requests/minute on the table.
Key Design Decisions:
When implementing multi-region support, consider:
- Region selection — Choose geographically diverse regions (different continents if possible)
- Distribution strategy — Failover only? Round-robin load balancing? Quota-aware routing?
- Failover triggers — What conditions cause a switch? (timeouts, error rates, 429 rate limits, circuit breaker)
- Quota management — How to track and distribute capacity across regions?
- Data locality — Are there compliance concerns with certain regions?
- Cost implications — Different regions have different pricing
Common Multi-Region Strategies:
Strategy 1: Primary + Failover (Simplest)
- Send all traffic to primary region
- Only use secondary when primary fails or returns 429 (rate limit)
- Good for: Basic reliability, simple quota overflow
Strategy 2: Round-Robin (Better capacity utilization)
- Distribute requests evenly across all regions
- Maximizes total quota (3 regions = 3x capacity)
- Good for: High-throughput applications (a short round-robin sketch appears after this list)
Strategy 3: Quota-Aware Routing (Most sophisticated)
- Track quota usage per region in real-time
- Route to least-utilized region
- Dynamically adjust based on remaining capacity
- Good for: Applications with variable traffic patterns
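Strategy 2 can be as small as a cycling iterator over your configured regions. A minimal sketch (illustrative names, not an SDK API):

from itertools import cycle

class RoundRobinRouter:
    """Distribute requests evenly across regions to tap each region's quota."""

    def __init__(self, regions):
        self._regions = cycle(regions)

    def next_region(self) -> str:
        return next(self._regions)

router = RoundRobinRouter(["us-central1", "europe-west1", "asia-northeast1"])
region = router.next_region()  # us-central1, then europe-west1, then asia-northeast1, then wraps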
One Possible Implementation:
Here’s a simple approach with ordered fallback that handles both failures and rate limits:
client = GeminiSREClient(
    project_id="your-project",
    locations=[
        "us-central1",      # Primary: Iowa
        "europe-west1",     # Secondary: Belgium
        "asia-northeast1",  # Tertiary: Tokyo
    ],
    enable_circuit_breaker=True,  # Automatically skip failing regions
)

# The client automatically tries regions in order
# If us-central1 fails (500, timeout) -> tries europe-west1
# If us-central1 returns 429 (rate limit) -> tries europe-west1
# If europe-west1 fails -> tries asia-northeast1
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write me a poem about resilience",
    request_id="poem-001",
)
How Quota Management Works:
When you hit a 429 rate limit error, the system should:
- Immediately try the next region (don’t waste time retrying the same region)
- Optionally: temporarily reduce traffic to that region
- Periodically test if quota has reset
Example Scenario:
10:00:00 - us-central1 receives 60 requests (at quota limit)
10:00:01 - Request 61 returns 429 (rate limit exceeded)
10:00:01 - Client automatically switches to europe-west1
10:00:01 - Request succeeds on europe-west1
10:01:00 - Quota resets on us-central1, back to normal rotation
This effectively triples your capacity from 60 req/min to 180 req/min across three regions.
Quota Management Pro Tip:
Multi-region isn’t just for disaster recovery — it’s a capacity multiplier. Many teams implement multi-region primarily for quota management, not outages.
Common Quota Scenarios:
| Scenario | Single Region | Multi-Region (3) | Improvement |
|---|---|---|---|
| Rate limit | 60 req/min | 180 req/min | 3x capacity |
| Daily quota | 10K requests | 30K requests | 3x capacity |
| Concurrent requests | 10 parallel | 30 parallel | 3x throughput |
When to prioritize multi-region for quota:
- ✅ You’re hitting rate limits during peak traffic
- ✅ You have spiky/variable traffic patterns
- ✅ You’re growing and need more capacity
- ✅ Your current quota is too expensive to increase
- ❌ You have plenty of unused quota headroom
- ❌ Your traffic is very predictable and low
Real-World Story: A team I worked with was hitting their 60 req/min limit daily at peak hours. Instead of requesting a quota increase (expensive and slow), they implemented 3-region routing. Overnight, they had 180 req/min capacity with zero additional quota costs.
Alternative Approaches:
Depending on your architecture, you might:
For basic failover:
- Use a global load balancer (e.g., Cloud Load Balancing, AWS Global Accelerator)
- Implement client-side region selection with health checks
- Use DNS-based failover
- Build a proxy service that handles region routing
For quota management:
- Token bucket per region — Track quota usage and route to available regions
- Weighted round-robin — Distribute based on remaining quota percentage
- Least-utilized routing — Always pick the region with most available quota
- Time-based rotation — If quotas reset at fixed intervals, rotate strategically
For sophisticated routing:
- Use a service mesh with locality-aware load balancing
- Implement a quota-aware API gateway
- Build a central routing service that tracks regional capacity
- Use client-side libraries with shared quota state (e.g., Redis for tracking)
Real-World Example:
# Pseudo-code for quota-aware routing
def select_region(regions, quota_tracker):
    for region in regions:
        if quota_tracker.has_capacity(region):
            return region
    # If all regions are at quota, use least-full region
    return quota_tracker.get_least_utilized_region(regions)
Implementation Note: The trickiest part is often session state. If your LLM calls maintain context (like chat histories), you need a strategy for maintaining that state across regions — shared storage, session replication, or accepting that failover means losing context.
For quota management specifically, you also need to decide: do you track quota client-side (risk of overestimation) or server-side (requires centralized state)?
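As one concrete client-side option, the quota_tracker assumed in the pseudo-code above could be a fixed-window counter per region; this sketch deliberately accepts the overestimation risk just mentioned:

import time
from collections import defaultdict

class QuotaTracker:
    """Fixed-window, in-process request counter per region (a client-side estimate only)."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.window_start = defaultdict(float)
        self.counts = defaultdict(int)

    def _roll_window(self, region: str) -> None:
        now = time.monotonic()
        if now - self.window_start[region] >= 60:
            self.window_start[region] = now
            self.counts[region] = 0

    def has_capacity(self, region: str) -> bool:
        self._roll_window(region)
        return self.counts[region] < self.limit

    def record(self, region: str) -> None:
        self._roll_window(region)
        self.counts[region] += 1

    def get_least_utilized_region(self, regions) -> str:
        for region in regions:
            self._roll_window(region)
        return min(regions, key=lambda r: self.counts[r])

In a multi-process deployment, the same interface could be backed by shared counters in Redis instead.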
3. Circuit Breaker Pattern
The Problem: A failing region continues to waste time and quota on doomed requests. Your retry logic keeps hitting a brick wall.
The Pattern: Track failure rates per region. When a region exceeds a failure threshold, “open the circuit” and skip it temporarily. Test periodically to see if it has recovered.
Why It Matters: Without circuit breakers, you waste valuable seconds on every request trying regions that you know are down. With circuit breakers, you fail fast and preserve your rate limits for working regions.
Circuit Breaker State Machine:
- CLOSED: Normal operation, requests flow through
- OPEN: Too many failures, region is skipped automatically
- HALF_OPEN: Testing if region has recovered
Key Design Decisions:
- Failure threshold — How many failures before opening? (typically 3–10)
- Success threshold — How many successes to close again? (typically 1–3)
- Timeout duration — How long to wait before testing recovery? (30–120 seconds)
- Granularity — Circuit breaker per region? Per endpoint? Per model?
- Reset strategy — Automatic timeout vs. manual intervention?
Example Scenario:
10:00:00 AM - us-central1 starts failing
10:00:15 AM - After 5 failures, circuit opens for us-central1
10:00:16 AM - All requests now go directly to europe-west1 (no wasted timeouts)
10:01:15 AM - Circuit transitions to HALF_OPEN (60s timeout elapsed)
10:01:16 AM - Test request to us-central1 succeeds
10:01:16 AM - Circuit closes, us-central1 back in rotation
One Possible Implementation:
client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1", "europe-west1"],
    enable_circuit_breaker=True,
    circuit_breaker_config={
        "failure_threshold": 5,  # Open circuit after 5 failures
        "success_threshold": 2,  # Close after 2 successes
        "timeout": 60,           # Test recovery after 60 seconds
    },
)
Alternative Implementations:
You have several options:
- Library-based: Use libraries like pybreaker (Python), resilience4j (Java), or Polly (.NET)
- Service mesh: Envoy, Istio, and Linkerd have built-in circuit breakers
- API Gateway: Many gateways (Kong, Apigee) offer circuit breaker plugins
- Custom implementation: Build your own with simple counters and timers (a minimal sketch follows below)
- Cloud-native: Some cloud providers offer managed circuit breakers
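If you take the custom route, the state machine fits in a few dozen lines. A minimal per-region sketch (illustrative only, not the reference library’s internals):

import time

class CircuitBreaker:
    """Per-region circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, success_threshold: int = 2, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: let a test request through
                self.successes = 0
                return True
            return False
        return True  # CLOSED and HALF_OPEN both allow traffic

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()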
Real Impact: For an application making 100 requests/minute during a 5-minute regional outage, circuit breakers save roughly 500–1,000 seconds of cumulative timeout waiting (about 500 doomed requests, each spared 1–2 seconds of dead time), and those requests get served immediately by working regions instead.
4. Comprehensive Observability
The Problem: When things go wrong in production, you’re flying blind. No metrics, no logs, no idea what happened.
The Pattern: Emit structured metrics and logs for every operation. Track what matters.
Why It Matters: The SRE mantra is “you can’t improve what you don’t measure.” Without observability, you can’t:
- Detect issues before users complain
- Debug problems after they occur
- Optimize performance
- Prove reliability to stakeholders
- Make data-driven decisions
What to Measure:
At minimum, you need:
- Success/Failure Rates — Are requests succeeding? What’s the error rate?
- Latency Distribution — p50, p95, p99 latencies (not just averages!)
- Retry Counts — How often are you retrying? Which errors?
- Regional Health — Which regions are healthy? Circuit breaker states?
- Request Correlation — Can you trace a request end-to-end?
One Possible Implementation:
Here’s how I integrated with Cloud Monitoring and Logging:
client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1", "europe-west1"],
    enable_monitoring=True,  # Send metrics to Cloud Monitoring
    enable_logging=True,     # Send logs to Cloud Logging
)

# Every request automatically generates:
# - Success/failure metrics
# - Latency distributions (p50, p95, p99)
# - Retry count gauges
# - Circuit breaker state metrics
# - Structured logs with request IDs
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain observability",
    request_id="obs-001",  # Crucial for tracing
)
Sample Metrics to Track:
| Metric | Type | Use Case |
|---|---|---|
| llm_requests_total | COUNTER | Track success rate, set alerts |
| llm_request_errors_total | COUNTER | Monitor error rate by type |
| llm_request_duration_seconds | HISTOGRAM | Detect performance degradation |
| llm_request_retries_total | COUNTER | Measure API instability |
| llm_circuit_breaker_state | GAUGE | Monitor regional health |
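For instance, with Prometheus the metrics in the table map directly onto the standard client types (the names mirror the table; wiring them into your call path is up to you):

from prometheus_client import Counter, Gauge, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "Total LLM requests", ["model", "region", "status"])
LLM_ERRORS = Counter("llm_request_errors_total", "LLM request errors", ["model", "region", "error_type"])
LLM_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency", ["model", "region"])
LLM_RETRIES = Counter("llm_request_retries_total", "Retries performed", ["model", "region"])
CIRCUIT_STATE = Gauge("llm_circuit_breaker_state", "0=closed, 1=half-open, 2=open", ["region"])

# Example instrumentation around a call:
# with LLM_LATENCY.labels(model, region).time():
#     response = call_llm(...)
# LLM_REQUESTS.labels(model, region, "success").inc()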
Alternative Approaches:
Choose the observability stack that fits your infrastructure:
- Cloud Native: CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor
- Open Source: Prometheus + Grafana, OpenTelemetry
- Commercial: Datadog, New Relic, Honeycomb
- Logging: ELK Stack, Splunk, Loki
- Tracing: Jaeger, Zipkin, X-Ray
Critical Insight: Don’t just collect metrics — act on them. Set up alerts when:
- Error rate exceeds 5%
- p95 latency crosses 10 seconds
- Circuit breakers are open for > 2 minutes
- Retry rate exceeds 20%
Structured Logging Best Practices:
{
  "timestamp": "2025-10-12T10:30:45Z",
  "request_id": "obs-001",
  "severity": "INFO",
  "model": "gemini-2.5-flash",
  "region": "us-central1",
  "latency_ms": 1234,
  "retry_count": 1,
  "success": true,
  "operation": "generate_content"
}
Key fields to always include: timestamp, request_id (for correlation), severity, latency, success/failure, and operation type.
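In Python, producing that shape is one json.dumps call away; a minimal sketch using the standard logging module:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("llm")

def log_request(request_id: str, model: str, region: str, latency_ms: int, retry_count: int, success: bool) -> None:
    """Emit one structured log line per LLM call."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "severity": "INFO",
        "model": model,
        "region": region,
        "latency_ms": latency_ms,
        "retry_count": retry_count,
        "success": success,
        "operation": "generate_content",
    }))

log_request("obs-001", "gemini-2.5-flash", "us-central1", 1234, 1, True)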
5. Structured Output with Type Safety
The Problem: LLM responses are unstructured text. Parsing them is error-prone and brittle.
The Pattern: Define schemas for expected outputs and enforce validation.
Why It Matters: Structured outputs eliminate entire classes of bugs. No more regex parsing, no more “extract the JSON from markdown,” no more silent data corruption. You get type safety, validation, and clear contracts.
The Evolution of LLM Output Parsing:
Generation 1: Regex parsing → fragile, fails often
Generation 2: "Return JSON" in prompt → better, but unreliable
Generation 3: JSON mode → more reliable, but no schema
Generation 4: Structured outputs with schemas → type-safe, validated
Modern LLM APIs Support:
Most major providers now offer some form of structured output:
- OpenAI: Function calling, JSON mode, Structured Outputs
- Anthropic: Tool use with JSON schemas
- Google Gemini: JSON mode with Pydantic schema support
One Possible Implementation (with Pydantic):
from pydantic import BaseModel, Field
from typing import List

# Define your schema
class ProductReview(BaseModel):
    """Structured product review analysis"""
    sentiment: str = Field(description="positive, negative, or neutral")
    rating: int = Field(description="Rating from 1-5", ge=1, le=5)
    key_points: List[str] = Field(description="Main points from review")
    recommendation: bool = Field(description="Would recommend?")
    confidence: float = Field(description="Confidence score", ge=0.0, le=1.0)

# Generate structured output
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Analyze this review:
    'This product exceeded my expectations! Great build quality,
    fast shipping. Only minor issue was the instructions.'""",
    config={
        "response_mime_type": "application/json",
        "response_schema": ProductReview,
    },
    request_id="review-001",
)

# Access typed fields - IDE autocomplete works!
print(response.parsed.sentiment)       # "positive"
print(response.parsed.rating)          # 4
print(response.parsed.key_points)      # ["excellent quality", "fast shipping", ...]
print(response.parsed.recommendation)  # True
Benefits:
- Type Safety: Catch errors at development time
- Validation: Pydantic ensures data integrity
- Documentation: Schema serves as API contract
- Testing: Easy to mock and verify
- Maintenance: Schema changes are explicit
- IDE Support: Autocomplete and type checking
Alternative Approaches:
Depending on your language and preferences:
- Python: Pydantic, dataclasses, TypedDict, marshmallow
- TypeScript: Zod, io-ts, class-validator, JSON Schema
- Go: struct tags, validator library
- Java: Jackson, GSON with annotations
- Any Language: JSON Schema for validation
Pro Tip: Even if your LLM doesn’t natively support structured outputs, you can:
- Request JSON in your prompt
- Parse the response
- Validate against your schema
- Retry with error details if validation fails
This is less reliable but still much better than unstructured parsing.
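A rough version of that fallback loop, using Pydantic v2 for validation (generate is a placeholder for your provider’s text-completion call):

from pydantic import BaseModel, ValidationError

def generate_validated(generate, prompt: str, schema: type[BaseModel], max_attempts: int = 3):
    """Ask for JSON, validate it, and feed validation errors back so the model can self-correct."""
    current_prompt = (
        f"{prompt}\n\nRespond with JSON matching this schema:\n{schema.model_json_schema()}"
    )
    last_error = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)  # placeholder: returns the model's raw text
        try:
            return schema.model_validate_json(raw)
        except ValidationError as err:
            last_error = err
            current_prompt = (
                f"{prompt}\n\nYour previous JSON was invalid:\n{err}\nReturn corrected JSON only."
            )
    raise last_error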
Bringing It All Together: Production-Ready Architecture
Now that we’ve covered the five core patterns, let’s talk about how they work together in a production system.
The Reliability Stack:
┌─────────────────────────────────────────┐
│ Application Layer (Your Business Logic)│
├─────────────────────────────────────────┤
│ Structured Output Validation │ ← Type safety
├─────────────────────────────────────────┤
│ Observability Layer │ ← Metrics & Logging
├─────────────────────────────────────────┤
│ Circuit Breaker │ ← Fast failure
├─────────────────────────────────────────┤
│ Multi-Region Router │ ← Failover
├─────────────────────────────────────────┤
│ Retry with Backoff │ ← Transient errors
├─────────────────────────────────────────┤
│ LLM API (Gemini/OpenAI/Claude/etc) │
└─────────────────────────────────────────┘
Each layer handles a specific concern, and together they create a resilient system.
Reference Implementation:
If you want to see how these patterns work together, here’s how I structured my reference implementation:
import os

from gemini_sre import GeminiSREClient
from gemini_sre.core import RetryConfig

def create_production_client():
    """Example production-ready configuration"""
    return GeminiSREClient(
        # Identity
        project_id=os.getenv("GOOGLE_CLOUD_PROJECT"),
        # Multi-Region High Availability
        locations=[
            "us-central1",
            "europe-west1",
            "asia-northeast1",
        ],
        # Retry Strategy
        retry_config=RetryConfig(
            max_attempts=5,
            initial_delay=1.0,
            max_delay=16.0,
            multiplier=2.0,
        ),
        # Circuit Breaker
        enable_circuit_breaker=True,
        circuit_breaker_config={
            "failure_threshold": 5,
            "success_threshold": 2,
            "timeout": 60,
        },
        # Observability
        enable_monitoring=True,
        enable_logging=True,
    )
But this is just one approach. Your production architecture might:
- Use a different LLM provider (OpenAI, Anthropic, Azure OpenAI)
- Implement these patterns in a service mesh instead of client-side
- Use a different language (TypeScript, Go, Java)
- Have different retry strategies based on your SLAs
- Integrate with your existing observability stack
- Include additional patterns like rate limiting, caching, or queue-based processing
The key is having these reliability layers, not implementing them in any particular way.
Build vs. Buy vs. Adapt:
You have three paths forward:
- Build from scratch — Full control, fits your exact needs, requires time
- Use a library — Fast start, but may not fit perfectly
- Adapt an example — Take my implementation (or anyone else’s), modify it
Why I Built a Reference Implementation:
When I was solving these problems for my own production systems, I couldn’t find a complete example that showed all these patterns working together. So I built one.
It’s not meant to be “the solution” — it’s meant to be:
- A learning resource showing working code
- A starting point you can fork and customize
- A reference you can copy patterns from
- Proof that these patterns actually work together
What You Can Do With It:
- ✅ Copy the circuit breaker logic into your existing codebase
- ✅ Fork it and add your own patterns (caching, rate limiting, etc.)
- ✅ Use it as a template for building similar SDKs for other providers
- ✅ Study the architecture to understand the trade-offs
- ✅ Ignore it completely and build your own from these principles
What I Hope You’ll Do:
Don’t just copy-paste. Understand why each pattern exists, then implement it in a way that makes sense for your system. Your architecture, your constraints, your requirements might be different from mine — and that’s totally fine.
The patterns matter more than the code.
Advanced Pattern: Async for High Throughput
Beyond the core five patterns, there’s one more critical consideration for high-scale applications: concurrency.
The Problem:
When you make LLM API calls sequentially, you’re waiting for each one to complete before starting the next. If each call takes 2 seconds and you need to process 100 requests, that’s 200 seconds of waiting — over 3 minutes!
The Pattern: Async/Concurrent Processing
Instead of waiting, process multiple requests concurrently. This doesn’t make individual requests faster, but dramatically increases your throughput.
The Math:
Sequential: 100 requests × 2s each = 200 seconds
Concurrent: 100 requests / 20 parallel = 5 batches × 2s = 10 seconds (20x faster!)
Implementation Options:
Depending on your language and framework:
Python:
- asyncio with aiohttp or httpx
- concurrent.futures ThreadPoolExecutor
- multiprocessing for CPU-bound work
- Async LLM SDKs (most providers offer async versions)
JavaScript/TypeScript:
- Native Promise.all() for concurrent promises
- Async/await patterns (built into the language)
- Worker threads for CPU-intensive tasks
Go:
- Goroutines (concurrency built into the language)
- Channel-based patterns
Java:
- CompletableFuture
- Virtual threads (Java 21+)
- Reactive patterns (Project Reactor, RxJava)
Example Implementation (Python asyncio):
Here’s how I implemented concurrent request processing:
import asyncio
import os

from gemini_sre import AsyncGeminiSREClient

async def process_many_requests():
    """Process 100 requests concurrently"""
    client = AsyncGeminiSREClient(
        project_id=os.getenv("GOOGLE_CLOUD_PROJECT"),
        locations=["us-central1", "europe-west1"],
        enable_monitoring=True,
    )

    # Create 100 concurrent tasks
    tasks = [
        client.models.generate_content(
            model="gemini-2.5-flash",
            contents=f"Question {i}: What is the capital of country {i}?",
            request_id=f"capital-{i}",
        )
        for i in range(100)
    ]

    # Wait for all to complete
    results = await asyncio.gather(*tasks)
    return results

# Run it
results = asyncio.run(process_many_requests())
Real-World Results:
Sequential: ~200 seconds for 100 requests
Concurrent: ~15 seconds for 100 requests
Speedup: 13.3x faster!
Important Caveats:
- Rate Limits: Make sure you’re not exceeding your API quota. Use semaphores to limit concurrency:
semaphore = asyncio.Semaphore(20) # Max 20 concurrent requests
- Memory: Each concurrent request holds memory. Monitor your usage.
- Cost: More concurrent requests might cost more (depending on your pricing model).
- Error Handling: One failure shouldn’t crash all requests. Use asyncio.gather(..., return_exceptions=True). (A combined sketch follows this list.)
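Putting the semaphore and the exception handling together, a bounded version of the earlier batch might look like this (reusing the same client interface as above):

import asyncio

async def bounded_generate(client, prompt: str, request_id: str, semaphore: asyncio.Semaphore):
    async with semaphore:  # never more than N requests in flight
        return await client.models.generate_content(
            model="gemini-2.5-flash",
            contents=prompt,
            request_id=request_id,
        )

async def process_bounded(client, prompts, max_concurrency: int = 20):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        bounded_generate(client, prompt, f"batch-{i}", semaphore)
        for i, prompt in enumerate(prompts)
    ]
    # return_exceptions=True keeps one failure from cancelling the whole batch
    return await asyncio.gather(*tasks, return_exceptions=True)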
When to Use Async:
- ✅ Batch processing many requests
- ✅ User-facing APIs serving multiple clients
- ✅ Data pipelines processing large datasets
- ❌ Simple scripts running occasional requests
- ❌ When sequential processing is required (e.g., conversation context)
Async isn’t just about speed — it’s about resource efficiency. Sequential processing ties up your application waiting for I/O. Async lets you handle hundreds of requests with minimal resource overhead.
But it adds complexity. Only use it when you actually need the throughput.
The Hidden Benefits
Beyond the numbers, these patterns give you:
- Confidence to scale — You know the system handles failures
- Faster debugging — Request IDs and structured logs make issues traceable
- Easier testing — You can simulate failures and verify behavior
- Better sleep — Auto-recovery means fewer 3 AM pages
- Happier users — Consistent, reliable service builds trust
Pattern-Specific Impact
Breaking down by individual pattern:
Retry with Exponential Backoff:
- Reduces transient error rate from ~8% to ~1%
- Typical improvement: 70–80% fewer user-facing errors
Multi-Region Failover:
- Improves availability from ~99.5% to ~99.95%
- Typical improvement: 10x reduction in downtime
Circuit Breaker:
- Reduces wasted time during outages by 80–90%
- Typical improvement: 5–10s faster failure detection
Observability:
- Mean time to detection (MTTD): 30 minutes → 2 minutes
- Mean time to resolution (MTTR): 2 hours → 15 minutes
Structured Output:
- Parsing error rate: 5–10% → <0.1%
- Typical improvement: 50x more reliable response handling
Your Mileage May Vary
These numbers come from real systems, but your results will depend on:
- Your specific LLM provider’s reliability
- Your traffic patterns and query complexity
- Your regional distribution
- How you configure each pattern
The key is to measure YOUR baseline, then measure the improvement. Don’t trust my numbers — collect your own.
Getting Started: Your Path to Production Reliability
Ready to make your LLM application production-ready? Here’s a pragmatic roadmap.
Assess Current State
Before adding complexity, understand where you are:
- Measure your baseline
  - What’s your current error rate?
  - What’s your p95 latency?
  - How many regions are you using?
  - Do you have any retry logic?
  - Can you trace requests?
- Identify your biggest risk
  - Is it regional outages?
  - Rate limiting?
  - Transient errors?
  - Lack of visibility?
- Define your reliability targets
  - What availability do you need? (99%? 99.9%? 99.99%?)
  - What latency is acceptable? (p95 < 5s? < 10s?)
  - What error rate is tolerable? (< 1%? < 5%?)
Start with measurement before optimization.
Add Resilience
Pick ONE pattern to implement first (based on your biggest risk):
If your main issue is transient errors:
- Implement retry with exponential backoff
- Start with 3 retries, 1s initial delay
- Only retry on 429, 503, and timeouts
If your main issue is regional outages:
- Add a second region
- Implement simple failover (try primary, then secondary)
- Test by deliberately failing requests to your primary
If your main issue is hitting quota limits:
- Add 2–3 regions for capacity distribution
- Implement automatic 429 handling (retry on different region)
- Consider round-robin or quota-aware routing
- Calculate total capacity: N regions × per-region quota
If your main issue is cascading failures:
- Implement a circuit breaker
- Start with conservative thresholds (10 failures, 60s timeout)
- Monitor state changes closely
Pro tip: Don’t implement all patterns at once. Add one, measure the impact, tune it, then move to the next.
Add Observability
You can’t improve what you can’t measure:
- Start with basic metrics:
  - request_count (total, by status)
  - request_duration (histogram)
  - retry_count (gauge)
- Add structured logging:
  - Request ID for every call
  - Timestamp, latency, status
  - Region, model, operation type
- Set up dashboards:
  - Real-time error rate
  - Latency percentiles (p50, p95, p99)
  - Requests by region
- Configure alerts:
  - Error rate > 5% for 5 minutes
  - p95 latency > 10s for 5 minutes
  - Circuit breaker open for > 2 minutes
Optimize and Scale
Now that you have data, optimize:
- Tune retry policies based on actual error patterns
- Adjust circuit breaker thresholds based on real outages
- Add structured outputs to reduce parsing errors
- Consider async if you need higher throughput
- Implement caching for repeated queries
- Add rate limiting to protect your quotas
Remember: Production readiness is a journey, not a destination. Keep iterating based on real-world data.
Learning Resources
Want to see working examples?
I’ve built a reference implementation with 16 examples covering everything from basic usage to production patterns:
- Repository: gemini-sre-client
- Examples: Basic, Advanced, Async, and Production patterns
- Setup Guide: Complete installation instructions
- Architecture Docs: Design decisions and trade-offs
Feel free to:
- Use it as-is if it fits your needs
- Fork it and modify for your use case
- Copy patterns into your existing codebase
- Use it as a learning reference
- Ignore it completely and build your own
The MIT license means you can do whatever you want with it. No strings attached.
For other LLM providers:
These patterns aren’t Gemini-specific. Look for similar implementations:
- OpenAI: Check out openai-python with tenacity for retries
- Anthropic: Look at their SDK examples with error handling
- General: Search for “LLM SDK with circuit breaker” in your language
Or better yet, build your own. You’ll learn more that way.
Anti-Patterns to Avoid
Don’t make these common mistakes:
❌ Setting max retries too high (10+)
✅ Use 3–5 retries with exponential backoff
❌ Ignoring circuit breaker state
✅ Monitor and alert on open circuits
❌ Single region for “simplicity”
✅ Multi-region from day one
❌ No request IDs
✅ Every request gets a unique, traceable ID
❌ Blind retries on all errors
✅ Only retry transient errors (429, 503, timeouts)
❌ No timeout configuration
✅ Set aggressive timeouts of 5–10s for most LLM calls (a wrapper sketch follows this list)
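One way to enforce that ceiling regardless of what your client does internally is to wrap the call with asyncio.wait_for (a sketch; 10 seconds is just the upper bound suggested above):

import asyncio

async def generate_with_deadline(client, prompt: str, timeout_s: float = 10.0):
    """Fail fast instead of letting a slow call hang the request path."""
    try:
        return await asyncio.wait_for(
            client.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt,
                request_id="deadline-001",
            ),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # Treat like any other transient failure: retry, switch regions, or surface an error.
        raise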
Monitoring and Alerting Strategy
Set up these critical alerts:
# Error Rate Alert
- alert: HighErrorRate
  expr: rate(gemini_sre_request_error[5m]) > 0.05
  for: 5m
  annotations:
    summary: "Error rate above 5%"

# Latency Alert
- alert: HighLatency
  expr: histogram_quantile(0.95, gemini_sre_request_latency) > 10
  for: 5m
  annotations:
    summary: "P95 latency above 10s"

# Circuit Breaker Alert
- alert: CircuitBreakerOpen
  expr: gemini_sre_circuit_breaker_state == 2
  for: 2m
  annotations:
    summary: "Circuit breaker open for region"
The Bottom Line
Building reliable LLM applications isn’t about having perfect code — it’s about expecting and handling failure gracefully. The patterns we’ve covered aren’t theoretical; they’re battle-tested practices from companies running services at massive scale.
Production-ready doesn’t mean complex. It means thoughtful:
- Retry transient errors, but not forever
- Use multiple regions, but test failover
- Track failures, but recover automatically
- Measure everything, but alert on what matters
- Validate outputs, but handle surprises
You don’t need a special library to implement these patterns. You need:
- Understanding of the principles
- Metrics to guide your decisions
- Willingness to iterate based on real-world data
My reference implementation (gemini-sre-client) exists to help you learn, not to lock you in. Use it, copy from it, ignore it — whatever helps you build better systems.
The real goal is to shift your thinking. When you’re building your next LLM feature, ask yourself:
- What happens if this API call fails?
- What happens if this region goes down?
- How will I know when things go wrong?
- How will the system recover?
Answer those questions, and you’re 80% of the way to production-ready.
Your Turn
I’d love to hear your experiences:
- What reliability challenges have you faced with LLM applications?
- Which patterns have worked well for you?
- What am I missing from this list?
Drop a comment below or reach out. These patterns evolve based on real-world experience, and your lessons help everyone.
If you found this helpful:
- Share it with your team
- Bookmark it for reference
- Try implementing one pattern this week
And if you want to explore the reference implementation, check out gemini-sre-client on GitHub. Even if you don’t use it, the examples and architecture docs might give you ideas for your own system.
Source Credit: https://medium.com/google-cloud/building-bulletproof-llm-applications-a-guide-to-applying-sre-best-practices-1564b72fd22e