
You’ve built a brilliant LLM-powered application. It works perfectly on your laptop. Users love the demo. Then you deploy to production, and reality hits hard:
- API calls fail silently during high load
- Regional outages take down your entire service
- Rate limits are exceeded without warning
- Errors cascade through your system
- You have no visibility into what’s happening
Sound familiar? You’re not alone. The gap between a working prototype and a production-ready LLM application is vast, and it’s littered with the remains of promising projects that couldn’t handle real-world conditions.
The good news? We can bridge this gap by applying Site Reliability Engineering (SRE) principles that have kept massive distributed systems running for decades.
This article is not about promoting a specific library. Instead, I want to share the lessons I learned while building production-ready LLM applications and show you concrete patterns you can implement in your own codebase. I built a reference implementation (gemini-sre-client) to demonstrate these patterns in action, but the principles apply regardless of your tech stack or which LLM provider you use.
Feel free to use my code as inspiration, copy what works for you, or build something entirely different. The goal is to help you think about reliability from day one, not to convince you to use a particular library.
What You’ll Learn:
- The 5 critical SRE patterns for LLM applications
- Why each pattern matters (with real numbers)
- How to implement them (multiple approaches)
- When to use each pattern (and when not to)
- A pragmatic roadmap for adding reliability incrementally
Who This Is For:
- Developers moving LLM prototypes to production
- Engineering teams building LLM-powered services
- SREs supporting AI/ML workloads
- Technical leaders defining reliability standards
Let’s dive in.
The Problem: LLM Applications Are Distributed Systems in Disguise
When you call an LLM API, you’re not just making a simple HTTP request. You’re:
- Hitting an external dependency that you don’t control
- Depending on regional infrastructure that can fail
- Consuming a quota-limited resource that can throttle
- Waiting for variable latencies from milliseconds to minutes
- Parsing complex responses that can be malformed
Traditional error handling isn’t enough. You need the same reliability patterns that power Netflix, Google, and AWS.
The challenge is that most LLM SDKs give you the basics — authentication, request formatting, response parsing — but leave reliability as “an exercise for the reader.” This is fine for prototypes, but dangerous for production.
The Solution: Build Your Own Reliability Layer
Let’s explore five critical SRE patterns every production LLM application needs. I’ll show you the theory behind each pattern, why it matters, and concrete implementation approaches.
Throughout this article, I’ll reference my own implementation (gemini-sre-client) as a working example, but these patterns are universal — you can implement them in any language, with any LLM provider, in whatever way fits your architecture best.
1. Automatic Retry with Exponential Backoff
The Problem: Network hiccups and transient API errors cause requests to fail unnecessarily.
The Pattern: Automatically retry failed requests with increasing delays between attempts.
Why It Matters: In practice, the large majority of transient failures (commonly on the order of 70–80%) resolve within seconds. Without retries, these become user-facing errors. But naive retries can overwhelm failing services — exponential backoff gives systems time to recover.
Key Design Decisions:
When implementing retry logic, you need to decide:
- How many times to retry? (typically 3–5 attempts)
- Initial delay? (usually 1–2 seconds)
- Maximum delay? (cap at 16–32 seconds to avoid hanging)
- Backoff multiplier? (2x is standard: 1s, 2s, 4s, 8s, 16s)
- Which errors to retry? (only transient: 429, 503, timeouts, not 400/401)
- How to handle different error types? (a small classification helper follows this list)
- 429 (Rate Limit): Don’t retry the same region — switch regions immediately if available
- 503 (Service Unavailable): Retry with backoff; the service may recover quickly
- Timeout: Retry with backoff, but consider more aggressive timeout values
- 400/401/403: Don’t retry; these are client errors that won’t resolve
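To make the retry-eligibility decision concrete, here’s a small helper (an illustration, not part of any SDK) that classifies status codes and computes a jittered backoff delay:

import random

RETRYABLE_STATUS = {429, 503, 504}      # transient: rate limit, unavailable, gateway timeout
NON_RETRYABLE_STATUS = {400, 401, 403}  # client errors that a retry will never fix

def is_retryable(status_code: int) -> bool:
    """Only retry errors that are likely to succeed on a later attempt."""
    # Note: for 429 you may prefer to switch regions instead of retrying in place.
    return status_code in RETRYABLE_STATUS

def backoff_delay(attempt: int, initial: float = 1.0, multiplier: float = 2.0, max_delay: float = 16.0) -> float:
    """Exponential backoff (1s, 2s, 4s, 8s, 16s) with full jitter to avoid retry storms."""
    delay = min(initial * (multiplier ** attempt), max_delay)
    return random.uniform(0, delay)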
One Possible Implementation:
Here’s how I approached this in my reference implementation:
from gemini_sre import GeminiSREClient
from gemini_sre.core import RetryConfig

client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1"],
    retry_config=RetryConfig(
        max_attempts=5,      # Try up to 5 times
        initial_delay=1.0,   # Wait 1s after first failure
        max_delay=16.0,      # Cap at 16s between attempts
        multiplier=2.0,      # Double delay each time (1s, 2s, 4s, 8s, 16s)
    ),
)

# This request automatically retries on transient failures
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain quantum computing",
    request_id="quantum-001",  # For tracing
)
Alternative Approaches:
You could also:
- Use a library like tenacity or backoff in Python (a tenacity sketch follows below)
- Implement retry logic in middleware/decorators
- Use your API gateway’s built-in retry policies (e.g., Envoy, Kong)
- Build custom retry logic specific to your error handling strategy
The key is having some retry mechanism, not necessarily this exact implementation.
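For example, a decorator-based approach with tenacity might look roughly like this (a sketch; the exception class is a placeholder for whatever your client actually raises on 429/503/timeouts):

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class TransientAPIError(Exception):
    """Placeholder for your client's transient errors (429, 503, timeouts)."""

@retry(
    retry=retry_if_exception_type((TransientAPIError, TimeoutError)),
    stop=stop_after_attempt(5),                          # give up after 5 tries
    wait=wait_exponential(multiplier=1, min=1, max=16),  # exponential backoff, capped at 16s
    reraise=True,                                        # surface the final error to the caller
)
def generate_with_retry(client, prompt: str):
    # Replace with your provider's actual call; raise TransientAPIError on retryable failures.
    return client.generate(prompt)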
2. Multi-Region Failover
The Problem: Regional outages happen. AWS regions go down. Google Cloud regions experience degraded performance. Your single-region application becomes unavailable.
But there’s another critical issue: quota limits. Even when a region is perfectly healthy, you might hit your rate limit (e.g., 60 requests per minute). If all your traffic goes to one region, you’re artificially capping your throughput at that region’s quota — even though you have unused capacity in other regions.
The Pattern: Configure multiple regions and intelligently distribute load across them. Fail over when one becomes unavailable OR when you hit quota limits.
Why It Matters:
For availability: Most cloud providers publish regional availability around 99.5% per region. Using two independent regions, your theoretical availability jumps to 99.9975%. That’s the difference between 43 hours of downtime per year versus 13 minutes.
For capacity: If you have a quota of 60 requests/minute per region, using 3 regions gives you 180 requests/minute total capacity. Without multi-region support, you’re leaving 120 requests/minute on the table.
Key Design Decisions:
When implementing multi-region support, consider:
- Region selection — Choose geographically diverse regions (different continents if possible)
- Distribution strategy — Failover only? Round-robin load balancing? Quota-aware routing?
- Failover triggers — What conditions cause a switch? (timeouts, error rates, 429 rate limits, circuit breaker)
- Quota management — How to track and distribute capacity across regions?
- Data locality — Are there compliance concerns with certain regions?
- Cost implications — Different regions have different pricing
Common Multi-Region Strategies:
Strategy 1: Primary + Failover (Simplest)
- Send all traffic to primary region
- Only use secondary when primary fails or returns 429 (rate limit)
- Good for: Basic reliability, simple quota overflow
Strategy 2: Round-Robin (Better capacity utilization)
- Distribute requests evenly across all regions
- Maximizes total quota (3 regions = 3x capacity)
- Good for: High-throughput applications (a short round-robin sketch appears after this list)
Strategy 3: Quota-Aware Routing (Most sophisticated)
- Track quota usage per region in real-time
- Route to least-utilized region
- Dynamically adjust based on remaining capacity
- Good for: Applications with variable traffic patterns
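Strategy 2 can be as small as a cycling iterator over your configured regions. A minimal sketch (illustrative names, not an SDK API):

from itertools import cycle

class RoundRobinRouter:
    """Distribute requests evenly across regions to tap each region's quota."""

    def __init__(self, regions):
        self._regions = cycle(regions)

    def next_region(self) -> str:
        return next(self._regions)

router = RoundRobinRouter(["us-central1", "europe-west1", "asia-northeast1"])
region = router.next_region()  # us-central1, then europe-west1, then asia-northeast1, then wraps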
One Possible Implementation:
Here’s a simple approach with ordered fallback that handles both failures and rate limits:
client = GeminiSREClient(
    project_id="your-project",
    locations=[
        "us-central1",      # Primary: Iowa
        "europe-west1",     # Secondary: Belgium
        "asia-northeast1",  # Tertiary: Tokyo
    ],
    enable_circuit_breaker=True,  # Automatically skip failing regions
)

# The client automatically tries regions in order
# If us-central1 fails (500, timeout) -> tries europe-west1
# If us-central1 returns 429 (rate limit) -> tries europe-west1
# If europe-west1 fails -> tries asia-northeast1
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write me a poem about resilience",
    request_id="poem-001",
)
How Quota Management Works:
When you hit a 429 rate limit error, the system should:
- Immediately try the next region (don’t waste time retrying the same region)
- Optionally: temporarily reduce traffic to that region
- Periodically test if quota has reset
Example Scenario:
10:00:00 - us-central1 receives 60 requests (at quota limit)
10:00:01 - Request 61 returns 429 (rate limit exceeded)
10:00:01 - Client automatically switches to europe-west1
10:00:01 - Request succeeds on europe-west1
10:01:00 - Quota resets on us-central1, back to normal rotation
This effectively triples your capacity from 60 req/min to 180 req/min across three regions.
Quota Management Pro Tip:
Multi-region isn’t just for disaster recovery — it’s a capacity multiplier. Many teams implement multi-region primarily for quota management, not outages.
Common Quota Scenarios:
| Scenario | Single Region | Multi-Region (3) | Improvement |
|---|---|---|---|
| Rate limit | 60 req/min | 180 req/min | 3x capacity |
| Daily quota | 10K requests | 30K requests | 3x capacity |
| Concurrent requests | 10 parallel | 30 parallel | 3x throughput |
When to prioritize multi-region for quota:
- ✅ You’re hitting rate limits during peak traffic
- ✅ You have spiky/variable traffic patterns
- ✅ You’re growing and need more capacity
- ✅ Your current quota is too expensive to increase
- ❌ You have plenty of unused quota headroom
- ❌ Your traffic is very predictable and low
Real-World Story: A team I worked with was hitting their 60 req/min limit daily at peak hours. Instead of requesting a quota increase (expensive and slow), they implemented 3-region routing. Overnight, they had 180 req/min capacity with zero additional quota costs.
Alternative Approaches:
Depending on your architecture, you might:
For basic failover:
- Use a global load balancer (e.g., Cloud Load Balancing, AWS Global Accelerator)
- Implement client-side region selection with health checks
- Use DNS-based failover
- Build a proxy service that handles region routing
For quota management:
- Token bucket per region — Track quota usage and route to available regions
- Weighted round-robin — Distribute based on remaining quota percentage
- Least-utilized routing — Always pick the region with most available quota
- Time-based rotation — If quotas reset at fixed intervals, rotate strategically
For sophisticated routing:
- Use a service mesh with locality-aware load balancing
- Implement a quota-aware API gateway
- Build a central routing service that tracks regional capacity
- Use client-side libraries with shared quota state (e.g., Redis for tracking)
Real-World Example:
# Pseudo-code for quota-aware routing
def select_region(regions, quota_tracker):
    for region in regions:
        if quota_tracker.has_capacity(region):
            return region
    # If all regions are at quota, use least-full region
    return quota_tracker.get_least_utilized_region(regions)
Implementation Note: The trickiest part is often session state. If your LLM calls maintain context (like chat histories), you need a strategy for maintaining that state across regions — shared storage, session replication, or accepting that failover means losing context.
For quota management specifically, you also need to decide: do you track quota client-side (risk of overestimation) or server-side (requires centralized state)?
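As one concrete client-side option, the quota_tracker assumed in the pseudo-code above could be a fixed-window counter per region; this sketch deliberately accepts the overestimation risk just mentioned:

import time
from collections import defaultdict

class QuotaTracker:
    """Fixed-window, in-process request counter per region (a client-side estimate only)."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.window_start = defaultdict(float)
        self.counts = defaultdict(int)

    def _roll_window(self, region: str) -> None:
        now = time.monotonic()
        if now - self.window_start[region] >= 60:
            self.window_start[region] = now
            self.counts[region] = 0

    def has_capacity(self, region: str) -> bool:
        self._roll_window(region)
        return self.counts[region] < self.limit

    def record(self, region: str) -> None:
        self._roll_window(region)
        self.counts[region] += 1

    def get_least_utilized_region(self, regions) -> str:
        for region in regions:
            self._roll_window(region)
        return min(regions, key=lambda r: self.counts[r])

In a multi-process deployment, the same interface could be backed by shared counters in Redis instead.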
3. Circuit Breaker Pattern
The Problem: A failing region continues to waste time and quota on doomed requests. Your retry logic keeps hitting a brick wall.
The Pattern: Track failure rates per region. When a region exceeds a failure threshold, “open the circuit” and skip it temporarily. Test periodically to see if it has recovered.
Why It Matters: Without circuit breakers, you waste valuable seconds on every request trying regions that you know are down. With circuit breakers, you fail fast and preserve your rate limits for working regions.
Circuit Breaker State Machine:
- CLOSED: Normal operation, requests flow through
- OPEN: Too many failures, region is skipped automatically
- HALF_OPEN: Testing if region has recovered
Key Design Decisions:
- Failure threshold — How many failures before opening? (typically 3–10)
- Success threshold — How many successes to close again? (typically 1–3)
- Timeout duration — How long to wait before testing recovery? (30–120 seconds)
- Granularity — Circuit breaker per region? Per endpoint? Per model?
- Reset strategy — Automatic timeout vs. manual intervention?
Example Scenario:
10:00:00 AM - us-central1 starts failing
10:00:15 AM - After 5 failures, circuit opens for us-central1
10:00:16 AM - All requests now go directly to europe-west1 (no wasted timeouts)
10:01:15 AM - Circuit transitions to HALF_OPEN (60s timeout elapsed)
10:01:16 AM - Test request to us-central1 succeeds
10:01:16 AM - Circuit closes, us-central1 back in rotation
One Possible Implementation:
client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1", "europe-west1"],
    enable_circuit_breaker=True,
    circuit_breaker_config={
        "failure_threshold": 5,  # Open circuit after 5 failures
        "success_threshold": 2,  # Close after 2 successes
        "timeout": 60,           # Test recovery after 60 seconds
    },
)
Alternative Implementations:
You have several options:
- Library-based: Use libraries like pybreaker (Python), resilience4j (Java), or Polly (.NET)
- Service mesh: Envoy, Istio, and Linkerd have built-in circuit breakers
- API Gateway: Many gateways (Kong, Apigee) offer circuit breaker plugins
- Custom implementation: Build your own with simple counters and timers (a minimal sketch follows below)
- Cloud-native: Some cloud providers offer managed circuit breakers
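If you take the custom route, the state machine fits in a few dozen lines. A minimal per-region sketch (illustrative only, not the reference library’s internals):

import time

class CircuitBreaker:
    """Per-region circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, success_threshold: int = 2, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: let a test request through
                self.successes = 0
                return True
            return False
        return True  # CLOSED and HALF_OPEN both allow traffic

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()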
Real Impact: For an application making 100 requests/minute during a 5-minute regional outage, circuit breakers save roughly 500–1,000 seconds of cumulative timeout waiting (about 500 doomed requests, each spared 1–2 seconds of dead time), and those requests get served immediately by working regions instead.
4. Comprehensive Observability
The Problem: When things go wrong in production, you’re flying blind. No metrics, no logs, no idea what happened.
The Pattern: Emit structured metrics and logs for every operation. Track what matters.
Why It Matters: The SRE mantra is “you can’t improve what you don’t measure.” Without observability, you can’t:
- Detect issues before users complain
- Debug problems after they occur
- Optimize performance
- Prove reliability to stakeholders
- Make data-driven decisions
What to Measure:
At minimum, you need:
- Success/Failure Rates — Are requests succeeding? What’s the error rate?
- Latency Distribution — p50, p95, p99 latencies (not just averages!)
- Retry Counts — How often are you retrying? Which errors?
- Regional Health — Which regions are healthy? Circuit breaker states?
- Request Correlation — Can you trace a request end-to-end?
One Possible Implementation:
Here’s how I integrated with Cloud Monitoring and Logging:
client = GeminiSREClient(
    project_id="your-project",
    locations=["us-central1", "europe-west1"],
    enable_monitoring=True,  # Send metrics to Cloud Monitoring
    enable_logging=True,     # Send logs to Cloud Logging
)

# Every request automatically generates:
# - Success/failure metrics
# - Latency distributions (p50, p95, p99)
# - Retry count gauges
# - Circuit breaker state metrics
# - Structured logs with request IDs
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain observability",
    request_id="obs-001",  # Crucial for tracing
)
Sample Metrics to Track:
| Metric | Type | Use Case |
|---|---|---|
| llm_requests_total | COUNTER | Track success rate, set alerts |
| llm_request_errors_total | COUNTER | Monitor error rate by type |
| llm_request_duration_seconds | HISTOGRAM | Detect performance degradation |
| llm_request_retries_total | COUNTER | Measure API instability |
| llm_circuit_breaker_state | GAUGE | Monitor regional health |
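For instance, with Prometheus the metrics in the table map directly onto the standard client types (the names mirror the table; wiring them into your call path is up to you):

from prometheus_client import Counter, Gauge, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "Total LLM requests", ["model", "region", "status"])
LLM_ERRORS = Counter("llm_request_errors_total", "LLM request errors", ["model", "region", "error_type"])
LLM_LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency", ["model", "region"])
LLM_RETRIES = Counter("llm_request_retries_total", "Retries performed", ["model", "region"])
CIRCUIT_STATE = Gauge("llm_circuit_breaker_state", "0=closed, 1=half-open, 2=open", ["region"])

# Example instrumentation around a call:
# with LLM_LATENCY.labels(model, region).time():
#     response = call_llm(...)
# LLM_REQUESTS.labels(model, region, "success").inc()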
Alternative Approaches:
Choose the observability stack that fits your infrastructure:
- Cloud Native: CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor
- Open Source: Prometheus + Grafana, OpenTelemetry
- Commercial: Datadog, New Relic, Honeycomb
- Logging: ELK Stack, Splunk, Loki
- Tracing: Jaeger, Zipkin, X-Ray
Critical Insight: Don’t just collect metrics — act on them. Set up alerts when:
- Error rate exceeds 5%
- p95 latency crosses 10 seconds
- Circuit breakers are open for > 2 minutes
- Retry rate exceeds 20%
Structured Logging Best Practices:
{
  "timestamp": "2025-10-12T10:30:45Z",
  "request_id": "obs-001",
  "severity": "INFO",
  "model": "gemini-2.5-flash",
  "region": "us-central1",
  "latency_ms": 1234,
  "retry_count": 1,
  "success": true,
  "operation": "generate_content"
}
Key fields to always include: timestamp, request_id (for correlation), severity, latency, success/failure, and operation type.
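In Python, producing that shape is one json.dumps call away; a minimal sketch using the standard logging module:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("llm")

def log_request(request_id: str, model: str, region: str, latency_ms: int, retry_count: int, success: bool) -> None:
    """Emit one structured log line per LLM call."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "severity": "INFO",
        "model": model,
        "region": region,
        "latency_ms": latency_ms,
        "retry_count": retry_count,
        "success": success,
        "operation": "generate_content",
    }))

log_request("obs-001", "gemini-2.5-flash", "us-central1", 1234, 1, True)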
5. Structured Output with Type Safety
The Problem: LLM responses are unstructured text. Parsing them is error-prone and brittle.
The Pattern: Define schemas for expected outputs and enforce validation.
Why It Matters: Structured outputs eliminate entire classes of bugs. No more regex parsing, no more “extract the JSON from markdown,” no more silent data corruption. You get type safety, validation, and clear contracts.
The Evolution of LLM Output Parsing:
Generation 1: Regex parsing → fragile, fails often
Generation 2: "Return JSON" in prompt → better, but unreliable
Generation 3: JSON mode → more reliable, but no schema
Generation 4: Structured outputs with schemas → type-safe, validated
Modern LLM APIs Support:
Most major providers now offer some form of structured output:
- OpenAI: Function calling, JSON mode, Structured Outputs
- Anthropic: Tool use with JSON schemas
- Google Gemini: JSON mode with Pydantic schema support
One Possible Implementation (with Pydantic):
from pydantic import BaseModel, Field
from typing import List

# Define your schema
class ProductReview(BaseModel):
    """Structured product review analysis"""
    sentiment: str = Field(description="positive, negative, or neutral")
    rating: int = Field(description="Rating from 1-5", ge=1, le=5)
    key_points: List[str] = Field(description="Main points from review")
    recommendation: bool = Field(description="Would recommend?")
    confidence: float = Field(description="Confidence score", ge=0.0, le=1.0)

# Generate structured output
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Analyze this review:
    'This product exceeded my expectations! Great build quality,
    fast shipping. Only minor issue was the instructions.'""",
    config={
        "response_mime_type": "application/json",
        "response_schema": ProductReview,
    },
    request_id="review-001",
)

# Access typed fields - IDE autocomplete works!
print(response.parsed.sentiment)       # "positive"
print(response.parsed.rating)          # 4
print(response.parsed.key_points)      # ["excellent quality", "fast shipping", ...]
print(response.parsed.recommendation)  # True
Benefits:
- Type Safety: Catch errors at development time
- Validation: Pydantic ensures data integrity
- Documentation: Schema serves as API contract
- Testing: Easy to mock and verify
- Maintenance: Schema changes are explicit
- IDE Support: Autocomplete and type checking
Alternative Approaches:
Depending on your language and preferences:
- Python: Pydantic, dataclasses, TypedDict, marshmallow
- TypeScript: Zod, io-ts, class-validator, JSON Schema
- Go: struct tags, validator library
- Java: Jackson, GSON with annotations
- Any Language: JSON Schema for validation
Pro Tip: Even if your LLM doesn’t natively support structured outputs, you can:
- Request JSON in your prompt
- Parse the response
- Validate against your schema
- Retry with error details if validation fails
This is less reliable but still much better than unstructured parsing.
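A rough version of that fallback loop, using Pydantic v2 for validation (generate is a placeholder for your provider’s text-completion call):

from pydantic import BaseModel, ValidationError

def generate_validated(generate, prompt: str, schema: type[BaseModel], max_attempts: int = 3):
    """Ask for JSON, validate it, and feed validation errors back so the model can self-correct."""
    current_prompt = (
        f"{prompt}\n\nRespond with JSON matching this schema:\n{schema.model_json_schema()}"
    )
    last_error = None
    for _ in range(max_attempts):
        raw = generate(current_prompt)  # placeholder: returns the model's raw text
        try:
            return schema.model_validate_json(raw)
        except ValidationError as err:
            last_error = err
            current_prompt = (
                f"{prompt}\n\nYour previous JSON was invalid:\n{err}\nReturn corrected JSON only."
            )
    raise last_error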
Bringing It All Together: Production-Ready Architecture
Now that we’ve covered the five core patterns, let’s talk about how they work together in a production system.
The Reliability Stack:
┌─────────────────────────────────────────┐
│ Application Layer (Your Business Logic)│
├─────────────────────────────────────────┤
│ Structured Output Validation │ ← Type safety
├─────────────────────────────────────────┤
│ Observability Layer │ ← Metrics & Logging
├─────────────────────────────────────────┤
│ Circuit Breaker │ ← Fast failure
├─────────────────────────────────────────┤
│ Multi-Region Router │ ← Failover
├─────────────────────────────────────────┤
│ Retry with Backoff │ ← Transient errors
├─────────────────────────────────────────┤
│ LLM API (Gemini/OpenAI/Claude/etc) │
└─────────────────────────────────────────┘
Each layer handles a specific concern, and together they create a resilient system.
Reference Implementation:
If you want to see how these patterns work together, here’s how I structured my reference implementation:
import os

from gemini_sre import GeminiSREClient
from gemini_sre.core import RetryConfig

def create_production_client():
    """Example production-ready configuration"""
    return GeminiSREClient(
        # Identity
        project_id=os.getenv("GOOGLE_CLOUD_PROJECT"),
        # Multi-Region High Availability
        locations=[
            "us-central1",
            "europe-west1",
            "asia-northeast1",
        ],
        # Retry Strategy
        retry_config=RetryConfig(
            max_attempts=5,
            initial_delay=1.0,
            max_delay=16.0,
            multiplier=2.0,
        ),
        # Circuit Breaker
        enable_circuit_breaker=True,
        circuit_breaker_config={
            "failure_threshold": 5,
            "success_threshold": 2,
            "timeout": 60,
        },
        # Observability
        enable_monitoring=True,
        enable_logging=True,
    )
But this is just one approach. Your production architecture might:
- Use a different LLM provider (OpenAI, Anthropic, Azure OpenAI)
- Implement these patterns in a service mesh instead of client-side
- Use a different language (TypeScript, Go, Java)
- Have different retry strategies based on your SLAs
- Integrate with your existing observability stack
- Include additional patterns like rate limiting, caching, or queue-based processing
The key is having these reliability layers, not implementing them in any particular way.
Build vs. Buy vs. Adapt:
You have three paths forward:
- Build from scratch — Full control, fits your exact needs, requires time
- Use a library — Fast start, but may not fit perfectly
- Adapt an example — Take my implementation (or anyone else’s), modify it
Why I Built a Reference Implementation:
When I was solving these problems for my own production systems, I couldn’t find a complete example that showed all these patterns working together. So I built one.
It’s not meant to be “the solution” — it’s meant to be:
- A learning resource showing working code
- A starting point you can fork and customize
- A reference you can copy patterns from
- Proof that these patterns actually work together
What You Can Do With It:
- ✅ Copy the circuit breaker logic into your existing codebase
- ✅ Fork it and add your own patterns (caching, rate limiting, etc.)
- ✅ Use it as a template for building similar SDKs for other providers
- ✅ Study the architecture to understand the trade-offs
- ✅ Ignore it completely and build your own from these principles
What I Hope You’ll Do:
Don’t just copy-paste. Understand why each pattern exists, then implement it in a way that makes sense for your system. Your architecture, your constraints, your requirements might be different from mine — and that’s totally fine.
The patterns matter more than the code.
Advanced Pattern: Async for High Throughput
Beyond the core five patterns, there’s one more critical consideration for high-scale applications: concurrency.
The Problem:
When you make LLM API calls sequentially, you’re waiting for each one to complete before starting the next. If each call takes 2 seconds and you need to process 100 requests, that’s 200 seconds of waiting — over 3 minutes!
The Pattern: Async/Concurrent Processing
Instead of waiting, process multiple requests concurrently. This doesn’t make individual requests faster, but dramatically increases your throughput.
The Math:
Sequential: 100 requests × 2s each = 200 seconds
Concurrent: 100 requests / 20 parallel = 5 batches × 2s = 10 seconds (20x faster!)
Implementation Options:
Depending on your language and framework:
Python:
- asyncio with aiohttp or httpx
- concurrent.futures ThreadPoolExecutor
- multiprocessing for CPU-bound work
- Async LLM SDKs (most providers offer async versions)
JavaScript/TypeScript:
- Native Promise.all() for concurrent promises
- Async/await patterns (built into the language)
- Worker threads for CPU-intensive tasks
Go:
- Goroutines (concurrency built into the language)
- Channel-based patterns
Java:
- CompletableFuture
- Virtual threads (Java 21+)
- Reactive patterns (Project Reactor, RxJava)
Example Implementation (Python asyncio):
Here’s how I implemented concurrent request processing:
import asyncio
import os

from gemini_sre import AsyncGeminiSREClient

async def process_many_requests():
    """Process 100 requests concurrently"""
    client = AsyncGeminiSREClient(
        project_id=os.getenv("GOOGLE_CLOUD_PROJECT"),
        locations=["us-central1", "europe-west1"],
        enable_monitoring=True,
    )

    # Create 100 concurrent tasks
    tasks = [
        client.models.generate_content(
            model="gemini-2.5-flash",
            contents=f"Question {i}: What is the capital of country {i}?",
            request_id=f"capital-{i}",
        )
        for i in range(100)
    ]

    # Wait for all to complete
    results = await asyncio.gather(*tasks)
    return results

# Run it
results = asyncio.run(process_many_requests())
Real-World Results:
Sequential: ~200 seconds for 100 requests
Concurrent: ~15 seconds for 100 requests
Speedup: 13.3x faster!
Important Caveats:
- Rate Limits: Make sure you’re not exceeding your API quota. Use semaphores to limit concurrency:
semaphore = asyncio.Semaphore(20) # Max 20 concurrent requests
- Memory: Each concurrent request holds memory. Monitor your usage.
- Cost: More concurrent requests might cost more (depending on your pricing model).
- Error Handling: One failure shouldn’t crash all requests. Use asyncio.gather(..., return_exceptions=True). (A combined sketch follows this list.)
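Putting the semaphore and the exception handling together, a bounded version of the earlier batch might look like this (reusing the same client interface as above):

import asyncio

async def bounded_generate(client, prompt: str, request_id: str, semaphore: asyncio.Semaphore):
    async with semaphore:  # never more than N requests in flight
        return await client.models.generate_content(
            model="gemini-2.5-flash",
            contents=prompt,
            request_id=request_id,
        )

async def process_bounded(client, prompts, max_concurrency: int = 20):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        bounded_generate(client, prompt, f"batch-{i}", semaphore)
        for i, prompt in enumerate(prompts)
    ]
    # return_exceptions=True keeps one failure from cancelling the whole batch
    return await asyncio.gather(*tasks, return_exceptions=True)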
When to Use Async:
- ✅ Batch processing many requests
- ✅ User-facing APIs serving multiple clients
- ✅ Data pipelines processing large datasets
- ❌ Simple scripts running occasional requests
- ❌ When sequential processing is required (e.g., conversation context)
Async isn’t just about speed — it’s about resource efficiency. Sequential processing ties up your application waiting for I/O. Async lets you handle hundreds of requests with minimal resource overhead.
But it adds complexity. Only use it when you actually need the throughput.
The Hidden Benefits
Beyond the numbers, these patterns give you:
- Confidence to scale — You know the system handles failures
- Faster debugging — Request IDs and structured logs make issues traceable
- Easier testing — You can simulate failures and verify behavior
- Better sleep — Auto-recovery means fewer 3 AM pages
- Happier users — Consistent, reliable service builds trust
Pattern-Specific Impact
Breaking down by individual pattern:
Retry with Exponential Backoff:
- Reduces transient error rate from ~8% to ~1%
- Typical improvement: 70–80% fewer user-facing errors
Multi-Region Failover:
- Improves availability from ~99.5% to ~99.95%
- Typical improvement: 10x reduction in downtime
Circuit Breaker:
- Reduces wasted time during outages by 80–90%
- Typical improvement: 5–10s faster failure detection
Observability:
- Mean time to detection (MTTD): 30 minutes → 2 minutes
- Mean time to resolution (MTTR): 2 hours → 15 minutes
Structured Output:
- Parsing error rate: 5–10% → <0.1%
- Typical improvement: 50x more reliable response handling
Your Mileage May Vary
These numbers come from real systems, but your results will depend on:
- Your specific LLM provider’s reliability
- Your traffic patterns and query complexity
- Your regional distribution
- How you configure each pattern
The key is to measure YOUR baseline, then measure the improvement. Don’t trust my numbers — collect your own.
Getting Started: Your Path to Production Reliability
Ready to make your LLM application production-ready? Here’s a pragmatic roadmap.
Assess Current State
Before adding complexity, understand where you are:
- Measure your baseline
  - What’s your current error rate?
  - What’s your p95 latency?
  - How many regions are you using?
  - Do you have any retry logic?
  - Can you trace requests?
- Identify your biggest risk
  - Is it regional outages?
  - Rate limiting?
  - Transient errors?
  - Lack of visibility?
- Define your reliability targets
  - What availability do you need? (99%? 99.9%? 99.99%?)
  - What latency is acceptable? (p95 < 5s? < 10s?)
  - What error rate is tolerable? (< 1%? < 5%?)
Start with measurement before optimization.
Add Resilience
Pick ONE pattern to implement first (based on your biggest risk):
If your main issue is transient errors:
- Implement retry with exponential backoff
- Start with 3 retries, 1s initial delay
- Only retry on 429, 503, and timeouts
If your main issue is regional outages:
- Add a second region
- Implement simple failover (try primary, then secondary)
- Test by deliberately failing requests to your primary
If your main issue is hitting quota limits:
- Add 2–3 regions for capacity distribution
- Implement automatic 429 handling (retry on different region)
- Consider round-robin or quota-aware routing
- Calculate total capacity: N regions × per-region quota
If your main issue is cascading failures:
- Implement a circuit breaker
- Start with conservative thresholds (10 failures, 60s timeout)
- Monitor state changes closely
Pro tip: Don’t implement all patterns at once. Add one, measure the impact, tune it, then move to the next.
Add Observability
You can’t improve what you can’t measure:
- Start with basic metrics:
  - request_count (total, by status)
  - request_duration (histogram)
  - retry_count (gauge)
- Add structured logging:
  - Request ID for every call
  - Timestamp, latency, status
  - Region, model, operation type
- Set up dashboards:
  - Real-time error rate
  - Latency percentiles (p50, p95, p99)
  - Requests by region
- Configure alerts:
  - Error rate > 5% for 5 minutes
  - p95 latency > 10s for 5 minutes
  - Circuit breaker open for > 2 minutes
Optimize and Scale
Now that you have data, optimize:
- Tune retry policies based on actual error patterns
- Adjust circuit breaker thresholds based on real outages
- Add structured outputs to reduce parsing errors
- Consider async if you need higher throughput
- Implement caching for repeated queries
- Add rate limiting to protect your quotas
Remember: Production readiness is a journey, not a destination. Keep iterating based on real-world data.
Learning Resources
Want to see working examples?
I’ve built a reference implementation with 16 examples covering everything from basic usage to production patterns:
- Repository: gemini-sre-client
- Examples: Basic, Advanced, Async, and Production patterns
- Setup Guide: Complete installation instructions
- Architecture Docs: Design decisions and trade-offs
Feel free to:
- Use it as-is if it fits your needs
- Fork it and modify for your use case
- Copy patterns into your existing codebase
- Use it as a learning reference
- Ignore it completely and build your own
The MIT license means you can do whatever you want with it. No strings attached.
For other LLM providers:
These patterns aren’t Gemini-specific. Look for similar implementations:
- OpenAI: Check out openai-python with tenacity for retries
- Anthropic: Look at their SDK examples with error handling
- General: Search for “LLM SDK with circuit breaker” in your language
Or better yet, build your own. You’ll learn more that way.
Anti-Patterns to Avoid
Don’t make these common mistakes:
❌ Setting max retries too high (10+)
✅ Use 3–5 retries with exponential backoff
❌ Ignoring circuit breaker state
✅ Monitor and alert on open circuits
❌ Single region for “simplicity”
✅ Multi-region from day one
❌ No request IDs
✅ Every request gets a unique, traceable ID
❌ Blind retries on all errors
✅ Only retry transient errors (429, 503, timeouts)
❌ No timeout configuration
✅ Set aggressive timeouts of 5–10s for most LLM calls (a wrapper sketch follows this list)
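One way to enforce that ceiling regardless of what your client does internally is to wrap the call with asyncio.wait_for (a sketch; 10 seconds is just the upper bound suggested above):

import asyncio

async def generate_with_deadline(client, prompt: str, timeout_s: float = 10.0):
    """Fail fast instead of letting a slow call hang the request path."""
    try:
        return await asyncio.wait_for(
            client.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt,
                request_id="deadline-001",
            ),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # Treat like any other transient failure: retry, switch regions, or surface an error.
        raise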
Monitoring and Alerting Strategy
Set up these critical alerts:
# Error Rate Alert
- alert: HighErrorRate
  expr: rate(gemini_sre_request_error[5m]) > 0.05
  for: 5m
  annotations:
    summary: "Error rate above 5%"

# Latency Alert
- alert: HighLatency
  expr: histogram_quantile(0.95, gemini_sre_request_latency) > 10
  for: 5m
  annotations:
    summary: "P95 latency above 10s"

# Circuit Breaker Alert
- alert: CircuitBreakerOpen
  expr: gemini_sre_circuit_breaker_state == 2
  for: 2m
  annotations:
    summary: "Circuit breaker open for region"
The Bottom Line
Building reliable LLM applications isn’t about having perfect code — it’s about expecting and handling failure gracefully. The patterns we’ve covered aren’t theoretical; they’re battle-tested practices from companies running services at massive scale.
Production-ready doesn’t mean complex. It means thoughtful:
- Retry transient errors, but not forever
- Use multiple regions, but test failover
- Track failures, but recover automatically
- Measure everything, but alert on what matters
- Validate outputs, but handle surprises
You don’t need a special library to implement these patterns. You need:
- Understanding of the principles
- Metrics to guide your decisions
- Willingness to iterate based on real-world data
My reference implementation (gemini-sre-client) exists to help you learn, not to lock you in. Use it, copy from it, ignore it — whatever helps you build better systems.
The real goal is to shift your thinking. When you’re building your next LLM feature, ask yourself:
- What happens if this API call fails?
- What happens if this region goes down?
- How will I know when things go wrong?
- How will the system recover?
Answer those questions, and you’re 80% of the way to production-ready.
Your Turn
I’d love to hear your experiences:
- What reliability challenges have you faced with LLM applications?
- Which patterns have worked well for you?
- What am I missing from this list?
Drop a comment below or reach out. These patterns evolve based on real-world experience, and your lessons help everyone.
If you found this helpful:
- Share it with your team
- Bookmark it for reference
- Try implementing one pattern this week
And if you want to explore the reference implementation, check out gemini-sre-client on GitHub. Even if you don’t use it, the examples and architecture docs might give you ideas for your own system.
Source Credit: https://medium.com/google-cloud/building-bulletproof-llm-applications-a-guide-to-applying-sre-best-practices-1564b72fd22e