MCP Reliability Series — Part 3: mcp-chaos-monkey for Python

In Part 2 of this series, I introduced mcp-chaos-monkey — a fault injection framework purpose-built for MCP pipelines. That article covered the design philosophy, the Netflix Chaos Monkey inspiration, and the core architecture using Node.js/TypeScript. If you haven’t read it yet, I recommend starting there.
This article covers what’s new: a native Python implementation of mcp-chaos-monkey — same design, same 8 fault types, same API shape, but built from the ground up for the Python ecosystem.
Why a Python Port?
The MCP ecosystem is bilingual. The official MCP SDK ships in both TypeScript and Python. Teams building MCP agents use FastAPI, Starlette, and httpx just as often as Express and fetch. If your resilience testing framework only speaks one language, half the ecosystem can’t use it.
But this isn’t a mechanical translation. Python has its own idioms — dataclasses instead of discriminated unions, threading.Lock instead of single-threaded event loops, asyncio.sleep alongside time.sleep, ASGI instead of Express. The Python implementation respects these idioms while maintaining API compatibility with the TypeScript version.
The result: you can use the same chaos scenarios, the same admin endpoints, and even the same JSON fault configs across both languages.
What Stays the Same
If you’ve read Part 2, you already know the core architecture. Here’s a quick recap of what carries over unchanged:
- 8 fault types: latency, error, timeout, malformed, connection-refused, connection-drop, rate-limit, schema-mismatch
- Transport-level injection: faults sit below the resilience layer, above the transport — so circuit breakers, retries, and timeouts trigger organically
- ChaosController singleton: central registry for active faults with auto-expiration, probability gating, and request counting
- Production safety guards: refuses to run when ENVIRONMENT=production or CHAOS_ENABLED is not true
- Zero runtime dependencies in the core: interceptors rely only on optional extras (httpx, redis, Starlette)
- Admin REST API + CLI for runtime fault management
- Scenario builder for reproducible chaos tests
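The safety guards deserve a closer look, since they are what make chaos injection safe to ship at all. Here is a minimal sketch of what a guard like assert_chaos_allowed() checks, based on the rules above; the real implementation may differ in detail:

```python
import os

class ChaosDisabledError(RuntimeError):
    """Raised when chaos injection is attempted in an unsafe environment."""

def assert_chaos_allowed() -> None:
    # Refuse to run in production, no matter what other flags say.
    if os.environ.get("ENVIRONMENT", "").lower() == "production":
        raise ChaosDisabledError("Chaos injection is disabled in production")
    # Require an explicit opt-in via CHAOS_ENABLED=true.
    if os.environ.get("CHAOS_ENABLED", "").lower() != "true":
        raise ChaosDisabledError("Set CHAOS_ENABLED=true to enable chaos injection")

os.environ["CHAOS_ENABLED"] = "true"
assert_chaos_allowed()  # passes: explicitly enabled, not production
```

Fail-closed defaults like this mean a chaos-aware service that accidentally ships to production simply refuses to inject anything.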
The architecture diagram from Part 2 applies directly; just swap the transport layer names (httpx for fetch, redis-py for ioredis, ASGI for Express).
Now let’s look at what’s different under the hood.
Dataclass Fault Types Instead of Discriminated Unions
In TypeScript, fault types are modeled as a discriminated union — a single FaultConfig type where the type field narrows the available fields at compile time. Python doesn't have discriminated unions, but it has something equally expressive: dataclasses.
Each fault type is its own dataclass with only the fields it needs:
```python
from dataclasses import dataclass, field

@dataclass
class LatencyFault:
    delay_ms: int
    probability: float | None = None
    type: str = field(default="latency", init=False)

@dataclass
class ErrorFault:
    status_code: int
    message: str | None = None
    probability: float | None = None
    type: str = field(default="error", init=False)

@dataclass
class TimeoutFault:
    hang_ms: int
    probability: float | None = None
    type: str = field(default="timeout", init=False)
```
The type field is set automatically via init=False — you can't accidentally create a LatencyFault with type="error". And the union type is a simple Python type alias:
```python
FaultConfig = (
    LatencyFault | ErrorFault | TimeoutFault | MalformedFault
    | ConnectionRefusedFault | ConnectionDropFault
    | RateLimitFault | SchemaMismatchFault
)
```
This gives you the same safety at the API boundary. Type checkers like mypy enforce that you provide delay_ms for a LatencyFault and status_code for an ErrorFault. Try passing the wrong fields and mypy catches it before you run a single test.
Thread-Safe ChaosController
This is the biggest architectural difference from TypeScript.
Node.js runs on a single-threaded event loop. JavaScript’s Map doesn't need locking because only one piece of code executes at a time. Python has real threads. If your MCP agent uses threading or concurrent.futures (and many do), concurrent access to the fault registry will corrupt it.
The Python ChaosController uses threading.Lock for every operation:
```python
from __future__ import annotations

import random
import string
import threading
import time

MAX_FAULTS = 1000  # memory bound for the fault registry

class ChaosController:
    _instance: ChaosController | None = None
    _instance_lock = threading.Lock()

    def __init__(self) -> None:
        assert_chaos_allowed()  # Safety guard: raises in production
        self._faults: dict[str, _ActiveFault] = {}
        self._lock = threading.Lock()

    @classmethod
    def get_instance(cls) -> ChaosController:
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def inject(self, target, config, duration_ms=None) -> str:
        suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
        now = time.time() * 1000
        fault_id = f"{target}-{int(now)}-{suffix}"
        with self._lock:
            self._sweep_expired(now)
            if len(self._faults) >= MAX_FAULTS:
                raise ValueError("Maximum number of active faults exceeded")
            self._faults[fault_id] = _ActiveFault(
                target=target, config=config,
                activated_at=now,
                expires_at=now + duration_ms if duration_ms is not None else None,
                request_count=0,
            )
        return fault_id
```
Key design decisions that differ from TypeScript:
- Double-checked locking on get_instance() — the classic pattern for thread-safe singletons in Python
- Separate locks for instance creation (_instance_lock) and fault operations (_lock) — avoids contention between first-access and normal operations
- Lock-protected sweep — expired faults are cleaned up inside the lock to prevent race conditions during iteration
- Memory bound — MAX_FAULTS = 1000 prevents unbounded growth if test code has a leak
The get_fault() method demonstrates why locking matters — it iterates the fault map, checks expiration, rolls probability dice, and increments a counter, all atomically:
```python
def get_fault(self, target: FaultTarget) -> FaultConfig | None:
    now = time.time() * 1000
    with self._lock:
        expired: list[str] = []
        matched_config: FaultConfig | None = None
        for fault_id, fault in self._faults.items():
            if fault.target != target:
                continue
            if fault.expires_at is not None and now > fault.expires_at:
                expired.append(fault_id)
                continue
            if matched_config is None:
                if (fault.config.probability is not None
                        and random.random() > fault.config.probability):
                    break  # First-match semantics: a failed roll means no fault fires
                fault.request_count += 1
                matched_config = fault.config
        for eid in expired:
            self._faults.pop(eid, None)
        return matched_config
```
Without the lock, two concurrent requests could both read request_count, both increment to the same value, and you'd lose a count. In a chaos test that asserts "the circuit breaker tripped after exactly 3 requests," that's a flaky test waiting to happen.
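The lost-update race is easy to reproduce with nothing but the standard library. This sketch (independent of the mcp-chaos-monkey codebase) uses time.sleep(0) to widen the race window so the loss becomes visible:

```python
import threading
import time

class Counter:
    def __init__(self) -> None:
        self.count = 0
        self._lock = threading.Lock()

    def increment_unsafe(self) -> None:
        current = self.count      # read
        time.sleep(0)             # yield to another thread, widening the race window
        self.count = current + 1  # write: may overwrite a concurrent increment

    def increment_safe(self) -> None:
        with self._lock:          # read-modify-write is atomic under the lock
            self.count += 1

def hammer(counter, method, n_threads=8, n_iters=500):
    threads = [
        threading.Thread(target=lambda: [method() for _ in range(n_iters)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count

safe = Counter()
safe_total = hammer(safe, safe.increment_safe)
print(safe_total)    # 4000: no updates lost

unsafe = Counter()
unsafe_total = hammer(unsafe, unsafe.increment_unsafe)
print(unsafe_total)  # usually fewer than 4000: increments were lost
```

The same failure mode applied to request_count would make probability gating and "fired after N requests" assertions unreliable.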
httpx Transport Wrapping Instead of Fetch Interception
The TypeScript version wraps fetch with a higher-order function. Python's MCP SDK uses httpx, which has a cleaner extension point: custom transports.
Instead of wrapping a function, we subclass httpx.AsyncBaseTransport:
```python
import httpx

class _ChaosAsyncTransport(httpx.AsyncBaseTransport):
    def __init__(self, target: FaultTarget, transport):
        self._target = target
        self._transport = transport

    async def handle_async_request(self, request):
        controller = ChaosController.get_instance()
        fault = controller.get_fault(self._target)
        if fault is None:
            return await self._transport.handle_async_request(request)
        return await _apply_async_fault(fault, request, self._transport)
```
The fault application uses Python’s match statement (Python 3.10+) — a natural fit for dispatching on fault type:
```python
import asyncio
import json

import httpx

async def _apply_async_fault(fault, request, transport):
    match fault.type:
        case "latency":
            await asyncio.sleep(fault.delay_ms / 1000)
            return await transport.handle_async_request(request)
        case "error":
            return httpx.Response(
                status_code=fault.status_code,
                content=json.dumps({"error": fault.message or "Chaos injected error"}).encode(),
                headers={"Content-Type": "application/json"},
                request=request,
            )
        case "timeout":
            await asyncio.sleep(fault.hang_ms / 1000)
            raise httpx.ReadTimeout("Chaos timeout", request=request)
        case "connection-refused":
            raise httpx.ConnectError("Chaos: connection refused", request=request)
        case "malformed":
            return httpx.Response(
                status_code=200,
                content=b"<<<CORRUPTED_RESPONSE>>>{{{{not json",
                headers={"Content-Type": "application/json"},
                request=request,
            )
        case "connection-drop":
            raise httpx.ReadError("Chaos: connection dropped", request=request)
        # rate-limit and schema-mismatch cases omitted here for brevity
```
Notice the exception types: httpx.ReadTimeout, httpx.ConnectError, httpx.ReadError. These are the exact same exceptions your code will see in production when httpx encounters real network failures. This is not mocking — this is producing identical error signatures.
A key advantage of the transport approach: both async and sync clients are supported. The TypeScript version only wraps async fetch. The Python version ships both _ChaosAsyncTransport and _ChaosSyncTransport, so you can chaos-test synchronous code too:
```python
import httpx

from mcp_chaos_monkey.interceptors import create_chaos_aware_client
from mcp_chaos_monkey.interceptors.http_interceptor import create_chaos_aware_client_sync

# Async
async_client = create_chaos_aware_client("weather-api", httpx.AsyncClient())

# Sync
sync_client = create_chaos_aware_client_sync("weather-api", httpx.Client())
```
Redis: Monkey-Patching Instead of Proxy
The TypeScript version wraps ioredis with a proxy object. Python’s redis-py library doesn't have a proxy-friendly API, so the Python interceptor uses monkey-patching — replacing methods on the client instance at runtime:
```python
unwrap = wrap_redis_with_chaos(redis_client, "redis")
# Now redis_client.get(), .set(), .delete(), etc. check for chaos faults

controller.inject("redis", ConnectionRefusedFault())
redis_client.get("key")  # → raises ConnectionError

# Restore original behavior
unwrap()
```
The wrap_redis_with_chaos function detects whether it's wrapping a sync redis.Redis or async redis.asyncio.Redis client and patches accordingly. It returns an unwrap function for clean teardown — essential for test isolation.
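The wrap-and-restore pattern generalizes beyond Redis. Here is a stdlib-only sketch of the technique, using a hypothetical FakeCache stand-in for the client: patch a method on one instance, and return an unwrap closure for teardown.

```python
from typing import Any, Callable

class FakeCache:
    """Hypothetical stand-in for a redis-like client."""
    def get(self, key: str) -> str:
        return f"value:{key}"

def wrap_with_chaos(client: Any, should_fail: Callable[[], bool]) -> Callable[[], None]:
    original_get = client.get  # keep the bound method for later restore

    def chaotic_get(key: str) -> str:
        if should_fail():
            raise ConnectionError("Chaos: connection refused")
        return original_get(key)

    # Patch the method on this instance only, not on the class.
    client.get = chaotic_get

    def unwrap() -> None:
        client.get = original_get  # restore for clean test isolation

    return unwrap

cache = FakeCache()
unwrap = wrap_with_chaos(cache, should_fail=lambda: True)
try:
    cache.get("k")
except ConnectionError as exc:
    print(exc)           # Chaos: connection refused
unwrap()
print(cache.get("k"))    # value:k
```

Patching the instance rather than the class keeps the blast radius to a single client, which matters when tests run in parallel.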
ASGI Middleware Instead of Express Middleware
Where TypeScript uses Express middleware for auth fault injection, Python uses ASGI middleware — the universal interface for Starlette, FastAPI, and any ASGI-compatible framework:
```python
from starlette.applications import Starlette

from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware

app = Starlette(routes=[...])
app = ChaosAuthMiddleware(app, target="oauth-token")

# Inject auth failures
controller.inject("oauth-token", ErrorFault(status_code=401, message="Token expired"))
# All requests → 401 {"error": "token_invalid", "message": "Token expired"}

# Inject auth latency
controller.inject("oauth-token", LatencyFault(delay_ms=2000))
# All requests → 2s delay, then forwarded normally
```
With FastAPI, you can stack multiple auth targets:
```python
from fastapi import FastAPI

from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware

app = FastAPI()
app.add_middleware(ChaosAuthMiddleware, target="oauth-token")
app.add_middleware(ChaosAuthMiddleware, target="api-key")
```
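The ASGI middleware pattern itself is small. Here is a generic sketch of the shape (not the actual ChaosAuthMiddleware code) that short-circuits requests with a canned 401 before the inner app ever sees them:

```python
import asyncio
import json

class ShortCircuitMiddleware:
    """Minimal ASGI middleware sketch: answers 401 itself when active."""

    def __init__(self, app, active: bool = False) -> None:
        self.app = app
        self.active = active

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http" or not self.active:
            # Pass through untouched when the fault is not active
            await self.app(scope, receive, send)
            return
        body = json.dumps({"error": "token_invalid"}).encode()
        await send({
            "type": "http.response.start",
            "status": 401,
            "headers": [(b"content-type", b"application/json")],
        })
        await send({"type": "http.response.body", "body": body})

async def inner_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

async def call(app):
    sent = []
    async def send(message):
        sent.append(message)
    async def receive():
        return {"type": "http.request"}
    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent

blocked = asyncio.run(call(ShortCircuitMiddleware(inner_app, active=True)))
passed = asyncio.run(call(ShortCircuitMiddleware(inner_app, active=False)))
print(blocked[0]["status"], passed[0]["status"])  # 401 200
```

Because ASGI is just an awaitable call convention, the same wrapper class works for Starlette, FastAPI, or any other ASGI framework with no adapter code.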
Cross-Language Compatibility
Since many teams run both TypeScript and Python MCP services, the fault config format is intentionally compatible. The parse_fault_config() function accepts both camelCase (TypeScript convention) and snake_case (Python convention):
```python
from mcp_chaos_monkey import parse_fault_config

# Both produce the same LatencyFault(delay_ms=2000)
fault = parse_fault_config({"type": "latency", "delayMs": 2000})   # camelCase
fault = parse_fault_config({"type": "latency", "delay_ms": 2000})  # snake_case
```
This means a shared admin API or CI pipeline can inject faults into both TypeScript and Python services using the same JSON payloads.
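Accepting both conventions takes very little code. A sketch of how such key normalization might work (not the actual parse_fault_config internals):

```python
import re

def camel_to_snake(key: str) -> str:
    # delayMs -> delay_ms, statusCode -> status_code
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()

def normalize_keys(config: dict) -> dict:
    """Rewrite every camelCase key to snake_case; snake_case passes through."""
    return {camel_to_snake(k): v for k, v in config.items()}

print(normalize_keys({"type": "latency", "delayMs": 2000}))
# {'type': 'latency', 'delay_ms': 2000}
print(normalize_keys({"type": "latency", "delay_ms": 2000}))
# {'type': 'latency', 'delay_ms': 2000}
```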
Starlette Admin Endpoints
The admin API ships as framework-agnostic handler functions plus optional Starlette route registration:
```python
from starlette.applications import Starlette

from mcp_chaos_monkey.admin_endpoint import create_starlette_routes

routes = create_starlette_routes()
app = Starlette(routes=routes)
```
This registers four routes, one per admin operation: status, inject, clear, and clear-all.
Admin endpoints are protected by a timing-safe token comparison using hmac.compare_digest — set CHAOS_ADMIN_TOKEN to enable authentication.
The handler functions (handle_status, handle_inject, handle_clear, handle_clear_all) are framework-agnostic, so you can wire them into Django, Flask, or any other framework with minimal glue code.
CLI and Scenario Builder
These work identically to the TypeScript version. The CLI installs as a console script:
pip install mcp-chaos-monkey
export CHAOS_ENABLED=true
mcp-chaos inject weather-api error --status 503
mcp-chaos inject redis latency --delay 2000 --duration 30
mcp-chaos status
mcp-chaos clear-all
The scenario builder uses Python dataclasses instead of TypeScript objects, but the shape is the same:
```python
from mcp_chaos_monkey import define_scenario, ScenarioFault, TimeoutFault, ErrorFault

api_timeout = define_scenario(
    name="weather-api-timeout",
    description="Weather API hangs for 10s — timeout fires, retries exhaust",
    faults=[
        ScenarioFault(target="weather-api", config=TimeoutFault(hang_ms=10_000)),
    ],
    expected_behavior="Each attempt times out at 5s. Retries fire and all fail.",
    assertions=[
        "Each retry attempt hits timeout at 5s",
        "Total latency bounded by retry * timeout",
    ],
)
```
Pluggable Logger
The TypeScript version ships with a pino-compatible logger interface. The Python version uses a Protocol class that matches Python's stdlib logging.Logger:
```python
import structlog

from mcp_chaos_monkey import configure_chaos_logger

# structlog loggers satisfy the ChaosLogger protocol directly
configure_chaos_logger(lambda name: structlog.get_logger(name))
```
No adapter needed — any logger with debug, info, warning, and error methods works out of the box.
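Structural typing via typing.Protocol is what makes this work without adapters. A sketch of what a ChaosLogger protocol could look like (the published protocol may differ in detail):

```python
import logging
from typing import Protocol, runtime_checkable

@runtime_checkable
class ChaosLogger(Protocol):
    def debug(self, msg: str, *args: object) -> None: ...
    def info(self, msg: str, *args: object) -> None: ...
    def warning(self, msg: str, *args: object) -> None: ...
    def error(self, msg: str, *args: object) -> None: ...

# Structural typing: stdlib loggers match with no registration or inheritance
print(isinstance(logging.getLogger("chaos"), ChaosLogger))  # True

class NotALogger:
    def info(self, msg): ...

# Missing debug/warning/error, so it does not satisfy the protocol
print(isinstance(NotALogger(), ChaosLogger))  # False
```

runtime_checkable only verifies that the methods exist, not their signatures, so mypy still does the precise checking at the call sites.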
Get Started
```shell
pip install mcp-chaos-monkey
```

```python
import os
os.environ["CHAOS_ENABLED"] = "true"

from mcp_chaos_monkey import ChaosController, ErrorFault

controller = ChaosController.get_instance()
fault_id = controller.inject("weather-api", ErrorFault(status_code=503))

fault = controller.get_fault("weather-api")
# → ErrorFault(status_code=503, message=None)

controller.clear(fault_id)
```
- GitHub: github.com/alexey-tyurin/mcp-chaos-monkey
- PyPI: pypi.org/project/mcp-chaos-monkey
- Production example: github.com/alexey-tyurin/reliable-mcp
- Part 1: MCP Reliability Playbook
- Part 2: I Built a Chaos Monkey for MCP — Here’s Why and How
If you’re building MCP agents in Python and care about what happens when things go wrong — give it a try. Issues, PRs, and stars are welcome.
Alexey Tyurin is the author of mcp-chaos-monkey and reliable-mcp. He writes about building resilient AI systems with MCP.
mcp-chaos-monkey Now Speaks Python was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
