MCP Reliability Series — Part 3: mcp-chaos-monkey for Python

In Part 2 of this series, I introduced mcp-chaos-monkey — a fault injection framework purpose-built for MCP pipelines. That article covered the design philosophy, the Netflix Chaos Monkey inspiration, and the core architecture using Node.js/TypeScript. If you haven’t read it yet, I recommend starting there.
This article covers what’s new: a native Python implementation of mcp-chaos-monkey — same design, same 8 fault types, same API shape, but built from the ground up for the Python ecosystem.
Why a Python Port?
The MCP ecosystem is bilingual. The official MCP SDK ships in both TypeScript and Python. Teams building MCP agents use FastAPI, Starlette, and httpx just as often as Express and fetch. If your resilience testing framework only speaks one language, half the ecosystem can’t use it.
But this isn’t a mechanical translation. Python has its own idioms — dataclasses instead of discriminated unions, threading.Lock instead of single-threaded event loops, asyncio.sleep alongside time.sleep, ASGI instead of Express. The Python implementation respects these idioms while maintaining API compatibility with the TypeScript version.
The result: you can use the same chaos scenarios, the same admin endpoints, and even the same JSON fault configs across both languages.
What Stays the Same
If you’ve read Part 2, you already know the core architecture. Here’s a quick recap of what carries over unchanged:
- 8 fault types: latency, error, timeout, malformed, connection-refused, connection-drop, rate-limit, schema-mismatch
- Transport-level injection: faults sit below the resilience layer, above the transport — so circuit breakers, retries, and timeouts trigger organically
- ChaosController singleton: central registry for active faults with auto-expiration, probability gating, and request counting
- Production safety guards: refuses to run when ENVIRONMENT=production or CHAOS_ENABLED is not true
- Zero runtime dependencies in the core: interceptors rely only on optional extras (httpx, redis, Starlette)
- Admin REST API + CLI for runtime fault management
- Scenario builder for reproducible chaos tests
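The safety guards deserve a closer look, since they are what make chaos injection safe to ship at all. Here is a minimal sketch of what a guard like assert_chaos_allowed() checks, based on the rules above; the real implementation may differ in detail:

```python
import os

class ChaosDisabledError(RuntimeError):
    """Raised when chaos injection is attempted in an unsafe environment."""

def assert_chaos_allowed() -> None:
    # Refuse to run in production, no matter what other flags say.
    if os.environ.get("ENVIRONMENT", "").lower() == "production":
        raise ChaosDisabledError("Chaos injection is disabled in production")
    # Require an explicit opt-in via CHAOS_ENABLED=true.
    if os.environ.get("CHAOS_ENABLED", "").lower() != "true":
        raise ChaosDisabledError("Set CHAOS_ENABLED=true to enable chaos injection")

os.environ["CHAOS_ENABLED"] = "true"
assert_chaos_allowed()  # passes: explicitly enabled, not production
```

Fail-closed defaults like this mean a chaos-aware service that accidentally ships to production simply refuses to inject anything.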
The architecture diagram from Part 2 applies directly; just swap the transport layer names (httpx for fetch, redis-py for ioredis, ASGI for Express).
Now let’s look at what’s different under the hood.
Dataclass Fault Types Instead of Discriminated Unions
In TypeScript, fault types are modeled as a discriminated union — a single FaultConfig type where the type field narrows the available fields at compile time. Python doesn't have discriminated unions, but it has something equally expressive: dataclasses.
Each fault type is its own dataclass with only the fields it needs:
```python
from dataclasses import dataclass, field

@dataclass
class LatencyFault:
    delay_ms: int
    probability: float | None = None
    type: str = field(default="latency", init=False)

@dataclass
class ErrorFault:
    status_code: int
    message: str | None = None
    probability: float | None = None
    type: str = field(default="error", init=False)

@dataclass
class TimeoutFault:
    hang_ms: int
    probability: float | None = None
    type: str = field(default="timeout", init=False)
```
The type field is set automatically via init=False — you can't accidentally create a LatencyFault with type="error". And the union type is a simple Python type alias:
```python
FaultConfig = (
    LatencyFault | ErrorFault | TimeoutFault | MalformedFault
    | ConnectionRefusedFault | ConnectionDropFault
    | RateLimitFault | SchemaMismatchFault
)
```
This gives you the same safety at the API boundary. Type checkers like mypy enforce that you provide delay_ms for a LatencyFault and status_code for an ErrorFault. Try passing the wrong fields and mypy catches it before you run a single test.
Thread-Safe ChaosController
This is the biggest architectural difference from TypeScript.
Node.js runs on a single-threaded event loop. JavaScript’s Map doesn't need locking because only one piece of code executes at a time. Python has real threads. If your MCP agent uses threading or concurrent.futures (and many do), concurrent access to the fault registry will corrupt it.
The Python ChaosController uses threading.Lock for every operation:
```python
from __future__ import annotations

import random
import string
import threading
import time

MAX_FAULTS = 1000  # memory bound for the fault registry

class ChaosController:
    _instance: ChaosController | None = None
    _instance_lock = threading.Lock()

    def __init__(self) -> None:
        assert_chaos_allowed()  # Safety guard: raises in production
        self._faults: dict[str, _ActiveFault] = {}
        self._lock = threading.Lock()

    @classmethod
    def get_instance(cls) -> ChaosController:
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def inject(self, target, config, duration_ms=None) -> str:
        suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
        now = time.time() * 1000
        fault_id = f"{target}-{int(now)}-{suffix}"
        with self._lock:
            self._sweep_expired(now)
            if len(self._faults) >= MAX_FAULTS:
                raise ValueError("Maximum number of active faults exceeded")
            self._faults[fault_id] = _ActiveFault(
                target=target, config=config,
                activated_at=now,
                expires_at=now + duration_ms if duration_ms is not None else None,
                request_count=0,
            )
        return fault_id
```
Key design decisions that differ from TypeScript:
- Double-checked locking on get_instance() — the classic pattern for thread-safe singletons in Python
- Separate locks for instance creation (_instance_lock) and fault operations (_lock) — avoids contention between first-access and normal operations
- Lock-protected sweep — expired faults are cleaned up inside the lock to prevent race conditions during iteration
- Memory bound — MAX_FAULTS = 1000 prevents unbounded growth if test code has a leak
The get_fault() method demonstrates why locking matters — it iterates the fault map, checks expiration, rolls probability dice, and increments a counter, all atomically:
```python
def get_fault(self, target: FaultTarget) -> FaultConfig | None:
    now = time.time() * 1000
    with self._lock:
        expired: list[str] = []
        matched_config: FaultConfig | None = None
        for fault_id, fault in self._faults.items():
            if fault.target != target:
                continue
            if fault.expires_at is not None and now > fault.expires_at:
                expired.append(fault_id)
                continue
            if matched_config is None:
                if (fault.config.probability is not None
                        and random.random() > fault.config.probability):
                    break  # First-match semantics: a failed roll means no fault fires
                fault.request_count += 1
                matched_config = fault.config
        for eid in expired:
            self._faults.pop(eid, None)
        return matched_config
```
Without the lock, two concurrent requests could both read request_count, both increment to the same value, and you'd lose a count. In a chaos test that asserts "the circuit breaker tripped after exactly 3 requests," that's a flaky test waiting to happen.
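The lost-update race is easy to reproduce with nothing but the standard library. This sketch (independent of the mcp-chaos-monkey codebase) uses time.sleep(0) to widen the race window so the loss becomes visible:

```python
import threading
import time

class Counter:
    def __init__(self) -> None:
        self.count = 0
        self._lock = threading.Lock()

    def increment_unsafe(self) -> None:
        current = self.count      # read
        time.sleep(0)             # yield to another thread, widening the race window
        self.count = current + 1  # write: may overwrite a concurrent increment

    def increment_safe(self) -> None:
        with self._lock:          # read-modify-write is atomic under the lock
            self.count += 1

def hammer(counter, method, n_threads=8, n_iters=500):
    threads = [
        threading.Thread(target=lambda: [method() for _ in range(n_iters)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count

safe = Counter()
safe_total = hammer(safe, safe.increment_safe)
print(safe_total)    # 4000: no updates lost

unsafe = Counter()
unsafe_total = hammer(unsafe, unsafe.increment_unsafe)
print(unsafe_total)  # usually fewer than 4000: increments were lost
```

The same failure mode applied to request_count would make probability gating and "fired after N requests" assertions unreliable.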
httpx Transport Wrapping Instead of Fetch Interception
The TypeScript version wraps fetch with a higher-order function. Python's MCP SDK uses httpx, which has a cleaner extension point: custom transports.
Instead of wrapping a function, we subclass httpx.AsyncBaseTransport:
```python
import httpx

class _ChaosAsyncTransport(httpx.AsyncBaseTransport):
    def __init__(self, target: FaultTarget, transport):
        self._target = target
        self._transport = transport

    async def handle_async_request(self, request):
        controller = ChaosController.get_instance()
        fault = controller.get_fault(self._target)
        if fault is None:
            return await self._transport.handle_async_request(request)
        return await _apply_async_fault(fault, request, self._transport)
```
The fault application uses Python’s match statement (Python 3.10+) — a natural fit for dispatching on fault type:
```python
import asyncio
import json

import httpx

async def _apply_async_fault(fault, request, transport):
    match fault.type:
        case "latency":
            await asyncio.sleep(fault.delay_ms / 1000)
            return await transport.handle_async_request(request)
        case "error":
            return httpx.Response(
                status_code=fault.status_code,
                content=json.dumps({"error": fault.message or "Chaos injected error"}).encode(),
                headers={"Content-Type": "application/json"},
                request=request,
            )
        case "timeout":
            await asyncio.sleep(fault.hang_ms / 1000)
            raise httpx.ReadTimeout("Chaos timeout", request=request)
        case "connection-refused":
            raise httpx.ConnectError("Chaos: connection refused", request=request)
        case "malformed":
            return httpx.Response(
                status_code=200,
                content=b"<<<CORRUPTED_RESPONSE>>>{{{{not json",
                headers={"Content-Type": "application/json"},
                request=request,
            )
        case "connection-drop":
            raise httpx.ReadError("Chaos: connection dropped", request=request)
        # rate-limit and schema-mismatch cases omitted here for brevity
```
Notice the exception types: httpx.ReadTimeout, httpx.ConnectError, httpx.ReadError. These are the exact same exceptions your code will see in production when httpx encounters real network failures. This is not mocking — this is producing identical error signatures.
A key advantage of the transport approach: both async and sync clients are supported. The TypeScript version only wraps async fetch. The Python version ships both _ChaosAsyncTransport and _ChaosSyncTransport, so you can chaos-test synchronous code too:
```python
import httpx

from mcp_chaos_monkey.interceptors import create_chaos_aware_client
from mcp_chaos_monkey.interceptors.http_interceptor import create_chaos_aware_client_sync

# Async
async_client = create_chaos_aware_client("weather-api", httpx.AsyncClient())

# Sync
sync_client = create_chaos_aware_client_sync("weather-api", httpx.Client())
```
Redis: Monkey-Patching Instead of Proxy
The TypeScript version wraps ioredis with a proxy object. Python’s redis-py library doesn't have a proxy-friendly API, so the Python interceptor uses monkey-patching — replacing methods on the client instance at runtime:
```python
unwrap = wrap_redis_with_chaos(redis_client, "redis")
# Now redis_client.get(), .set(), .delete(), etc. check for chaos faults

controller.inject("redis", ConnectionRefusedFault())
redis_client.get("key")  # → raises ConnectionError

# Restore original behavior
unwrap()
```
The wrap_redis_with_chaos function detects whether it's wrapping a sync redis.Redis or async redis.asyncio.Redis client and patches accordingly. It returns an unwrap function for clean teardown — essential for test isolation.
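The wrap-and-restore pattern generalizes beyond Redis. Here is a stdlib-only sketch of the technique, using a hypothetical FakeCache stand-in for the client: patch a method on one instance, and return an unwrap closure for teardown.

```python
from typing import Any, Callable

class FakeCache:
    """Hypothetical stand-in for a redis-like client."""
    def get(self, key: str) -> str:
        return f"value:{key}"

def wrap_with_chaos(client: Any, should_fail: Callable[[], bool]) -> Callable[[], None]:
    original_get = client.get  # keep the bound method for later restore

    def chaotic_get(key: str) -> str:
        if should_fail():
            raise ConnectionError("Chaos: connection refused")
        return original_get(key)

    # Patch the method on this instance only, not on the class.
    client.get = chaotic_get

    def unwrap() -> None:
        client.get = original_get  # restore for clean test isolation

    return unwrap

cache = FakeCache()
unwrap = wrap_with_chaos(cache, should_fail=lambda: True)
try:
    cache.get("k")
except ConnectionError as exc:
    print(exc)           # Chaos: connection refused
unwrap()
print(cache.get("k"))    # value:k
```

Patching the instance rather than the class keeps the blast radius to a single client, which matters when tests run in parallel.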
ASGI Middleware Instead of Express Middleware
Where TypeScript uses Express middleware for auth fault injection, Python uses ASGI middleware — the universal interface for Starlette, FastAPI, and any ASGI-compatible framework:
```python
from starlette.applications import Starlette

from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware

app = Starlette(routes=[...])
app = ChaosAuthMiddleware(app, target="oauth-token")

# Inject auth failures
controller.inject("oauth-token", ErrorFault(status_code=401, message="Token expired"))
# All requests → 401 {"error": "token_invalid", "message": "Token expired"}

# Inject auth latency
controller.inject("oauth-token", LatencyFault(delay_ms=2000))
# All requests → 2s delay, then forwarded normally
```
With FastAPI, you can stack multiple auth targets:
```python
from fastapi import FastAPI

from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware

app = FastAPI()
app.add_middleware(ChaosAuthMiddleware, target="oauth-token")
app.add_middleware(ChaosAuthMiddleware, target="api-key")
```
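The ASGI middleware pattern itself is small. Here is a generic sketch of the shape (not the actual ChaosAuthMiddleware code) that short-circuits requests with a canned 401 before the inner app ever sees them:

```python
import asyncio
import json

class ShortCircuitMiddleware:
    """Minimal ASGI middleware sketch: answers 401 itself when active."""

    def __init__(self, app, active: bool = False) -> None:
        self.app = app
        self.active = active

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http" or not self.active:
            # Pass through untouched when the fault is not active
            await self.app(scope, receive, send)
            return
        body = json.dumps({"error": "token_invalid"}).encode()
        await send({
            "type": "http.response.start",
            "status": 401,
            "headers": [(b"content-type", b"application/json")],
        })
        await send({"type": "http.response.body", "body": body})

async def inner_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

async def call(app):
    sent = []
    async def send(message):
        sent.append(message)
    async def receive():
        return {"type": "http.request"}
    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent

blocked = asyncio.run(call(ShortCircuitMiddleware(inner_app, active=True)))
passed = asyncio.run(call(ShortCircuitMiddleware(inner_app, active=False)))
print(blocked[0]["status"], passed[0]["status"])  # 401 200
```

Because ASGI is just an awaitable call convention, the same wrapper class works for Starlette, FastAPI, or any other ASGI framework with no adapter code.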
Cross-Language Compatibility
Since many teams run both TypeScript and Python MCP services, the fault config format is intentionally compatible. The parse_fault_config() function accepts both camelCase (TypeScript convention) and snake_case (Python convention):
```python
from mcp_chaos_monkey import parse_fault_config

# Both produce the same LatencyFault(delay_ms=2000)
fault = parse_fault_config({"type": "latency", "delayMs": 2000})   # camelCase
fault = parse_fault_config({"type": "latency", "delay_ms": 2000})  # snake_case
```
This means a shared admin API or CI pipeline can inject faults into both TypeScript and Python services using the same JSON payloads.
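Accepting both conventions takes very little code. A sketch of how such key normalization might work (not the actual parse_fault_config internals):

```python
import re

def camel_to_snake(key: str) -> str:
    # delayMs -> delay_ms, statusCode -> status_code
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()

def normalize_keys(config: dict) -> dict:
    """Rewrite every camelCase key to snake_case; snake_case passes through."""
    return {camel_to_snake(k): v for k, v in config.items()}

print(normalize_keys({"type": "latency", "delayMs": 2000}))
# {'type': 'latency', 'delay_ms': 2000}
print(normalize_keys({"type": "latency", "delay_ms": 2000}))
# {'type': 'latency', 'delay_ms': 2000}
```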
Starlette Admin Endpoints
The admin API ships as framework-agnostic handler functions plus optional Starlette route registration:
```python
from starlette.applications import Starlette

from mcp_chaos_monkey.admin_endpoint import create_starlette_routes

routes = create_starlette_routes()
app = Starlette(routes=routes)
```
This registers four routes, one per admin operation: status, inject, clear, and clear-all.
Admin endpoints are protected by a timing-safe token comparison using hmac.compare_digest — set CHAOS_ADMIN_TOKEN to enable authentication.
The handler functions (handle_status, handle_inject, handle_clear, handle_clear_all) are framework-agnostic, so you can wire them into Django, Flask, or any other framework with minimal glue code.
CLI and Scenario Builder
These work identically to the TypeScript version. The CLI installs as a console script:
pip install mcp-chaos-monkey
export CHAOS_ENABLED=true
mcp-chaos inject weather-api error --status 503
mcp-chaos inject redis latency --delay 2000 --duration 30
mcp-chaos status
mcp-chaos clear-all
The scenario builder uses Python dataclasses instead of TypeScript objects, but the shape is the same:
```python
from mcp_chaos_monkey import define_scenario, ScenarioFault, TimeoutFault, ErrorFault

api_timeout = define_scenario(
    name="weather-api-timeout",
    description="Weather API hangs for 10s — timeout fires, retries exhaust",
    faults=[
        ScenarioFault(target="weather-api", config=TimeoutFault(hang_ms=10_000)),
    ],
    expected_behavior="Each attempt times out at 5s. Retries fire and all fail.",
    assertions=[
        "Each retry attempt hits timeout at 5s",
        "Total latency bounded by retry * timeout",
    ],
)
```
Pluggable Logger
The TypeScript version ships with a pino-compatible logger interface. The Python version uses a Protocol class that matches Python's stdlib logging.Logger:
```python
import structlog

from mcp_chaos_monkey import configure_chaos_logger

# structlog loggers satisfy the ChaosLogger protocol directly
configure_chaos_logger(lambda name: structlog.get_logger(name))
```
No adapter needed — any logger with debug, info, warning, and error methods works out of the box.
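Structural typing via typing.Protocol is what makes this work without adapters. A sketch of what a ChaosLogger protocol could look like (the published protocol may differ in detail):

```python
import logging
from typing import Protocol, runtime_checkable

@runtime_checkable
class ChaosLogger(Protocol):
    def debug(self, msg: str, *args: object) -> None: ...
    def info(self, msg: str, *args: object) -> None: ...
    def warning(self, msg: str, *args: object) -> None: ...
    def error(self, msg: str, *args: object) -> None: ...

# Structural typing: stdlib loggers match with no registration or inheritance
print(isinstance(logging.getLogger("chaos"), ChaosLogger))  # True

class NotALogger:
    def info(self, msg): ...

# Missing debug/warning/error, so it does not satisfy the protocol
print(isinstance(NotALogger(), ChaosLogger))  # False
```

runtime_checkable only verifies that the methods exist, not their signatures, so mypy still does the precise checking at the call sites.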
Get Started
```shell
pip install mcp-chaos-monkey
```

```python
import os
os.environ["CHAOS_ENABLED"] = "true"

from mcp_chaos_monkey import ChaosController, ErrorFault

controller = ChaosController.get_instance()
fault_id = controller.inject("weather-api", ErrorFault(status_code=503))

fault = controller.get_fault("weather-api")
# → ErrorFault(status_code=503, message=None)

controller.clear(fault_id)
```
- GitHub: github.com/alexey-tyurin/mcp-chaos-monkey
- PyPI: pypi.org/project/mcp-chaos-monkey
- Production example: github.com/alexey-tyurin/reliable-mcp
- Part 1: MCP Reliability Playbook
- Part 2: I Built a Chaos Monkey for MCP — Here’s Why and How
If you’re building MCP agents in Python and care about what happens when things go wrong — give it a try. Issues, PRs, and stars are welcome.
Alexey Tyurin is the author of mcp-chaos-monkey and reliable-mcp. He writes about building resilient AI systems with MCP.
mcp-chaos-monkey Now Speaks Python was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
