Operational SRE/DBA Flywheel: Ops Knowledge That Compounds. Automatically. Improving with every incident.

Not every Ops team openly admits it, but most rediscover the same incidents. The same fix for an exhausted max_connections limit gets worked out from scratch at 2am, independently, by three different engineers over eighteen months. The knowledge exists. It lives in a Slack thread from last October, in a postmortem PDF nobody reads, and in the head of the one engineer who handled it twice before.
This story should sound very familiar, and it raises a few questions:
- What if there was a way for every incident to make the next one easier?
- What if there was an Ops library that could write itself?
- What if there was a loop to make your AI SRE smarter with every incident?
Or to summarize, what if there was a mechanism to transform an incident into institutional memory? Automatically. Making that memory smarter with every incident.
That’s what aiHelpDesk’s Vault is.
What the Vault Is
The Vault is where all Incidents, Faults and Playbooks live. A Playbook is not a runbook. In particular, it’s not a static sequence of steps written once and followed literally. Instead, a Playbook encodes intent and expertise that the fleet planner uses to generate an execution plan dynamically, against the current state of your infrastructure. The same Playbook produces different steps when your database configuration differs or when new tools are available. No stale scripts.
They say the only constant in life is change, and we contend that this applies more than ever to modern IT Operations. We believe the coming avalanche of AI-generated apps hitting your Production systems is about to make the rate of change explode. Are you ready? aiHelpDesk can help because Playbooks evolve and stay aligned with your database and infrastructure changes.
In the aiHelpDesk Beta, we ship seven system Playbooks out of the box, covering the most common PostgreSQL triage scenarios: connection exhaustion, slow queries, lock contention, vacuum bloat, replication lag, and database-down recovery. They’re in the Vault on day one, as the startup log shows:
INFO msg="playbooks: seeded system playbook" name="Connection & Lock Triage" series=pbs_connection_triage version=1.0 active=true
INFO msg="playbooks: seeded system playbook" name="Database Down — Configuration Recovery" series=pbs_db_config_recovery version=1.1 active=true
INFO msg="playbooks: seeded system playbook" name="Database Down — Backup Restore & PITR" series=pbs_db_pitr_recovery version=1.1 active=true
INFO msg="playbooks: seeded system playbook" name="Database Down — Restart Triage" series=pbs_db_restart_triage version=1.1 active=true
INFO msg="playbooks: seeded system playbook" name="Replication Lag Triage" series=pbs_replication_lag version=1.0 active=true
INFO msg="playbooks: seeded system playbook" name="Slow Query Triage" series=pbs_slow_query_triage version=1.0 active=true
INFO msg="playbooks: seeded system playbook" name="Vacuum & Bloat Triage" series=pbs_vacuum_triage version=1.0 active=true
INFO msg="playbooks: seed complete" seeded=7 skipped=0
But what makes the Vault interesting is what happens after day one.
The Operational SRE/DBA Flywheel
This is a simplified (and less artistic) diagram of what we presented earlier:
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│   Fault                  Agent diagnoses           Playbook      │
│   (injected or real) ──► correctly ──────────────► remediates    │
│        ▲                                               │         │
│        │                                               │         │
│        │                                               ▼         │
│   Library improves ◄── Human approves ◄── Draft auto-saved       │
│   (activated)          (Vault review)     to Vault               │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Two paths feed drafts into the Vault automatically:
⚙️ The faulttest remediation mode: Inject a known fault against your staging database, let the agent diagnose it, trigger the linked Playbook, verify recovery. If it passes, the tool trace of what actually worked is synthesized into a Playbook draft and saved to the Vault.
🚨 The Incident agent: When a real production incident resolves and the agent calls create_incident_bundle with outcome="resolved", the same synthesis happens from the live audit trail. No extra steps. It fires automatically as a side-effect of closing the incident.
In both cases: the system produces a draft. A human reviews and activates it. Nothing is ever promoted to production use without that gate.
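To make the draft gate concrete, here is a minimal Python sketch of what turning a tool trace into an inactive draft could look like. It is an illustration only, not aiHelpDesk's synthesis logic: the function name `synthesize_draft`, the trace layout, and the tool names are hypothetical, and the formatting is deliberately naive. The `source`, `is_active`, and `guidance` fields mirror the markers shown on saved drafts later in this article.

```python
from typing import Dict, List


def synthesize_draft(name: str, trace: List[Dict[str, str]]) -> Dict[str, object]:
    """Turn a recorded tool trace into an inactive draft (naive formatting only)."""
    steps = [
        f"{i}. {call['tool']}: {call['summary']}"
        for i, call in enumerate(trace, start=1)
    ]
    return {
        "name": name,
        "source": "generated",  # drafts are marked as machine-generated
        "is_active": False,     # a draft never starts active; a human activates it
        "guidance": "\n".join(steps),
    }


# Hypothetical trace: tool names and summaries are made up for illustration.
trace = [
    {"tool": "check_connections", "summary": "connection count at the configured limit"},
    {"tool": "terminate_idle", "summary": "closed idle sessions to free slots"},
]
draft = synthesize_draft("Connection exhaustion (draft)", trace)
print(draft["guidance"])
```

The point of the sketch is the default: whichever path produced the trace, the result is saved inactive, and only a separate, human-initiated step changes that.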
Step 1: Check Your Current Coverage
vault list shows the full fault catalog alongside the linked Playbooks and the last run result for your target database server:
./faulttest vault list \
  --gateway http://localhost:8080 \
  --api-key $HELPDESK_CLIENT_API_KEY
FAULT                   PLAYBOOK               LAST RUN    STATUS
----------------------  ---------------------  ----------  ------
db-max-connections      pbs_connection_triage  2026-04-16  PASS
db-lock-contention      pbs_connection_triage  2026-04-14  FAIL
db-long-running-query   pbs_slow_query_triage  (never)     -
db-idle-in-transaction  pbs_connection_triage  (never)     -
db-table-bloat          pbs_vacuum_triage      (never)     -
db-replication-lag      pbs_replication_lag    (never)     -
db-connection-refused   pbs_db_restart_triage  2026-04-15  PASS
On a fresh deployment with the seeded system Playbooks, most faults show “-” as their status, which means that a Playbook is linked to the fault but has never been validated against your specific environment. That’s your starting point. The “-” entries are not coverage gaps. Instead, they should be viewed as opportunities.
┌────────────────────┬───────────────────────────────────────────────────────┐
│ Status             │ Meaning                                               │
├────────────────────┼───────────────────────────────────────────────────────┤
│ PASS               │ Fault injected, agent diagnosed correctly,            │
│                    │ Playbook remediated, recovery verified                │
├────────────────────┼───────────────────────────────────────────────────────┤
│ FAIL               │ Something in the chain didn't work                    │
├────────────────────┼───────────────────────────────────────────────────────┤
│ -                  │ Playbook linked, but never run against this target    │
├────────────────────┼───────────────────────────────────────────────────────┤
│ NO PLAYBOOK        │ No remediation linked in the fault catalog            │
├────────────────────┼───────────────────────────────────────────────────────┤
│ PLAYBOOK NOT FOUND │ Series ID configured but not registered on the Gateway│
└────────────────────┴───────────────────────────────────────────────────────┘
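The status table reads naturally as a small decision function. The Python sketch below is one way to express it, under the assumption that the checks happen in this order. It is not faulttest's implementation, and the function name `classify` and its parameters are hypothetical.

```python
from typing import Optional, Set


def classify(playbook: Optional[str], registered: Set[str],
             last_passed: Optional[bool]) -> str:
    """Map a fault's catalog entry and last run result to a status label."""
    if playbook is None:
        return "NO PLAYBOOK"          # nothing linked in the fault catalog
    if playbook not in registered:
        return "PLAYBOOK NOT FOUND"   # series configured but unknown to the Gateway
    if last_passed is None:
        return "-"                    # linked, but never run against this target
    return "PASS" if last_passed else "FAIL"


registered = {"pbs_connection_triage", "pbs_slow_query_triage"}
print(classify("pbs_connection_triage", registered, True))
print(classify("pbs_slow_query_triage", registered, None))
print(classify("pbs_vacuum_triage", registered, None))
print(classify(None, registered, None))
```

Only PASS says the whole chain worked on your target. Every other label points at a different kind of follow-up: a run, a fix, a link, or a registration.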
Step 2: Run the First Remediation Pass
Pick a fault and let the full cycle run. This works on any of the three platforms aiHelpDesk supports, but the Docker Compose route is probably the simplest way to get started. It uses the bundled Postgres, which can be started together with the rest of the stack via the --profile dev option, as documented in this guide:
./faulttest run \
  --conn "host=localhost port=5432 dbname=postgres user=postgres password=devpassword" \
  --db-agent http://localhost:8080 \
  --api-key $HELPDESK_CLIENT_API_KEY \
  --gateway http://localhost:8080 \
  --remediate \
  --id db-max-connections \
  --judge --judge-vendor anthropic \
  --judge-model claude-haiku-4-5-20251001 \
  --judge-api-key "$ANTHROPIC_API_KEY"
Here’s what happens in sequence:
- faulttest injects the fault. In the example above, it sets max_connections=3 on the database, starving new connections
- Sends a diagnostic prompt to the database agent via the Gateway
- Scores the model diagnosis (keyword matching + LLM judge)
- Triggers the linked Playbook (pbs_connection_triage) via the Gateway
- Polls the database until pg_stat_activity confirms recovery
- Appends the run to ~/.faulttest/history.json (the default history location, which can be changed by setting $HELPDESK_FAULT_HISTORY_FILE)
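Two of the steps above, keyword scoring and polling for recovery, can be sketched in a few lines of Python. This is an assumption about the general technique, not faulttest's code. The helper names `keyword_score` and `poll_until` are hypothetical, the LLM judge is left out, and the recovery probe is a stand-in for a real query against pg_stat_activity.

```python
import time
from typing import Callable, List


def keyword_score(diagnosis: str, expected: List[str]) -> float:
    """Fraction of expected keywords found in the diagnosis (case-insensitive)."""
    text = diagnosis.lower()
    hits = sum(1 for kw in expected if kw.lower() in text)
    return hits / len(expected) if expected else 0.0


def poll_until(check: Callable[[], bool], timeout_s: float, interval_s: float) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False


score = keyword_score(
    "max_connections=3 is exhausted; consider a pooler",
    ["max_connections", "exhausted", "PgBouncer"],
)

# Stand-in probe: fails twice, then reports recovery.
probe_results = iter([False, False, True])
recovered = poll_until(lambda: next(probe_results), timeout_s=5.0, interval_s=0.01)
print(f"keyword score: {score:.2f}, recovered: {recovered}")
```

Keyword matching alone is brittle (the diagnosis here never says "PgBouncer"), which is a plausible reason to pair it with a judge model as the article describes.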
A passing run looks like this:
[PASS] Max connections (db-max-connections) - score: 91%
Diagnosis: "Agent correctly identified max_connections=3 as the
root cause, recommended PgBouncer as the long-term fix."
Remediation: RECOVERED in 4.2s
Vault: draft saved → pb_faulttest_a1b2c3
The draft is now in the Vault as source=generated, is_active=false. Nothing has changed in production.
Step 3: Review and Activate the Draft
# List pending drafts
curl -s "http://localhost:8080/api/v1/fleet/playbooks?source=generated" \
-H "Authorization: Bearer $HELPDESK_CLIENT_API_KEY" \
| jq '.playbooks[] | select(.is_active == false) | {id, name, created_at}'
# Read the draft guidance
curl -s http://localhost:8080/api/v1/fleet/playbooks/pb_faulttest_a1b2c3 \
-H "Authorization: Bearer $HELPDESK_CLIENT_API_KEY" \
| jq -r '.guidance'
The draft captures the actual sequence of tool calls the agent made during the successful remediation. Not a hypothetical, but the actual record of what worked, in your environment, against your database configuration.
Review it. If it looks right, activate that Playbook:
curl -s -X POST http://localhost:8080/api/v1/fleet/playbooks/pb_faulttest_a1b2c3/activate \
-H "Authorization: Bearer $HELPDESK_CLIENT_API_KEY"
Activation promotes this version in its series. It’s no longer a draft. The previous active version becomes inactive, but it is preserved in history. From this point on, every incident that matches this fault class has a validated, versioned remediation available.
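The activation semantics (one active version per series, older versions kept in history) can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the Gateway's data model. The `PlaybookVersion` class, the `activate` function, the `pb_system_1` ID, and the version numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlaybookVersion:
    id: str
    series: str
    version: str
    is_active: bool = False


def activate(library: List[PlaybookVersion], playbook_id: str) -> None:
    """Make one version active in its series; siblings stay in the list, inactive."""
    target = next((p for p in library if p.id == playbook_id), None)
    if target is None:
        raise KeyError(playbook_id)
    for p in library:
        if p.series == target.series:
            p.is_active = (p.id == playbook_id)


library = [
    PlaybookVersion("pb_system_1", "pbs_connection_triage", "1.0", is_active=True),
    PlaybookVersion("pb_faulttest_a1b2c3", "pbs_connection_triage", "1.1"),
]
activate(library, "pb_faulttest_a1b2c3")
for p in library:
    print(p.id, p.version, "active" if p.is_active else "inactive")
```

Nothing is deleted on activation. The earlier version is still in the list, which is what makes rolling back to it possible.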
If the draft isn’t quite right, update it before activating:
curl -s -X PUT http://localhost:8080/api/v1/fleet/playbooks/pb_faulttest_a1b2c3 \
-H "Authorization: Bearer $HELPDESK_CLIENT_API_KEY" \
-H "Content-Type: application/json" \
-d '{"guidance": "Updated guidance based on operator review…"}'
Or discard it entirely:
curl -s -X DELETE http://localhost:8080/api/v1/fleet/playbooks/pb_faulttest_a1b2c3 \
-H "Authorization: Bearer $HELPDESK_CLIENT_API_KEY"
Step 4: Track the Library Over Time
Run faulttest on a schedule (weekly CI is a good cadence) and use vault status to watch the trend:
./faulttest vault status --since-days 30
=== Vault Status - staging-db (last 30 days, 4 runs) ===
DATE        RUN ID    PASS RATE
----------  --------  ---------
2026-04-01  a1b2c3d4  70% (7/10)
2026-04-08  e5f6g7h8  80% (8/10)
2026-04-15  i9j0k1l2  90% (9/10)
2026-04-22  m3n4o5p6  90% (9/10)
A rising pass rate means the library is improving: more faults are validated and more Playbooks are confirmed to work in your environment. A flat or declining rate means something changed: a new PostgreSQL version, infrastructure configuration drift, or a Playbook that no longer applies cleanly.
Either way, you know before the next production incident, not during it.
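If you want to act on the trend in CI rather than eyeball it, the arithmetic is simple. The Python sketch below recomputes the pass rates from the sample output above and labels the trend by comparing the latest run with the earliest one. The `pass_rate` and `trend` helpers are hypothetical, and the comparison rule is an assumption, not what vault status does internally.

```python
from typing import List


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage; 0.0 when no faults were run."""
    return 100.0 * passed / total if total else 0.0


def trend(rates: List[float]) -> str:
    """Compare the latest rate with the earliest one in the window."""
    if len(rates) < 2 or rates[-1] == rates[0]:
        return "flat"
    return "rising" if rates[-1] > rates[0] else "declining"


# (date, passed, total) taken from the sample vault status output.
runs = [
    ("2026-04-01", 7, 10),
    ("2026-04-08", 8, 10),
    ("2026-04-15", 9, 10),
    ("2026-04-22", 9, 10),
]
rates = [pass_rate(passed, total) for _, passed, total in runs]
print(rates, trend(rates))
```

A CI job could fail the build on "declining", which turns the weekly run into an early warning instead of a report nobody opens.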
For catching quiet regressions specifically:
./faulttest vault drift --since-days 90
=== Vault Drift - staging-db (last 90 days) ===
FAULT               FIRST HALF  SECOND HALF  DRIFT
------------------  ----------  -----------  -----
db-lock-contention  100%        50%          ▼ -50%
A 50-point drop over 90 days means something changed around db-lock-contention. Investigate now. Running vault drift on a weekly schedule is the difference between catching this in a 30-minute planned investigation and discovering it during a Production emergency at 2am.
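The drift figure is a difference between two pass rates. Here is a minimal Python sketch, assuming the window is split into halves by run count. How vault drift actually splits the window (for example, by date midpoint) is not stated here, so treat the split rule and the `half_rates` and `drift` names as assumptions.

```python
from typing import List, Tuple


def half_rates(results: List[bool]) -> Tuple[float, float]:
    """Pass rate (percent) for the first and second half of chronological results."""
    mid = len(results) // 2
    first, second = results[:mid], results[mid:]

    def rate(rs: List[bool]) -> float:
        return 100.0 * sum(rs) / len(rs) if rs else 0.0

    return rate(first), rate(second)


def drift(results: List[bool]) -> float:
    """Second-half rate minus first-half rate; negative means regression."""
    first, second = half_rates(results)
    return second - first


# Hypothetical run history for one fault, oldest first.
lock_contention = [True, True, True, False]
print(half_rates(lock_contention), drift(lock_contention))
```

With this hypothetical history the halves come out at 100% and 50%, matching the shape of the sample row above: a fault that used to pass reliably and recently stopped.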
Where Real Incidents Fit In
faulttest validates the library against the failure scenarios you designed. Real incidents fill in what you didn’t anticipate.
When the Incident agent resolves a production incident, it fires the same synthesis automatically. An engineer closes the incident with outcome="resolved", and the audit trace of every tool call made during the investigation becomes a draft in the Vault. The engineer on the next similar incident, who may have joined just last month, doesn’t start from scratch. They start from a validated draft of what worked last time, with a known track record and an automatically documented history.
The Vault grows from two directions simultaneously:
- From testing: faulttest validates and improves known scenarios on a schedule
- From production: real incidents add the scenarios the catalog didn’t anticipate
What This Compounds Into
After six months:
A new engineer joins. Instead of learning what db-lock-contention means from a postmortem PDF, they look at vault list and see eight validated remediations with a 90% pass rate and a traceable history.
A PostgreSQL 17 upgrade silently changes the locking behavior. vault drift catches it in the next weekly CI run. The fix takes an afternoon. Without the Vault, it would surface as a failed remediation during a 2am on-call emergency.
But there’s more (and we find this part very exciting): An Incident agent handles a novel connection pool saturation scenario, different from anything in the built-in catalog. The auto-draft captures the agent’s approach. A senior DBA reviews it, tightens the escalation threshold, activates it. The next time it happens, whoever is on-call handles it without escalating.
That is what “institutional memory that compounds” looks like operationally.
Next Steps
- Review the quick startup guides for deploying aiHelpDesk directly on a VM, on Docker/Podman or on K8s
- Set up weekly faulttest CI with the --remediate option to keep the Vault current
- Connect to your real staging database and run the full external fault testing suite
- Add custom fault scenarios for failure modes specific to your environment, as well as the custom or built-in playbooks to remediate these scenarios
Related Reading
- Your SRE On-Call Runbook Is Already Obsolete. Here’s Why That’s Not Your Fault: Introducing aiHelpDesk Operational SRE/DBA Flywheel
- The Missing Test Suite for AI Database Operations: You’re about to bet your SRE/DBA on-call rotation on an AI agent. Want to know if it’s any good before the 2am page goes off?
aiHelpDesk is available in Beta. If you’re running PostgreSQL in production and would like to get help in preparing for the avalanche, consider aiHelpDesk. Reach out to us at info@aiHelpDesk.biz and we’ll be happy to show you what it looks like in practice.
Runbooks Rot. Playbooks Learn. was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
Source Credit: https://medium.com/google-cloud/runbooks-rot-playbooks-learn-bec433817938?source=rss----e52cf94d98af---4
