You got informed consent. Can you prove the AI was right?

A sequel to “You let AI operate on a production database without your consent?”

The first post ended with a gate.

A gate is what we call an optional, but highly recommended pause between AI diagnosis and AI action. It’s a triage screen that shows the operator exactly what the agents found, what they propose to do about it, the blast radius of doing it and the right to say no. Enter the Informed Consent framework borrowed from medical ethics: you cannot operate without telling the patient the diagnosis, proposing the treatment and getting explicit agreement.

That post made the case for why the gate has to exist. This one asks the harder question that comes after: you said “yes” at the gate, but was the diagnosis you consented to, actually correct? And how does your AI system provide dead-easy proof that it delivered exactly as promised?

Because consent is where the conversation begins, not where it ends.

The gap nobody talks about

Good AI SRE tools claim transparency. They show you a diagnosis before they act. They present a remediation plan upfront. Some even let you approve individual, especially destructive, steps. That’s table stakes.

In my observations, what is largely missing in most AI SRE tools are the answers to the following two questions:

For your particular failure scenario…

How often the diagnosis was actually right.
Whether the remediation was actually appropriate.

These questions should be asked at the decision time. And the historical answers to these two questions should be presented to an operator, so that they can make a more informed decision. Because these answers reveal how a similar failure scenario was handled in the past. Right there on the same triage screen that the current on-caller needs to make a decision. NOW!

This matters more than it sounds. An operator at 2am staring at a screen that says “Root blocker PID 867 holds a transaction lock; terminate it” is not running their own analysis. They are evaluating an AI claim under time pressure. If that claim is correct 91% of the time, they have more confidence that the track history of AI diagnosis of this particular problem is good and so they could consider approving it. If it’s correct 60% of the time, they should probably investigate before acting. And if the AI reports 95% confidence in both cases — which number do they use?

Here’s the thing: without measurement, the gate is a ritual. The operator clicked approve because the screen looked authoritative, the clock was ticking, VP was looking over their shoulder and there was no better option. That is not informed consent. That is informed-looking consent.

To turn the gate from a ritual into a measured property, aiHelpDesk features a feedback loop designed to facilitate the Informed Consent, which is an integral part of the Operational SRE/DBA Flywheel (see here and here):

aiHelpDesk Feedback Loop

Every gate interaction generates a feedback record. Two questions, captured in sequence:

At the gate, before remediation runs:
— Is the diagnosis correct?
— Is the proposed remediation approach appropriate?

After recovery completes (aka post-incident):
— Looking back: was the diagnosis correct?
— Looking back: was the remediation appropriate?

These are not survey fields bolted on after the fact. They are the first-class records in the audit store, keyed to the same run ID as the playbook execution, the step log and the automated evaluation score. The gate is where the human verdict enters the system. Everything else is downstream from here.

Who is the primary consumer of this data? Meet the vault commands:

vault drift → Which faults are regressing over time?
vault versions → Did the last playbook update help or hurt?
vault accuracy → How often does the agent actually diagnose correctly?
vault calibration → Can I trust the automated scores?

These commands form a dependency chain. Drift tells you which fault to look at. Versions correlates the regression to a playbook change. Accuracy checks whether diagnosis quality actually improves (or drops). Calibration validates whether the scores you’ve been reading are reliable in the first place.

We recommend working through them in order. Drift is where you start because it surfaces the signal. Calibration is where you end because it validates that signal. The loop closes.

Why the gate is the best place to ask

In our experience, the gate is the best place to ask an operator for a feedback on the AI diagnosis. Because at this time an operator has to make a decision. Not just assessing if the diagnosis is correct, but also, was the diagnosis clearly explained, was it well presented, was it easy to grasp to form… well, an informed opinion?

We find that it’s critical to pose this feedback question before the operator knows whether remediation will succeed. We repeat the same question again post-incident, but that’s already a look in a rear view mirror. Also important, but giving a different signal.

This feedback ordering is deliberate. In our experience, the post-incident feedback is contaminated by outcome bias. An operator who just saw a clean recovery is more likely to confirm the diagnosis regardless of whether they actually understood and evaluated it. The relief of a resolved incident retroactively validates the path that got there.

What’s not what we are after. To improve and evolve triage playbooks, we need honesty. At-gate feedback has no outcome bias contamination. The operator is reading a hypothesis, possibly at 2am in the morning, possibly on their small cell phone screen. Yes, they are also presented with a proposed action plan to fix the problem, but they have no knowledge yet of the remediation outcome. This verdict is the honest one.

This is why vault calibration prefers at-gate feedback over post-incident when both exist for the same run. It is also why, if you are starting a feedback program, and we recommend that you do, the at-gate question matters more than the post-incident one. You get fewer clean shots at unbiased measurement than you think.

That’s our experience anyway. All feedback is optional, which means that an operator is free to skip any and all of our four questions. We advise against it. We encourage the SRE teams include it in their SoP the diligence of providing the feedback. But ultimately, whether you answer all four questions is up to you. This is the description of the feedback loop flow and our recommendations.

If you want to reproduce the findings presented in this blog post, please feel free to check out this doc page for details on the commands to run (for running aiHelpDesk Fault Injection Tests directly on a host/VM in particular or see the other samples for K8s and Docker/Podman), but the excerpts below should be sufficient to make our point:

Calibration is the clincher

After enough feedback accumulates, vault calibration produces something that very few SRE tools do:

Diagnosis calibration — fleet-wide 
(13 runs with eval + operator feedback)

  CONF BAND  RUNS  CORRECT  ACCURACY       CALIBRATION
  ───────── ───── ──────── ───────── ─────────────────
    90-100%    12      11        91%    WELL_CALIBRATED
     70-89%     1       1       100%  INSUFFICIENT_DATA
       <70%     0       0          –  INSUFFICIENT_DATA

Read this table carefully. When the agent reported 90–100% confidence in its diagnosis, it was confirmed correct by operators 91% of the time. The expected accuracy for that band, which is roughly mid-point 95%, is within 10 percentage points of the actual 91%. That is what WELL_CALIBRATED means: the model’s internal confidence score is an honest signal, not a number printed to look authoritative.

Compare this to OVERCONFIDENT, which fires when the actual accuracy falls more than 10 points below the expected band mid-point. If your 90–100% confidence band is only correct 70% of the time, the agent is systematically overstating its certainty. We’ve seen way too many times. And you certainly would want to know that. You would want to lower the autonomous action threshold for that failure scenario until the calibration improves.

This is what verifiable informed consent looks like. Not “the AI told the operator before acting.” But: the operator’s verdict, collected systematically, run against the AI’s stated confidence, checked for consistency. When it checks out, you have evidence. When it doesn’t, you have an early warning.

That was triage. The remediation side produces the same calibration, but independently of triage, because a correct diagnosis can still result in a wrong remediation:

  Remediation calibration — fleet-wide 
  (5 runs with remediation score + operator feedback)

  SCORE BAND  RUNS  CORRECT  ACCURACY     CALIBRATION
  ────────── ───── ──────── ───────── ────────────────
     90-100%     5        5      100%  WELL_CALIBRATED

Five runs is not enough to draw any strong conclusions. But the structure is there and the data accumulates automatically with every faulttest run.

The autonomy gradient

The first post framed full autonomy as something to be suspicious of. That framing holds, but it needs a second half.

Because the full autonomy in auto mode is available in aiHelpDesk too. It is not forbidden and it is not irresponsible … if used in the right circumstances. And those circumstances are very specific: you have run enough fault injection tests to accumulate the calibration data, so that vault calibration shows WELL_CALIBRATED across the confidence bands that matter for your environment and the approval mode you relax to still respects your playbook’s permitted_tools whitelist.

The journey is this:

Run in manual mode. Collect at-gate feedback on every gate interaction.
Watch vault accuracy accumulate. Look for correctness above your bar, whatever that means for your risk tolerance.
Run vault calibration. Confirm the automated scores predict real accuracy.
When both hold, relax the approval mode.

Skipping to step 4 is not faster. It is unverified. The difference between an operator who approved 50 gates with feedback and an operator who skipped the gate entirely is not attitude toward AI. It is evidence. The first operator knows whether and when the AI is right. The second is guessing.

This is also why the gate timeout matters. A gate not resolved within the configured window, transitions to abandoned. Not silently approved. There’s a sharp difference. The system defaults to inaction, not action. Speed is not the goal. Verifiable correctness is.

What we shipped

The feedback loop described her is part of the higher level Operational SRE/DBA Flywheel and it’s live in the current release. Specifically:

At-gate and post-incident feedback for both diagnosis and remediation. All four combinations, captured interactively during faulttest runs or via API for CI/CD pipelines.
vault accuracy showing per-series correctness with a triage and remediation breakdown
vault calibration comparing automated confidence scores to operator-confirmed outcomes, fleet-wide or per fault
vault drift detecting pass-rate regressions before they become production incidents, optionally with a “gateway path” for multi-machine fleet data
vault versions correlating per-playbook-version metrics like step count, recovery time, diagnosis score, remediation score, etc. This is so a bad update is visible before it accumulates too much history

The failure injection test harness that produces all of this data runs against a real Postgres database (either a local, dynamically span Docker if “-auto-db” option is requested or against any BYO instance), injects real failure conditions and captures operator feedback interactively at the exact gate where it matters most.

In a nutshell

The first post asked: did the AI operate with your consent?

This post asks a follow up question: when you consented, were you right to?

Those are different questions. The second one is harder to answer. It requires measurement infrastructure, in particular, feedback capture, evaluation storage, calibration logic, etc. It also requires a philosophical commitment that is easy to skip: the belief that your approval at the gate is a data point that should be tracked, aggregated and used to evaluate the system that asked for your approval.

aiHelpDesk tracks and measures it. The flywheel is the proof.

Does your AI system do this?

aiHelpDesk is open source. The fault injection test harness, the feedback schema and the vault command suite are in the repository. If you run it and your calibration comes back OVERCONFIDENT, that is useful information. Don’t discard it. Work with us to make it WELL_CALIBRATED.

Why aiHelpDesk Playbooks are trustworthy?

Because we give you, the customer, an ability to vet, verify and improve them through the methodology that we refer to as Operational SRE/DBA Flywheel, see here and here for details.

And yes, as a customer, you can not only confirm that the playbooks that you get with aiHelpDesk out of the box work for your environment, but you can easily bring your own. Both: BYO playbooks (by either importing your existing runbooks or cloning and customizing one of our’s or by creating one from scratch) + BYO faults as well.

Because nobody knows your specific databases, your environment and your workload with your upstream/downstream apps better than you do. You know how your database fails. Vendors don’t. At aiHelpDesk, we give you an option to create your own faults and add your own playbooks to triage and rectify them (in addition to the system playbooks we ship, of course).

And yes, we don’t depend on a particular model or a model provider. aiHelpDesk is model agnostic. From our standpoint, the LLMs are a disposable commodity. Flip from Gemini to Anthropic and aiHelpDesk should continue to give you exactly the same diagnosis and remediation. Anything shorter than that is a P0 bug.

Deven Goratela

Administrator

Visit Website View All Posts

Related Stories

AWS DevOps Agent adds release management capabilities to assess code changes before production (preview)

Claude Fable 5 available today in Microsoft Foundry: Powering the next era of autonomous agents

Amazon ECS introduces new high-resolution metrics for faster service auto scaling

You may have missed

AWS DevOps Agent adds release management capabilities to assess code changes before production (preview)

Use Claude Code without Anthropic

Claude Fable 5 available today in Microsoft Foundry: Powering the next era of autonomous agents