Model autonomy is a bug you ship when you haven’t written the spec. The spec gives you structure. Structure isn’t a constraint on LLM capability; it’s the source of it. An unguided model finds what’s easy. A guided one finds what’s true.

James Pritchard recently published an important post titled “LLMs are functions, not brains”. I highly recommend reading it. It resonates deeply with us because our experience in lab experiments and in building aiHelpDesk aligns well with the post’s primary conclusions.
There are also a few minor things that we see differently. Let’s review them in turn:
Alignment
The blog’s central claim is exactly what we learned the hard way, and it maps cleanly onto aiHelpDesk’s approach. Rather than treating LLMs as autonomous reasoners that decide their own execution path, treat each call as a typed function under a contract: strongly typed input, clean typed output, one transformation per call, and all of it composable with regular code.
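To make the contract concrete, here is a minimal sketch of one such typed function. Everything in it is an assumption for illustration: the `Ticket` and `Classification` types, the category list and the `llm` callable (a stand-in for whatever model client you use) come from neither James’s post nor aiHelpDesk.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical label set; a real system would define its own.
CATEGORIES = ("billing", "technical", "sales")


@dataclass(frozen=True)
class Ticket:
    subject: str
    body: str


@dataclass(frozen=True)
class Classification:
    category: str
    confidence: float


def classify(ticket: Ticket, llm: Callable[[str], str]) -> Classification:
    """One transformation per call: Ticket in, validated Classification out."""
    prompt = "\n".join([
        "Classify the ticket into one of: " + ", ".join(CATEGORIES) + ".",
        'Reply with JSON only, with keys "category" and "confidence" (0 to 1).',
        f"Subject: {ticket.subject}",
        f"Body: {ticket.body}",
    ])
    data = json.loads(llm(prompt))
    category = data["category"]
    confidence = float(data["confidence"])
    # The contract is enforced in code, not left to the model's goodwill.
    if category not in CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Classification(category, confidence)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model client (an assumption)."""
    return '{"category": "billing", "confidence": 0.9}'


result = classify(Ticket("Invoice", "I was charged twice."), fake_llm)
```

Because the function either returns a validated `Classification` or raises, the calling code can compose it like any other function.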
In aiHelpDesk we coined the term Crystal Ball. If you are not familiar with the concept, in a nutshell it’s the “bare” LLM grounded in your environment, but with no playbook scaffolding and no structured guidance. Just the model and the Tool Registry.
See here and here for the background. The Crystal Ball concept in aiHelpDesk is important because it highlights the methodology we invented that we refer to as the Operational SRE/DBA Flywheel (blog post, doc).
How is it relevant here? Well, the Crystal Ball vs. the guided playbook comparison that we now use for every new fault injection test is almost exactly the agent vs. workflow benchmark presented in James’s blog. Same tools, same model, the only variable is who specifies the execution path. Crystal Ball is the agent case. The playbook is the workflow case.
In my opinion the token overhead the blog documents (490% more, 3x slower, identical output quality) is a relatively mild failure mode. What we found is that it can get much worse. How? In the case of a diagnosis, the agent not only wastes tokens reasoning about what to call, it can also reason itself into a wrong conclusion early. See this example, for instance, where maxwritten_clean=0 leads an agent to the “bgwriter healthy” conclusion, where it stops and never reaches the steps that would (likely) correct it. Contrast that with a playbook, which prevents early termination by specifying the execution path.
Divergence
While I absolutely agree with James’s main point, the nature of our business (SRE/DBA database troubleshooting) is different and so I believe our experience in building aiHelpDesk makes us diverge in the following:
The blog assumes that you know the function signature at design time and so you can hard-code it (in Python). If I understand the examples in the post, for billing tickets or lead routing, you indeed do. The steps are likely to always be the same: extract, classify into a predefined category, compress and route.
The thing about diagnostic SRE work is that the playbooks themselves have to be selected based on symptom classification first. That’s not static. There’s effectively a meta-layer. Now the symptom classifier is indeed a typed function (that returns a playbook ID). The playbook itself is indeed a typed function (that returns a hypothesis). They compose, but the composition happens at runtime based on what the alert looks like, not at design time.
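A minimal sketch of that runtime composition follows. The playbook IDs, the `Alert` and `Hypothesis` types and the keyword-based classifier are all hypothetical stand-ins (in practice both the classifier and the playbooks would be LLM-backed); the point is only that the playbook is looked up from the classifier’s output rather than wired in at design time.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Alert:
    name: str
    detail: str


@dataclass(frozen=True)
class Hypothesis:
    playbook_id: str
    summary: str


def classify_symptom(alert: Alert) -> str:
    """Typed function #1: Alert -> playbook ID (keyword rules as a stand-in)."""
    text = f"{alert.name} {alert.detail}".lower()
    if "checkpoint" in text or "bgwriter" in text:
        return "pb_checkpoint_pressure"  # hypothetical ID
    if "wal" in text:
        return "pb_wal_growth"  # hypothetical ID
    return "pb_generic_triage"  # hypothetical ID


def make_playbook(playbook_id: str) -> Callable[[Alert], Hypothesis]:
    """Typed function #2 factory: each playbook maps Alert -> Hypothesis."""
    def run(alert: Alert) -> Hypothesis:
        return Hypothesis(playbook_id, f"investigated {alert.name}")
    return run


PLAYBOOKS: Dict[str, Callable[[Alert], Hypothesis]] = {
    pid: make_playbook(pid)
    for pid in ("pb_checkpoint_pressure", "pb_wal_growth", "pb_generic_triage")
}


def diagnose(alert: Alert) -> Hypothesis:
    # The composition is decided here, at runtime, by what the alert looks like.
    return PLAYBOOKS[classify_symptom(alert)](alert)


wal_result = diagnose(Alert("HighWALVolume", "WAL directory growing"))
```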
Another point worth mentioning is that the blog appears to be solving the case of stateless transformations. aiHelpDesk’s diagnostic functions are stateful folds, where each tool result becomes evidence that the next step reasons about. For instance, a DB Agent starts the diagnosis, presents the findings and comes up with a few theories (that we refer to as hypotheses). However, it’s not uncommon that the confidence level that aiHelpDesk requires for each hypothesis (accepted or rejected) is not high enough to propose one as the final RCA verdict. There could be more than one competing candidate with similar confidence levels, which results in an escalation to another agent (most often to the SysAdmin Agent or the K8s Agent) or to a human if the evidence is inconclusive. The aiHelpDesk playbook structure is what drives this process, and in particular the HYPOTHESIS format dictated by a playbook is the accumulator. And that appears to be a harder case than compress(ticket, maxLength=60).
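The fold can be sketched in a few lines. This is an illustration under stated assumptions, not aiHelpDesk’s implementation: the step functions, the findings they return, the 0.8 confidence threshold and the 0.1 margin between competing hypotheses are all made up for the example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Tuple


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    confidence: float


@dataclass(frozen=True)
class State:
    """The accumulator: evidence gathered so far plus current hypotheses."""
    evidence: Tuple[str, ...] = ()
    hypotheses: Tuple[Hypothesis, ...] = ()


Step = Callable[[State], State]


def check_bgwriter_stats(state: State) -> State:
    # Stand-in for a tool call plus LLM analysis; the finding is illustrative.
    return State(
        evidence=state.evidence + ("pg_stat_bgwriter: maxwritten_clean=0",),
        hypotheses=(Hypothesis("bgwriter healthy", 0.5),),
    )


def check_settings(state: State) -> State:
    # This step reasons over the accumulated evidence, not a blank slate.
    saw_zero = any("maxwritten_clean=0" in e for e in state.evidence)
    revised = (
        Hypothesis("bgwriter misconfigured", 0.9 if saw_zero else 0.6),
        Hypothesis("bgwriter healthy", 0.2),
    )
    return State(
        evidence=state.evidence + ("pg_settings: suspicious bgwriter parameter",),
        hypotheses=revised,
    )


def decide(state: State, threshold: float = 0.8, margin: float = 0.1) -> str:
    """Propose an RCA only if one hypothesis clearly wins; otherwise escalate."""
    ranked = sorted(state.hypotheses, key=lambda h: h.confidence, reverse=True)
    if not ranked or ranked[0].confidence < threshold:
        return "ESCALATE"
    if len(ranked) > 1 and ranked[0].confidence - ranked[1].confidence < margin:
        return "ESCALATE"
    return f"RCA: {ranked[0].cause}"


steps: Tuple[Step, ...] = (check_bgwriter_stats, check_settings)
final = reduce(lambda state, step: step(state), steps, State())
verdict = decide(final)
```

Stopping after the first step would leave a single 0.5-confidence hypothesis, which `decide` escalates instead of accepting.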
But in principle, the compression failure is similar to the failure diagnosis or remediation we are dealing with in aiHelpDesk while troubleshooting SRE/DBA database problems.
Indeed, if you look beyond the superficial differences, there appears to be a clear parallel with the preserve=[“CRM”, “budget”, “timeline”] example. The blog says naive compress drops the Slate CRM requirement because it’s one line out of many: locally correct (it’s short), globally wrong (it’s the make-or-break detail). Well, that’s exactly what we found with the maxwritten_clean=0 (which feels like a trap): locally correct inference (no write activity logged, so bgwriter looks healthy, right?), globally wrong conclusion (the parameter that caused the problem is sitting right there in pg_settings).
And so it’s no surprise then that the fix in both cases is essentially the same: specify in the function contract what must appear in the output. For compress, that’s a preserve list. For aiHelpDesk SRE/DBA diagnosis, that’s the EVIDENCE and CONFIDENCE fields in the HYPOTHESIS format, which force the model to surface which artifacts it looked at, which specific parts of those artifacts were of interest and how they affected its analysis.
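Both contracts can be checked mechanically. The sketch below assumes a simple `KEY: value` line layout for the HYPOTHESIS block, which is a guess made for illustration and not the actual aiHelpDesk format; the helper names are hypothetical too.

```python
from typing import Dict, Iterable, List

# Assumed required fields, taken from the field names mentioned above.
REQUIRED_FIELDS = ("HYPOTHESIS", "EVIDENCE", "CONFIDENCE")


def missing_terms(output: str, preserve: Iterable[str]) -> List[str]:
    """Return the preserve-list terms that the compressed output dropped."""
    lowered = output.lower()
    return [term for term in preserve if term.lower() not in lowered]


def parse_hypothesis(text: str) -> Dict[str, str]:
    """Parse 'KEY: value' lines and reject output that omits a required field."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().isupper():
            fields[key.strip()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return fields


dropped = missing_terms(
    "Needs Slate CRM integration, budget 50k",
    ["CRM", "budget", "timeline"],
)

parsed = parse_hypothesis(
    "HYPOTHESIS: bgwriter is misconfigured\n"
    "EVIDENCE: pg_stat_bgwriter: maxwritten_clean=0\n"
    "CONFIDENCE: 0.9"
)
```

An output that fails either check can be rejected or retried before anything downstream treats it as a result.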
Disagreement
There’s also a point where our experience in building aiHelpDesk disagrees with James. The blog says “you don’t need an orchestration framework, you just call them.” Well, for aiHelpDesk’s diagnostic case, you do. The escalation chains (ESCALATE_TO: pbs_sysadmin_docker_inspect), the conditional branching based on tool results and the multi-agent routing can hardly be viewed as anything but orchestration, can they?
However, let’s not lose sight of James’s main point, which very much still stands: this orchestration lives in the playbook specification (English + YAML), not in the agent’s autonomy. The human (who encodes their operational experience in the form of a playbook) decides on the path that drives the investigation logic. The LLM reviews the artifacts, zeroes in on the important bits, runs the analysis, comes up with the hypotheses, presents its supporting evidence, and proposes one as the RCA or escalates. In a nutshell, the LLM executes the logic, but doesn’t decide on it.
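As a rough illustration of orchestration living in the specification, the sketch below drives execution from plain data. Only the ID pbs_sysadmin_docker_inspect appears in this post; the other playbook ID, the step names, the dictionary layout (a stand-in for YAML) and the boolean tool results are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Playbooks as data: the path is written down by a human, not chosen by the model.
PLAYBOOKS: Dict[str, dict] = {
    "pbs_db_connection_check": {  # hypothetical ID
        "steps": ["check_db_reachable"],
        "escalate_to": "pbs_sysadmin_docker_inspect",
    },
    "pbs_sysadmin_docker_inspect": {
        "steps": ["inspect_container"],
        "escalate_to": None,
    },
}

# Stand-ins for tool calls plus LLM analysis; True means "conclusive finding".
TOOLS: Dict[str, Callable[[], bool]] = {
    "check_db_reachable": lambda: False,
    "inspect_container": lambda: True,
}


def run(playbook_id: str) -> List[str]:
    """Execute every step of each playbook, following ESCALATE_TO links."""
    trace: List[str] = []
    current: Optional[str] = playbook_id
    while current is not None:
        playbook = PLAYBOOKS[current]
        conclusive = False
        for step in playbook["steps"]:
            # Every listed step runs, so an early "looks healthy" cannot end the run.
            trace.append(f"{current}:{step}")
            conclusive = TOOLS[step]() or conclusive
        if conclusive:
            trace.append(f"{current}:RCA")
            return trace
        current = playbook["escalate_to"]
    trace.append("ESCALATE_TO:human")
    return trace


trace = run("pbs_db_connection_check")
```

The executor is ordinary code; the decisions about which step follows which are read from the playbook data.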
Different starting points, same destination
James’s blog is effectively arguing against over-engineering. It feels like in aiHelpDesk we started from the opposite direction. We started simple(minded). We initially tried unstructured system prompting and watched it fail miserably on various, sometimes subtle, inference traps. It became obvious that we couldn’t get the consistency, reliability and explainability we were looking for. That is, there was no way we could slap a decision-making SLO on a diagnosis, let alone remediation.
And so to tame that decision reliability problem we made the first pivot and started adding anti-hallucination safeguards, which eventually grew into a sophisticated multi-layered fabrication detection system, part of the eight-module AI Governance harness. But all of this just helped us track bad LLM decisions after the fact. Yes, our flashing fabrication warning signs ⚠️ alert a customer when the AI gets “creative” and reports an API call to the Tool Registry that didn’t actually happen: the real-time check against our tamper-proof Audit finds no record of it, so the LLM’s claim is nothing but a figment of the model’s imagination. So sure, this and other sanity checks helped us build the foundation of a trusted system, but they didn’t get us any closer to reliable decision making.
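The core of such a check is a set difference between what the model claims and what the audit trail recorded. The sketch below is a deliberately simplified illustration and not aiHelpDesk’s detection system; the `ToolCall` type, the tool names and the call IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ToolCall:
    tool: str
    call_id: str


def find_fabricated(
    claimed: Iterable[ToolCall], audited: Iterable[ToolCall]
) -> List[ToolCall]:
    """Return the calls the model reported that the audit log never recorded."""
    recorded = set(audited)
    return [call for call in claimed if call not in recorded]


# What the audit trail actually registered (illustrative data).
audit_log = [ToolCall("get_pg_settings", "c1")]

# What the model's report says it called (illustrative data).
claimed = [
    ToolCall("get_pg_settings", "c1"),
    ToolCall("get_bgwriter_stats", "c2"),
]

fabricated = find_fabricated(claimed, audit_log)
```

Any non-empty result is grounds for a warning, but as noted above it only flags the problem after the fact.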
They didn’t get us closer to full, clear explanations either: which artifacts were reviewed, which specific excerpts from those artifacts were significant in the context of the problem at hand, how they affected the analysis, how the candidate hypotheses were formed and supported by the collected evidence, and whether the evidence was sufficient or the confidence was low enough to warrant an escalation.
That’s when we pivoted again and arrived at the structured scaffolding of the playbooks.
At this point we find that it doesn’t matter which model you use. Really. In fact, if you discover that an aiHelpDesk diagnosis works with one model and not the next, please let us know. We treat that as a bug that tells us the playbook guidance isn’t tight enough.
So unless you rely on “bare” LLMs, the models effectively become a (disposable) commodity. Our Crystal Ball mode is an example of a “bare” LLM and yes, in this case you are at the mercy of a model’s free-form prose with no structure, no consistency and no accountability.
So you don’t need the most sophisticated LLMs. Not for the agents, not even for the LLM-as-Judge. What you do need is a guided, structured, 100% audited and fully explainable decision process. And that’s exactly what we get with the playbooks, which we verify against the faults we constantly inject via our fault-injection testing system as well as against real incidents. We call it the Operational SRE/DBA Flywheel: a system that learns from every incident, injected or real. See here and here for details.

So yes, we may have started from different points, but we arrived at the same destination! The bottom line is this:
Autonomy is what you give an LLM when you haven’t thought hard enough about what you want.
Related Reading
- aiHelpDesk Flywheel official documentation
- Your SRE On-Call Runbook Is Already Obsolete. Here’s Why That’s Not Your Fault: Introducing aiHelpDesk Operational SRE/DBA Flywheel
- The Missing Test Suite for AI Database Operations: You’re about to bet your SRE/DBA on-call rotation on an AI agent. Want to know if it’s any good before the 2am page goes off?
- Runbooks Rot. Playbooks Learn: Operational SRE/DBA Flywheel: Ops Knowledge That Compounds. Automatically. Improving with every incident.
- We Wanted a Dramatic AI Agent Failure. We Got Something Better: When the Flywheel works: The K8s WAL fault that made us rethink what playbooks are for
- AI Database Troubleshooting: the PostgreSQL Stat That Looks Like Good News (But Ain’t): What a bgwriter incident taught us about the difference between reading data and understanding it
aiHelpDesk is available in Beta. If you’re running PostgreSQL in production and would like to get help in preparing for the avalanche, consider aiHelpDesk. Reach out to us at info@aiHelpDesk.biz and we’ll be happy to show you what it looks like in practice.
LLMs are Functions, not Brains — aiHelpDesk perspective was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
Source Credit: https://medium.com/google-cloud/llms-are-functions-not-brains-aihelpdesk-perspective-e12e5432a9ed?source=rss—-e52cf94d98af—4
