One of our favorite mottos in Google Site Reliability Engineering (SRE) is: “Eliminate Toil.”
You hear it in the microkitchens in Zurich and the hallways in Mountain View. This motto refers to the SRE mission of replacing repetitive, manual work with engineered systems. But as a senior SRE once explained, this doesn’t just mean writing a script to solve a problem once. It means building the automation that triggers that script at the exact right moment—often the hardest part.
AI has already revolutionized how we write code, but what about how we operate it? Can AI safely solve operational problems? Can it assist operators during a high-pressure outage without taking away control?
In this article, we’ll delve into real scenarios that Google SREs are solving today using Gemini 3 (our latest foundation model) and Gemini CLI—the go-to tool for bringing agentic capabilities to the terminal.
The Scenario: Fighting “Bad Customer Minutes”
Meet Ramón. Ramón is a Core SRE, meaning he works in the engineering group that develops the foundational infrastructure for all of Google’s products: safety, security, account management, and data backends for multiple services.
When infrastructure at this layer has a hiccup, it’s visible across a massive range of products immediately. Speed is vital, and we measure it in Bad Customer Minutes. Every minute the service is degraded burns through our Error Budget.
To combat this, we obsess over MTTM (Mean Time to Mitigation). Unlike Mean Time to Repair (MTTR), which focuses on the full fix, MTTM is about speed: how fast can we stop the pain? In this space, SREs typically have a 5-minute Service Level Objective (SLO) just to acknowledge a page, and extreme pressure to mitigate shortly after.
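To make the stakes concrete, here's a rough illustration of how quickly degraded minutes consume an error budget. The 99.9% target and 30-day window below are hypothetical; real SLO targets and error-budget policies vary by service.

```python
# Illustrative only: a hypothetical 99.9% monthly availability SLO.
# Real SLO targets and error-budget windows differ per service.

SLO_TARGET = 0.999              # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60   # a 30-day rolling window

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {error_budget_minutes:.1f} minutes per 30 days")  # 43.2

# If mitigation takes 20 minutes, that single incident consumes
# almost half of the month's budget.
incident_mttm_minutes = 20
print(f"Budget consumed: {incident_mttm_minutes / error_budget_minutes:.0%}")  # ~46%
```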
An incident usually follows four distinct stages:
- Paging: The SRE gets alerted.
- Mitigation: We “stop the bleeding” to reduce Bad Customer Minutes, often before we even know why it broke.
- Root Cause: Once users are happy, we investigate the underlying bug and fix it for good.
- Postmortem: We document the incident and assign action items to engineering teams, prioritized to ensure it never happens again.
Let’s walk through a real (simulated) outage to see how Gemini CLI accelerates every step of this journey to keep MTTM low.
Step 1: Paging and Initial Investigation
Context: It’s 11:00 AM. Ramón’s pager goes off for incident s_e1vnco7W2.
When a page comes in, the clock starts ticking. Our first priority isn’t to fix the code—it’s to mitigate the impact on users. Thanks to the extensive work on Generic Mitigations by Google SREs, we have a defined, closed set of standard mitigation classes (e.g., drain traffic, rollback, restart, add capacity).
This is a perfect task for an LLM: classify the symptoms and select a mitigation playbook. A mitigation playbook is a set of instructions generated dynamically so that an agent can execute a production mutation safely. A playbook can include the command to run, but also instructions to verify that the change is effectively addressing the problem, or to roll back the change if it isn’t.
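ProdAgent’s actual playbook schema is internal, but as a hedged sketch, a playbook of this shape could be modeled roughly like this. Every field name and command string below is a hypothetical placeholder, not the real format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a mitigation playbook; the real internal schema
# almost certainly differs. Field names and commands are illustrative only.

@dataclass
class MitigationPlaybook:
    mitigation_class: str        # e.g. "drain_traffic", "rollback", "restart"
    command: str                 # the production mutation to execute
    verification: str            # how to confirm the symptom is improving
    rollback_command: str        # how to undo the change if it doesn't help
    notes: list[str] = field(default_factory=list)

playbook = MitigationPlaybook(
    mitigation_class="drain_traffic",
    command="drain --cell=xx --service=example-frontend",      # placeholder
    verification="error-rate time series drops back below the alert threshold",
    rollback_command="undrain --cell=xx --service=example-frontend",  # placeholder
    notes=["Confirm capacity headroom in the remaining cells before draining."],
)
```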
Ramón opens his terminal and uses Gemini CLI. Gemini immediately calls the fetch_playbook function from ProdAgent (our internal agentic framework). It chains together several tools to build context (a simplified sketch of such a chain follows the list):

- get_incident_details: Fetches the alert data from our Incident Response Management system (description, metadata, prior instances, etc.).
- causal_analysis: Finds causal relations between the behavior of different time series and generic mitigation labels.
- timeseries_correlation: Finds pairs of time series that are correlated, which may help the agent find root causes and mitigations.
- log_analysis: Uses log patterns and volumetric analysis to detect anomalies in the stream of logs from the service.
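The ProdAgent tools themselves aren’t public, so the following is only a sketch of how an agent might chain tools like these to build context before selecting a playbook. Every function here is a stubbed, hypothetical stand-in, not ProdAgent’s API.

```python
# Hypothetical sketch of chaining context-gathering tools before selecting a
# mitigation playbook. None of these functions are ProdAgent's real API; they
# are stand-ins that show the shape of the flow described above.

def get_incident_details(incident_id: str) -> dict:
    """Fetch alert description, metadata, and prior instances (stubbed)."""
    return {"id": incident_id, "symptom": "elevated error rate", "prior": []}

def causal_analysis(incident: dict) -> list[str]:
    """Return generic-mitigation labels causally linked to the symptom (stubbed)."""
    return ["rollback", "drain_traffic"]

def timeseries_correlation(incident: dict) -> list[tuple[str, str]]:
    """Return pairs of correlated time series that may hint at a cause (stubbed)."""
    return [("frontend.error_rate", "backend.release_version")]

def log_analysis(incident: dict) -> list[str]:
    """Return anomalous log patterns found by volumetric analysis (stubbed)."""
    return ["spike in RPC deadline_exceeded errors starting at 10:55"]

def fetch_playbook(incident_id: str) -> dict:
    """Chain the tools, then classify the symptoms into a mitigation playbook."""
    incident = get_incident_details(incident_id)
    context = {
        "mitigation_candidates": causal_analysis(incident),
        "correlated_series": timeseries_correlation(incident),
        "log_anomalies": log_analysis(incident),
    }
    # In the real flow, the model reasons over this context; here we simply
    # pick the first candidate mitigation class as a placeholder.
    return {
        "incident": incident,
        "context": context,
        "selected_mitigation": context["mitigation_candidates"][0],
    }

print(fetch_playbook("s_e1vnco7W2"))
```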
