
In modern data engineering, building high-quality, efficient data pipelines is only half the job; keeping them healthy and efficient in production is just as critical.
Traditional DataOps relies on manual intervention and static runbooks. However, we are entering the era of Agentic Operations. By bringing the reasoning power of Gemini directly into the terminal via the Gemini CLI, SREs can move from manual log-grepping to high-level system orchestration. This blog explores how the Model Context Protocol (MCP) and Gemini CLI transform a reactive SRE into a proactive Orchestrator of autonomous data pipelines.
To see this in action, imagine a Medallion Architecture (Bronze → Silver → Gold) running on GCP.
- The Failure: A scheduled Dataproc job processing the Orders table from Silver to Gold fails.
- The Impact: Executive dashboards are showing stale data.
- The Complexity: The failure could be a schema mismatch in BigQuery, a transient network error in the Dataproc cluster, or a data quality violation in the Silver layer.
Throughout this FAQ blog, we will see how an SRE uses the Gemini CLI to triage this specific failure in seconds, rather than the usual hour of manual investigation.
What is DataOps, and why are good SREs important in DataOps squads?
DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data providers and consumers. Think of it as DevOps for data pipelines: it applies Agile and CI/CD principles to ensure that data delivery is high-quality, reproducible, and fast.
In a DataOps squad, the Site Reliability Engineer (SRE) is the guardian of the Data estate. While data engineers build the pipelines, the SRE ensures those pipelines are:
- Scalable: Handling sudden spikes in data volume without crashing.
- Observable: Providing clear metrics on data freshness, quality, and lineage.
- Resilient: Automatically recovering from transient network blips or source system downtime.
Without a strong SRE, a DataOps squad often falls into a cycle of reactive firefighting, where manual fixes to broken tables consume 80% of the team’s time. We will discuss some of the key metrics that modern AI-driven SREs watch in the subsequent sections of this blog.
How do SREs traditionally work, and how has AI disrupted that workflow?
Traditionally, SREs follow a Reactive-Manual workflow:
- Detection: A threshold-based alert triggers (For example — Airflow task failed).
- Context Gathering: The SRE manually jumps between the GCP Console, Cloud Logging, and BigQuery to find the error.
- Diagnosis: They manually grep through thousands of lines of logs to find the one Java stack trace that matters.
- Remediation: They follow a static runbook to restart the job or patch the schema.
AI has disrupted this approach by introducing the AI-Augmented Mitigation model:
- From Searching to Asking: Instead of digging through logs, the SRE asks the CLI: Why did the ‘Orders’ pipeline fail at 2 AM?
- Parallel Investigation: The AI doesn’t check one tool at a time. It simultaneously queries BigQuery schemas, analyzes Dataproc logs, and checks Dataflow job states.
- Reasoning Over Data: AI bridges the gap between what happened (the error) and why (the root cause), shifting the SRE’s role from Log Searcher to Decision Maker.
What is Gemini CLI, and why are SREs using it for production outages?
Gemini CLI is an open-source AI agent that brings the reasoning power of the Gemini models directly into the terminal. SREs no longer have to rely solely on watching dashboards and reacting to a failed or broken pipeline; instead, they can adopt an AI-augmented mitigation model. One of the most important metrics for an SRE is how quickly a broken service is brought back to a running state, measured as average resolution time (the total time taken to fix the root cause). Another important metric is MTTM (Mean Time to Mitigation), which aims to stop the bleeding for the customer. Optimizing for MTTM is often the more efficient goal because it prioritizes service availability over deep-dive repairs. Achieving a low MTTM means synthesizing logs, identifying the likely root cause, and initiating actions to bring the service back up, and automating all of these steps is essential. In short, making these workflows agentic is the way forward, and this is exactly where Gemini CLI becomes highly effective.
How does Gemini CLI change the way a DataOps architect manages thousands of pipelines?
In the use case we are applying these ideas to, managing thousands of pipelines traditionally requires the SRE to jump between Cloud Composer, Dataproc logs, and BigQuery schemas to find a failure. Gemini CLI acts as a unified orchestrator: instead of searching across tabs, the architect asks the CLI to triage the failure. The AI uses tools to fetch logs and check data lineage, summarizing the root cause (for example, a schema mismatch) in seconds rather than minutes.
What are these tools? How do they get access to my logs, projects, datasets to do the root cause analysis?
To understand the tools, we first need to understand the Model Context Protocol (MCP). MCP is an open standard that connects AI agents to external tools and data sources. Before MCP, every AI integration needed custom code to talk to a database or an API. MCP standardizes this: just as a USB-C port carries power, data, and video through one plug, MCP lets Gemini connect to any service, whether it's BigQuery, Slack, GitHub, or Kubernetes, through a single, universal protocol.
As part of the setup, we install MCP servers. In our hypothetical pipeline-failure scenario, the CLI needs access to Cloud Logging, BigQuery, and Dataplex (to capture data lineage), so we install the corresponding servers:
npm install -g @modelcontextprotocol/server-google-cloud-logging
npm install -g @modelcontextprotocol/server-bigquery
npm install -g @google/mcp-server-dataplex
Next, we need to configure settings.json in the project root (for Gemini CLI this typically lives under the .gemini directory). We add the following to this JSON file to register these servers so Gemini can use them as tools:
{
  "mcpServers": {
    "gcp-logs": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-google-cloud-logging"],
      "env": { "GOOGLE_CLOUD_PROJECT": "your-prod-project-id" }
    },
    "bigquery": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-bigquery"],
      "env": { "BIGQUERY_PROJECT": "your-prod-project-id" }
    },
    "dataplex": {
      "command": "npx",
      "args": ["-y", "@google/mcp-server-dataplex"],
      "env": { "DATAPLEX_PROJECT": "your-prod-project-id" }
    }
  }
}
For authentication, we can use either a service account or Application Default Credentials (ADC):
gcloud auth application-default login
Once this setup is done, we are all set to start the triage from the Gemini CLI.
How does this work when all of it comes together?
Let's use our real-world example to explain this. From the scenario we discussed at the beginning of this blog, the scheduled job that promotes the Orders table from the Silver layer to the Gold layer, enforcing data quality checks and performing some cleansing along the way, has failed.
Given that the CLI now has access to the tools that are required for our RCA, here’s how our diagnosis goes:
- Diagnosis: You ask Gemini CLI: Why did the Orders pipeline fail? It calls the Logging MCP and finds an Out of Memory error.
- Impact: You ask: Is the Gold layer stale? It calls the BigQuery MCP to check the last update timestamp (a minimal sketch of this check follows this list).
- Mitigation: Gemini proposes: I should increase Dataproc memory for this batch. You type Approve, and it calls the Dataproc MCP to restart the job with 16GB instead of 8GB.
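For context, here is roughly what the staleness check boils down to. This is a minimal, illustrative sketch that queries table metadata with the BigQuery Python client directly rather than through the MCP server; the table ID and the six-hour freshness threshold are assumptions, not values from the scenario.

from datetime import datetime, timezone
from google.cloud import bigquery

def is_gold_table_stale(table_id: str, max_age_hours: int = 6) -> bool:
    """Return True if the table has not been updated within max_age_hours."""
    client = bigquery.Client()  # uses the same ADC credentials as the CLI
    table = client.get_table(table_id)  # table metadata includes the last-modified timestamp
    age = datetime.now(timezone.utc) - table.modified
    return age.total_seconds() > max_age_hours * 3600

print(is_gold_table_stale("your-prod-project-id.gold.orders"))  # hypothetical table ID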
With this ask-check-approve loop, the bleeding stops early: very little time is spent on deep-dive RCA before the service is restored, which directly reduces MTTM.
If LLMs are non-deterministic (unpredictable), how can they be safe for production operations?
We use a Deterministic Sandwich architecture. The Brain (Gemini in our case) is probabilistic. It decides what to do. However, the hands (MCP Tools) are strictly deterministic. They are pre-written, unit-tested scripts. We also add a Safety Policy Layer that checks the AI’s intent against hard rules (example — Reject any command that deletes a Production dataset), ensuring the AI only operates within safe, human-defined guardrails.
In our Medallion pipeline scenario (where a Spark job failed and corrupted a Gold feature table), the architecture works like this.
The Brain (Probabilistic Layer)
- What it does: It looks at the messy Spark logs and BigQuery errors. It uses intuition to guess: I think the Silver table is corrupt because the source schema changed.
- The Risk: Because it’s probabilistic, it might occasionally suggest something wild, like Delete the Silver table and start over.
The Safety Layer
- What it does: It acts as a filter between the Brain’s idea and the actual system.
- In our use case: You set a rule: AI is never allowed to run a DROP or DELETE command on Production datasets.
- The Result: Even if Gemini thinks deleting a table is the best way to fix the pipeline, the Safety Layer blocks that request before it ever reaches your data (a minimal sketch of such a policy check follows below).
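To make the Safety Layer concrete, here is a minimal sketch of such a policy check. The rule list and the is_safe helper are hypothetical illustrations of the idea, not part of Gemini CLI or MCP; real guardrails would sit in front of every tool call.

import re

# Hypothetical guardrails: statements the agent must never run against production.
FORBIDDEN_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDELETE\s+FROM\b",
    r"\bTRUNCATE\s+TABLE\b",
]

def is_safe(proposed_sql: str, environment: str) -> bool:
    """Block destructive statements before they ever reach a production dataset."""
    if environment != "production":
        return True
    return not any(re.search(p, proposed_sql, re.IGNORECASE) for p in FORBIDDEN_PATTERNS)

# Even if the Brain decides dropping the Silver table is the "best" fix, it is rejected.
print(is_safe("DROP TABLE silver.orders", "production"))  # False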
The Hands (the bottom layer, which uses MCP tools)
- What it does: These are rigid, pre-written Python scripts or API calls that have one specific job. They don’t think; they just execute.
- In our use case: You give Gemini a tool called restart_spark_batch(batch_id).
- The Result: Gemini doesn’t write its own code to restart the job (which could have bugs). It simply pushes the button on the script you already wrote and tested (a sketch of such a tool follows below).
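Here is a minimal sketch of what such a tool could look like, assuming a vetted resubmission script already exists (scripts/resubmit_batch.sh is hypothetical). The MCP server would simply expose this function as the restart_spark_batch tool: the agent supplies arguments, but the behavior is fixed and pre-tested.

import subprocess

ALLOWED_MEMORY = {"8g", "16g"}  # the only configurations this tool will ever apply

def restart_spark_batch(batch_id: str, executor_memory: str = "16g") -> int:
    """Deterministic 'hand': rerun the pre-tested resubmission script with validated inputs."""
    if executor_memory not in ALLOWED_MEMORY:
        raise ValueError(f"Unsupported memory setting: {executor_memory}")
    result = subprocess.run(
        ["./scripts/resubmit_batch.sh", batch_id, executor_memory],
        check=False,
    )
    return result.returncode  # 0 means the script resubmitted the batch successfully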
In summary, we don’t let the AI write and execute its own code against the data. Instead, we use the Deterministic Sandwich: the AI reasons about the failure (the Brain), a set of hard rules blocks dangerous actions (the Safety Layer), and the actual fix is carried out by vetted, pre-written scripts (the Hands). It’s the speed of AI with the safety of traditional engineering.
How does starting with a CLI-based approach eventually lead to fully autonomous Self-Healing pipelines?
The CLI is the Training Ground. By using Gemini CLI, engineers teach the model which tools and prompts work for specific failures. Once a certain workflow (for example: if Spark fails with OOM, retry with 2x memory) is proven 100% reliable in the CLI, you can move that logic into an Autonomous Agent (using the Agent SDK) that triggers automatically from an alert, without any human typing.
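As an illustration, here is a minimal sketch of that promoted rule running as an automatic handler. The alert payload shape is an assumption, and restart_spark_batch is the hypothetical deterministic tool sketched earlier; a real autonomous agent would wire this logic to an alerting channel rather than a plain function call.

def handle_pipeline_alert(alert: dict) -> str:
    """Hypothetical auto-mitigation: the exact rule proven reliable in the CLI."""
    error_text = alert.get("error_message", "")
    batch_id = alert.get("batch_id", "")
    current_memory_gb = int(alert.get("executor_memory_gb", 8))

    if "OutOfMemoryError" in error_text and batch_id:
        new_memory = f"{current_memory_gb * 2}g"  # the "retry with 2x memory" rule
        restart_spark_batch(batch_id, executor_memory=new_memory)  # defined in the earlier sketch
        return f"Resubmitted {batch_id} with {new_memory} executors"
    return "No known remediation; escalating to a human"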
Which specific GCP MCP Servers are “must-haves” for a DataOps SRE?
To maintain production pipelines sensibly, an SRE needs a “sensory kit” that covers the entire data lifecycle. While the standard Google Cloud MCP servers provide a broad base, these four are non-negotiable for DataOps:
- BigQuery MCP: Essential for querying metadata and validating data quality. It allows Gemini to ask: “Did the row count drop significantly between the last two batches?” (a minimal sketch of this kind of check appears after this list).
- Cloud Logging MCP: The eyes of the SRE. It allows the CLI to synthesize logs across services (example: correlating a GCS 404 error with a failing Dataflow worker).
- Dataplex MCP: Crucial for Data Lineage. When a Gold table fails, the SRE uses this to find the exact Silver source that corrupted the flow.
- GKE/Dataproc MCP: These provide the hands. If a job fails due to memory, these tools allow Gemini to propose (and execute) a cluster resize or a job restart with new parameters.
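To make the BigQuery MCP bullet concrete, here is a minimal sketch of that row-count check written against the BigQuery Python client directly; the table name, the batch_date partition column, and the 50% drop threshold are assumptions for illustration.

from google.cloud import bigquery

def row_count_dropped(table: str, partition_col: str = "batch_date", threshold: float = 0.5) -> bool:
    """Flag the table if the latest batch has far fewer rows than the previous one."""
    client = bigquery.Client()
    query = f"""
        SELECT {partition_col} AS batch, COUNT(*) AS row_count
        FROM `{table}`
        GROUP BY batch
        ORDER BY batch DESC
        LIMIT 2
    """
    rows = list(client.query(query).result())
    if len(rows) < 2:
        return False  # not enough history to compare
    latest, previous = rows[0].row_count, rows[1].row_count
    return previous > 0 and latest < previous * threshold

print(row_count_dropped("your-prod-project-id.silver.orders"))  # hypothetical table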
How do we handle “Human-in-the-Loop” for high-stakes production changes?
In a production environment, auto-approving every AI action is a recipe for disaster. Maintaining pipelines sensibly requires a Human-in-the-Loop (HITL) framework:
- Interactive Approval: By default, Gemini CLI operates in interactive mode. Before it runs a command like gcloud dataproc jobs submit, it presents the full command to you. You are the final gatekeeper.
- Plan Review: In our Medallion scenario, Gemini might suggest: “I will increase the Dataproc worker memory to 16GB and restart the job.” It won’t act until you type Approve or Y.
- The “Dry Run” Habit: You can instruct Gemini to always show the gcloud equivalent or a SQL preview before execution.
- Audit Logs: Because the CLI uses your credentials (ADC), every action is logged in GCP Cloud Audit Logs, ensuring accountability for every AI-driven mitigation.
DataOps is no longer just about moving bytes from point A to point B; it’s about the integrity of those bytes and the speed at which you can recover when the flow breaks. By adopting the Gemini CLI and the Model Context Protocol (MCP), SREs are essentially upgrading their terminal from a simple command line to an intelligent co-pilot that can reason through complex failures in real time.
To move from theory to production, explore these curated resources:
- Hands-on Codelab: Mastering Gemini CLI — A step-by-step guide to installing the CLI and configuring your first MCP servers.
- Official Documentation: Gemini CLI & MCP Overview — Technical deep-dives into the architecture and tool-calling capabilities.
Thanks for reading!
