Moving beyond adoption metrics to true operational observability.
As Platform Engineers and DevOps leaders, we often focus heavily on adoption metrics: How many developers are using the new AI tools? How many code suggestions are being accepted?
While these numbers are vital for ROI, they miss a critical operational reality: Reliability.
If a developer’s CLI request fails, is it a client-side config error (4xx) or a server-side outage (5xx)? Standard usage metrics won’t tell you.
In this post, I’ll share how to bridge that gap by building a custom Gemini CLI Health Dashboard on Google Cloud.
⬛️ The Problem: The “Black Box” of CLI Usage
The Gemini CLI is a powerful tool for developers, but like any API-driven utility, it relies on network connectivity, correct authentication, and backend availability. Standard “usage” dashboards often just show successful interactions. They rarely highlight the failures that frustrate developers.
We needed a way to answer three questions in real time:
- Is the CLI working for everyone right now?
- If it’s failing, is it their fault (4xx) or ours (5xx)?
- Are errors spiking after a specific deployment or network change?
📡 The Solution: Telemetry-Driven Observability
We solved this by leveraging the OpenTelemetry capabilities built directly into the Gemini CLI. Instead of treating the CLI as a standalone tool, we configured it to emit telemetry signals directly to our Google Cloud project.
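If you want to reproduce this, the CLI-side change is small. The sketch below shows the shape of the `telemetry` block in `.gemini/settings.json`; the exact keys and the environment variable that selects the destination project (`OTLP_GOOGLE_CLOUD_PROJECT` or `GOOGLE_CLOUD_PROJECT`, depending on the CLI version) can differ between releases, so verify them against the CLI's telemetry documentation before rolling it out.

```json
{
  "telemetry": {
    "enabled": true,
    "target": "gcp"
  }
}
```

With the target set to `gcp`, the CLI exports its OpenTelemetry logs and metrics straight into Cloud Logging and Cloud Monitoring in the chosen project, where they can be charted like any other workload metric.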
📊 Inside the Dashboard
The dashboard provides a “Control Room” view of our AI infrastructure.

Panel A: The Reliability Ratio ✅ – We use a stacked bar chart to visualize the success vs. error rate. This is our “smoke detector”: in a healthy environment the bar is made up entirely of successful requests, and any slice of errors signals immediate friction in the developer workflow.
Panel B: The Error Forensic View 🔍 – A line chart breaks down traffic by specific response codes (a trimmed widget definition is sketched after this list).
- 🟢 2xx (Success): Everything is normal; the system is healthy.
- 🟠 4xx (Client Error): Usually an expired authentication token or a malformed prompt. If this spikes, we notify developers so they can investigate on their end (refresh credentials, fix their code, or request a quota increase).
- 🔴 5xx (Server Error): A backend outage or connectivity issue. A spike here is a service disruption that likely affects multiple users at once and needs SRE intervention.
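For reference, here is a trimmed sketch of what this panel looks like in Cloud Monitoring dashboard JSON, the same format as the published dashboard. The metric type (`workload.googleapis.com/gemini_cli.api.request.count`) and the `status_code` label are assumptions based on the metric names the CLI emits over OpenTelemetry; confirm the exact names and label paths in Metrics Explorer once telemetry is flowing.

```json
{
  "title": "Gemini CLI responses by status code",
  "xyChart": {
    "dataSets": [
      {
        "plotType": "LINE",
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "metric.type=\"workload.googleapis.com/gemini_cli.api.request.count\"",
            "aggregation": {
              "alignmentPeriod": "300s",
              "perSeriesAligner": "ALIGN_RATE",
              "crossSeriesReducer": "REDUCE_SUM",
              "groupByFields": ["metric.labels.status_code"]
            }
          }
        }
      }
    ]
  }
}
```

Panel A uses essentially the same query rendered with `"plotType": "STACKED_BAR"`, so one glance gives you the success/error ratio while this view tells you which codes are responsible.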
💡 Why This Matters
By shifting from “monitoring usage” to “monitoring health,” we transformed our support model. We no longer wait for developers to complain that “the AI isn’t working.” We can see error spikes in real-time and proactively troubleshoot authentication or network issues before they impact the broader engineering team.
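To make the proactive part stick without someone staring at a chart, pair the dashboard with an alerting policy on the 5xx series. A minimal sketch is below, using the same assumed metric and label names as the widget above and an illustrative threshold; it can typically be created with `gcloud alpha monitoring policies create --policy-from-file=policy.json` once a notification channel is added.

```json
{
  "displayName": "Gemini CLI 5xx error rate",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "5xx responses above threshold",
      "conditionThreshold": {
        "filter": "metric.type=\"workload.googleapis.com/gemini_cli.api.request.count\" AND metric.labels.status_code = monitoring.regex.full_match(\"5..\")",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.1,
        "duration": "300s",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_RATE",
            "crossSeriesReducer": "REDUCE_SUM"
          }
        ]
      }
    }
  ],
  "notificationChannels": []
}
```

The threshold and duration here are placeholders; tune them to whatever “multiple developers are affected” means for your traffic volume.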
⬇️ Get the Dashboard
I believe every platform team managing AI tools needs this level of visibility. I have open-sourced the complete Dashboard JSON configuration and a step-by-step setup guide on GitHub.
https://github.com/malhotradi/gemini-cli-health-dashboard/tree/main
