Managing virtual machines (VMs) in production is a high-stakes task. Performance bottlenecks, resource exhaustion, and service failures can lead to downtime, SLA breaches, and production incidents. Traditionally, troubleshooting involves manual checks — CPU stats, memory usage, disk space, network latency, and service health — often across hundreds of VMs. This can be time-consuming and error-prone.
Gemini CLI
Google’s AI-powered command-line interface integrates with Gemini models to bring intelligent automation and real-time insights directly to your terminal. Instead of running dozens of commands and parsing logs manually, Gemini CLI can diagnose, analyze, and suggest fixes in seconds. Think of it as your AI SRE assistant, capable of analyzing system health, comparing metrics against enterprise thresholds, and suggesting actionable fixes — all in real time.
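If you want to follow along, the sketch below shows one way to get started: install the CLI via npm and run it either interactively or with a one-shot prompt. The package name and the -p (prompt) flag reflect the public Gemini CLI project; verify both against the current documentation for your version.

# Requires a recent Node.js runtime (the session later in this article runs Node 22).
npm install -g @google/gemini-cli

# Start an interactive session in the VM's terminal
gemini

# Or pass a prompt non-interactively and capture the report
gemini -p "Check current CPU utilization and average load on this VM" > cpu-report.md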

1. CPU Bottlenecks
CPU is the heart of the VM: if it is overloaded, applications slow down or freeze. Sustained high CPU usage often signals runaway or backlogged processes, leading to degraded performance and timeouts.
Prompt:
Check current CPU utilization and average load on this VM. Compare results for thresholds:
CPU >80% for 5 minutes then consider it as Warning
CPU >90% for 1 minute then consider it as Critical
List top 5 processes by CPU usage and suggest actions if thresholds are breached.
Output:
✦ Here is the current CPU and load average report:
System Load
- CPU Utilization: 1.93% (98.07% idle)
- Load Average (1, 5, 15 min): 0.27, 0.11, 0.03
Thresholds
- Warning (CPU >80% for 5 min): Not breached.
- Critical (CPU >90% for 1 min): Not breached.
Top 5 Processes by CPU Usage
PID  PPID  CMD                          %CPU
891   871  /home/akash/node-v22.11.0    12.0
935   934  npm install @google/gemini-   3.8
871   622  node /home/akash/node-v22     0.9
264     1  /usr/bin/containerd           0.1
  1     0  /sbin/init                    0.1
Conclusion
The system is currently operating well within normal parameters. No thresholds have been breached, and no immediate action is required.
Commands executed by Gemini CLI:
uptime
mpstat -P ALL 1 1
ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 6
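If you prefer to codify the same thresholds outside of Gemini CLI (for example, in a cron job or CI gate), a minimal bash sketch might look like the following. It approximates the sustained-duration rules from the prompt with a single five-second mpstat sample, so treat it as illustrative rather than a drop-in replacement.

#!/usr/bin/env bash
# Approximate the CPU thresholds from the prompt above with one 5-second sample.
set -euo pipefail

WARN=80   # % CPU busy, warning threshold
CRIT=90   # % CPU busy, critical threshold

# mpstat's "Average" line ends with %idle; busy = 100 - idle.
IDLE=$(LC_ALL=C mpstat 1 5 | awk '/^Average:/ {print $NF}')
BUSY=$(awk -v i="$IDLE" 'BEGIN {printf "%.1f", 100 - i}')
echo "CPU busy: ${BUSY}%"

if (( $(awk -v b="$BUSY" -v c="$CRIT" 'BEGIN {print (b > c)}') )); then
  echo "CRITICAL: CPU above ${CRIT}%"
elif (( $(awk -v b="$BUSY" -v w="$WARN" 'BEGIN {print (b > w)}') )); then
  echo "WARNING: CPU above ${WARN}%"
fi

# Top 5 CPU consumers, matching the report above.
ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 6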
2. Memory Pressure
Memory exhaustion can trigger OOM (out-of-memory) kills, crashing critical services. Applications relying on in-memory caching or large datasets suffer the most, impacting response times and stability.
Prompt:
Analyze memory usage on this VM and report the following:
Total, used, free memory
Swap usage
Compare against thresholds:
Free memory <10% or Swap >20% then warning
Swap >50% then critical
Identify top 5 processes by memory usage and suggest fixes if thresholds exceeded.
Output:
✦ Here is the current memory usage report:
Memory Usage
- Total Memory: 31Gi
- Used Memory: 1.1Gi
- Free Memory: 30Gi
- Free Memory Percentage: ~96%
Swap Usage
- Total Swap: 8.0Gi
- Used Swap: 0B
- Swap Usage Percentage: 0%
Thresholds
- Warning (Free memory <10% or Swap >20%): Not breached.
- Critical (Swap >50%): Not breached.
Top 5 Processes by Memory Usage
PID  PPID  CMD                          %MEM  %CPU
891   871  /home/akash/node-v22.11.0     1.0   0.3
871   622  node /home/akash/node-v22     0.5   0.0
333     1  /usr/bin/dockerd -H fd:// -   0.2   0.0
264     1  /usr/bin/containerd           0.1   0.2
279     1  /usr/bin/python3 /usr/share   0.0   0.0

Conclusion
The system's memory usage is very low and well within the defined thresholds. No action is required.
Commands executed by Gemini CLI:
free -h
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -n 6
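The same memory thresholds can be expressed as a small script if you want a non-AI fallback. The sketch below uses free's "available" column as the working definition of free memory, which is an assumption on my part rather than something the prompt specifies.

#!/usr/bin/env bash
# Approximate the memory thresholds from the prompt above.
set -euo pipefail

read -r MEM_TOTAL MEM_AVAIL < <(free -m | awk '/^Mem:/ {print $2, $7}')
read -r SWAP_TOTAL SWAP_USED < <(free -m | awk '/^Swap:/ {print $2, $3}')

FREE_PCT=$(( MEM_AVAIL * 100 / MEM_TOTAL ))
SWAP_PCT=0
if (( SWAP_TOTAL > 0 )); then
  SWAP_PCT=$(( SWAP_USED * 100 / SWAP_TOTAL ))
fi
echo "Available memory: ${FREE_PCT}%  Swap used: ${SWAP_PCT}%"

if (( SWAP_PCT > 50 )); then
  echo "CRITICAL: swap usage above 50%"
elif (( FREE_PCT < 10 || SWAP_PCT > 20 )); then
  echo "WARNING: low available memory or elevated swap usage"
fi

# Top 5 memory consumers, matching the report above.
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -n 6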
3. Disk Space & I/O
Disk space is vital for logs, databases, and app writes. When disks fill up or I/O slows, services fail, transactions stall, and data-integrity risks spike, especially in high-throughput applications.
Prompt:
Check disk usage for all mount points and I/O performance and report:
Disk usage percentage
IOPS and latency
Alert if
Disk usage >80% then warning
Disk usage >90% then critical
Latency >20ms then performance issue
Suggest cleanup or resizing if thresholds breached.
Output:
✦ Here is the disk usage and I/O performance report:
Disk Usage
┌────────────┬───────┬──────┬───────┬──────┬──────────────────────┐
│ Filesystem │ Size  │ Used │ Avail │ Use% │ Mounted on           │
├────────────┼───────┼──────┼───────┼──────┼──────────────────────┤
│ /dev/sdd   │ 1007G │ 4.3G │ 952G  │ 1%   │ /                    │
│ C:\        │ 951G  │ 143G │ 809G  │ 15%  │ /mnt/c               │
│ drivers    │ 951G  │ 143G │ 809G  │ 15%  │ /usr/lib/wsl/drivers │
└────────────┴───────┴──────┴───────┴──────┴──────────────────────┘
- Warning (Disk usage >80%): Not breached.
- Critical (Disk usage >90%): Not breached.
All filesystems are well below the usage thresholds.
I/O Performance (for /dev/sdd)
- IOPS (Read/Write): 1.91 / 1.13
- Latency (Read/Write ms): 0.25 / 8.02
- Performance Issue (Latency >20ms): Not breached.
The disk I/O performance is good, with latency well below the 20ms threshold.

Conclusion
There are no issues with disk usage or I/O performance at this time. No action is required.
Commands executed by Gemini CLI:
df -h
iostat -dx
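For the capacity half of this check, a plain script can enforce the same 80%/90% rules; interpreting I/O latency from iostat's r_await/w_await columns is harder to script meaningfully, which is where the AI summary earns its keep. A minimal sketch, assuming GNU df:

#!/usr/bin/env bash
# Approximate the disk-usage thresholds from the prompt above (GNU df assumed).
set -euo pipefail

df -h --output=target,pcent | tail -n +2 | while read -r MOUNT PCENT; do
  USE=${PCENT%\%}
  [[ $USE =~ ^[0-9]+$ ]] || continue   # skip entries without a usage figure
  if (( USE > 90 )); then
    echo "CRITICAL: ${MOUNT} at ${USE}% used - resize the disk or purge data now"
  elif (( USE > 80 )); then
    echo "WARNING: ${MOUNT} at ${USE}% used - schedule cleanup (old logs, caches, images)"
  fi
done

# Raw I/O statistics for latency review (r_await / w_await are in ms).
iostat -dx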
4. Network Latency & Connectivity
Network issues break the backbone of any application. High latency or packet loss can cause API connectivity failures, database connection timeouts, and similar errors. User experience suffers, and unresolved problems can escalate into outages.
Prompt:
Diagnose network health and report:
Basic connectivity to internet
Packet loss percentage on VM
Bandwidth usage per interface
Alert if
Latency >100ms then warning
Packet loss >1% then critical
Suggest firewall/VPC fixes or bandwidth optimization if thresholds exceeded.
Output:
✦ Summary:
Your network health appears to be excellent. All metrics are within normal parameters.
Details:
* Basic Internet Connectivity:
    * Successfully connected to the internet. 5 out of 5 packets were transmitted to and received from google.com.
* Latency:
    * The average latency to google.com was 34.008ms.
    * The maximum latency was 71.997ms.
    * Result: This is well below the 100ms warning threshold.
* Packet Loss:
    * There was 0% packet loss during the test.
    * Result: This is below the 1% critical threshold.
* Bandwidth Usage:
    * Bandwidth usage was monitored for 5 seconds. The following are the average values:
        * lo (localhost): ~2 KB/s in/out
        * eth0: ~1 KB/s in/out
        * docker0: 0 KB/s in/out
    * Result: Bandwidth usage is very low on all interfaces.

* Alerts:
    * Latency: No warning.
    * Packet Loss: No critical alert.

* Suggestions:
    * Your network is performing well. No firewall, VPC, or bandwidth optimization changes are recommended at this time.
Commands executed by Gemini CLI:
ping -c 5 google.com
ifstat -a -t 1 5
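The latency and packet-loss thresholds are also easy to encode directly if you want a scripted guardrail alongside the AI report; the parsing below assumes the Linux iputils ping output format.

#!/usr/bin/env bash
# Approximate the network thresholds from the prompt above (Linux iputils ping assumed).
set -euo pipefail

TARGET="google.com"
PING_OUT=$(ping -c 5 "$TARGET") || { echo "CRITICAL: no connectivity to $TARGET"; exit 1; }

# "X% packet loss" from the summary line, and the avg value from "rtt min/avg/max/mdev".
LOSS=$(grep -oE '[0-9.]+% packet loss' <<<"$PING_OUT" | cut -d'%' -f1)
AVG_RTT=$(awk -F'/' '/^rtt/ {print $5}' <<<"$PING_OUT")
echo "Average latency: ${AVG_RTT} ms  Packet loss: ${LOSS}%"

if (( $(awk -v l="$LOSS" 'BEGIN {print (l > 1)}') )); then
  echo "CRITICAL: packet loss above 1% - check firewall/VPC rules and routes"
elif (( $(awk -v r="$AVG_RTT" 'BEGIN {print (r > 100)}') )); then
  echo "WARNING: average latency above 100 ms"
fi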
5. Critical SSHD Service & Limits
SSHD provides remote SSH access for ops teams; if it is down, teams are locked out. Low ulimit values choke high-load apps, causing crashes under peak traffic, which makes these checks essential for production resilience.
Prompt:
Verify SSHD service. Check ulimit for file descriptors.
Alert if
SSHD is inactive
ulimit is too low for high-load apps
Provide steps to restart services or increase limits if needed.
Output:
✦ System Check Summary

SSHD Service:
* Status: Inactive

File Descriptors (ulimit):
* Limit: 1048576

Alerts & Recommendations:

* SSHD Inactive: The SSH daemon is not running.
    * To start it now: sudo systemctl start sshd
    * To enable it on boot: sudo systemctl enable sshd

* ulimit: The file descriptor limit is high and suitable for demanding applications. No action is required.
Commands executed by Gemini CLI:
systemctl is-active sshd
ulimit -n
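If you want the remediation steps baked into a script instead of following them by hand, a sketch like the one below covers both checks. The 65,536 descriptor floor is an arbitrary illustrative value, not something the prompt or Gemini CLI prescribes.

#!/usr/bin/env bash
# Check SSHD and the file-descriptor limit, applying the fixes suggested above.
set -uo pipefail

if systemctl is-active --quiet sshd; then
  echo "SSHD: active"
else
  echo "SSHD: inactive - starting it and enabling it on boot"
  sudo systemctl start sshd
  sudo systemctl enable sshd
fi

LIMIT=$(ulimit -n)
echo "Open file descriptor limit: ${LIMIT}"
# 65536 is an illustrative floor for high-load apps, not an official recommendation.
if [[ "$LIMIT" != "unlimited" ]] && (( LIMIT < 65536 )); then
  echo "Consider raising nofile in /etc/security/limits.conf, e.g.:"
  echo "  *  soft  nofile  65536"
  echo "  *  hard  nofile  65536"
fi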
6. Miscellaneous System Checks
Zombie processes indicate incomplete cleanup after execution, consuming process table entries. While they don't use CPU or memory, too many zombies can exhaust PID limits and block new processes, which is critical in high-load systems. Kernel logs reveal low-level hardware or driver issues such as disk failures or memory errors. Repeated failed SSH attempts often signal brute-force attacks or misconfigured credentials; left unchecked, they can escalate into security breaches or account lockouts, impacting operational access. File descriptors represent open files, sockets, and pipes. If usage nears system limits, applications may fail to open new connections or files, causing cascading service outages under heavy traffic.
Prompt:
Perform additional health checks:
Detect zombie processes and report count
Scan kernel logs for hardware errors
Check failed SSH login attempts in /var/log/auth.log
Report current open file descriptors count
Show load average trends for last 15 minutes
Highlight any anomalies and suggest remediation.
Output:
✦ Health Check Summary
Zombie Processes:
* Count: 0 (Healthy)
Kernel Log Errors:
* Errors:
* PCI: Fatal: No config space access function found
* misc dxg: dxgk: ... Ioctl failed
* WSL (...) ERROR: CheckConnection: getaddrinfo() failed
* Analysis: Kernel log shows a potential hardware issue (PCI), unusual graphics-related errors for a server (dxgk), and WSL networking problems.
* Remediation:
* PCI: Run lspci -vv to investigate the hardware.
* dxgk: Ignore if this is a server; otherwise, check graphics drivers.
* WSL: Investigate WSL networking configuration.
Failed SSH Logins:
* Count: 0 (Good)
Open File Descriptors:
* Count: 1708 (Well within the 1,048,576 limit)

Load Average (1, 5, 15 min):
* Values: 0.19, 0.10, 0.03 (System is idle)

Anomalies & Remediation Summary:
The primary concern is the kernel log's PCI error, which warrants investigation. The dxgk and WSL errors are environment-specific and may not be critical.

All checks are complete.
Commands executed by Gemini CLI:
ps aux | awk '$8=="Z"' | wc -l
journalctl -k -p err
grep "Failed password" /var/log/auth.log | wc -l
lsof | wc -l
uptime
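To turn these one-off prompts into routine monitoring, you can wrap them in a small script and run it from cron. The wrapper below is a hypothetical layout: the prompts/ directory, file names, and report path are illustrative, and it assumes Gemini CLI's non-interactive -p flag shown earlier.

#!/usr/bin/env bash
# Run each saved health-check prompt through Gemini CLI and archive the reports.
set -euo pipefail

PROMPT_DIR="prompts"              # one .txt file per check: cpu.txt, memory.txt, disk.txt, ...
REPORT_DIR="/var/log/vm-health"
STAMP=$(date +%Y%m%dT%H%M%S)
mkdir -p "$REPORT_DIR"

for prompt_file in "$PROMPT_DIR"/*.txt; do
  name=$(basename "$prompt_file" .txt)
  gemini -p "$(cat "$prompt_file")" > "${REPORT_DIR}/${STAMP}-${name}.md"
  echo "Wrote ${REPORT_DIR}/${STAMP}-${name}.md"
done

# Example cron entry (every 30 minutes):
# */30 * * * * /usr/local/bin/vm-health-check.sh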
That's it: we have completed a comprehensive set of checks on a VM using Gemini CLI.
Conclusion:
Enterprises operate in environments where downtime is costly, performance is critical, and scale is non-negotiable. Traditional troubleshooting methods, built on manual commands, ad-hoc scripts, and a reactive approach, cannot keep up with the complexity and velocity of modern infrastructure. Gemini CLI bridges this gap by bringing AI-driven intelligence directly into operations workflows. It transforms VM monitoring and troubleshooting from a manual, error-prone process into an automated, proactive, and context-aware experience. By leveraging rich prompts and backend automation, Gemini CLI doesn't just check metrics: it interprets them against graded thresholds, recommends fixes, and even integrates with alerting workflows. This means SRE and DevOps teams can focus on strategic improvements instead of reacting to incidents after they occur. In short, Gemini CLI is an AI-powered companion that enables teams to achieve resilience, efficiency, and agility at scale.
