If you’re building with the Agent Development Kit (ADK), you are familiar with its capabilities. Evaluating your new agent, however, can be harder than building it: with a web UI, a CLI (Command Line Interface), and testing framework integrations, it is not obvious which tool to reach for. This article focuses on the ADK Web UI, which provides a fast, interactive development loop for quickly testing and debugging your agent.
The goal of this article series is to bring order to those options with a clear, practical, step-by-step guide to evaluating your ADK agents. We will start with hands-on, interactive testing and build towards a fully automated CI/CD pipeline for your agent.
Your evaluation strategy depends on where you are in the development lifecycle. The following chart will guide you through this series and help you choose the right tool for the job.
This first article focuses on the critical starting point: the fast, interactive development loop using the ADK Web UI.
To make things concrete, we will build, test, and harden a single agent throughout this series: the ProductResearchAgent. Its job is to answer questions about product details and prices using some simple Python tools.
Here is the complete code, which we’ll place in a simple_agent directory.
# simple_agent/agent.py
from google.adk.agents import LlmAgent
from google.adk.tools import FunctionTool

# Tool Definitions
def get_product_details(product_name: str) -> str:
    """Gathers details about a product in the catalog."""
    details = {
        "smartphone": "A cutting-edge smartphone with advanced camera features and lightning-fast processing.",
        "usb charger": "A super fast and light usb charger",
        "shoes": "High-performance running shoes designed for comfort, support, and speed.",
        "headphones": "Wireless headphones with advanced noise cancellation technology for immersive audio.",
        "speaker": "A voice-controlled smart speaker that plays music, sets alarms, and controls smart home devices.",
    }
    return details.get(product_name.lower(), "Product details not found.")
def get_product_price(product_name: str) -> str:
    """Gathers price about a product. Returns price as a string for consistency with other tool."""
    prices = {
        "smartphone": "500",
        "usb charger": "10",
        "shoes": "100",
        "headphones": "50",
        "speaker": "80",
    }
    # Return a string even on a miss, matching the declared return type.
    return prices.get(product_name.lower(), "Price not found.")
def lookup_product_information(product_name: str) -> str:
    """Looks up specific information for a product in the catalog."""
    backend_info = {
        "smartphone": "SKU: G-SMRT-001, Inventory: 550 units",
        "usb charger": "SKU: G-CHRG-003, Inventory: 1200 units",
        "shoes": "SKU: G-SHOE-007, Inventory: 800 units",
        "headphones": "SKU: G-HDPN-002, Inventory: 950 units",
        "speaker": "SKU: G-SPKR-001, Inventory: 400 units",
    }
    return backend_info.get(product_name.lower(), "Backend information not found.")

# Wrap functions as ADK FunctionTools
get_product_details_tool = FunctionTool(func=get_product_details)
get_product_price_tool = FunctionTool(func=get_product_price)
get_product_information_tool = FunctionTool(func=lookup_product_information)

# Define the Root Agent
root_agent = LlmAgent(
    name="ProductResearchAgent",
    model="gemini-2.5-flash",
    description="An agent that provides details and prices for various products.",
    instruction="""You are a helpful customer support agent specializing in product information.
    Your goal is to answer user queries about product details or prices.
    1. Analyze the user's query to identify the product name.
    2. If the user is asking for the price, use the `get_product_price` tool.
    3. If the user is asking for information or details, use the available tools to find the answer depending on the specific aspect they are inquiring about.
    """,
    tools=[
        get_product_details_tool,
        get_product_price_tool,
        get_product_information_tool,
    ],
)
Notice how similar the docstring of lookup_product_information is to that of get_product_details. Our initial agent instruction is also quite generic: “If the user is asking for information or details, use the available tools to find the answer depending on the specific aspect they are inquiring about.” This ambiguity is a classic trap for an LLM.
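You can see the overlap concretely by calling the two plain Python functions yourself (run from the directory that contains simple_agent). Both return a perfectly plausible answer for the same product, so only the docstrings and the agent instruction can tell the model which one the user actually wants:
from simple_agent.agent import get_product_details, lookup_product_information

# Customer-facing description
print(get_product_details("headphones"))
# -> Wireless headphones with advanced noise cancellation technology for immersive audio.

# Internal backend data for the very same query term
print(lookup_product_information("headphones"))
# -> SKU: G-HDPN-002, Inventory: 950 units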
Let’s walk through a realistic development scenario. In this scenario:
- An agent developer writes the agent’s code.
- An agent tester defines and curates the “golden path” cases by testing the agent.
- The agent developer runs the evaluation to ensure the agent passes those cases. If it fails, they fix the code and the cycle repeats.
Now, let’s put this agent to the test using the web UI.
First, the agent developer builds a working version of the agent with instructions that look reasonable at first glance. Then they fire the agent up with the ADK web server on a shared server:
cd ./1_web_cli_eval_method
adk web --reload_agents
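A quick note on the layout this assumes: adk web discovers agents in the subdirectories of the folder it is launched from, so simple_agent needs to be an importable Python package inside it (this is the standard ADK project convention; the 1_web_cli_eval_method folder name is just this walkthrough’s choice).
# 1_web_cli_eval_method/
# └── simple_agent/
#     ├── __init__.py   # marks the folder as an agent package
#     └── agent.py      # the code shown above
#
# simple_agent/__init__.py only needs to expose the agent module:
from . import agent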
Once the server is running, the agent tester steps in to validate the agent’s behavior and define what success looks like. They run through a standard conversational flow, as shown below.
The agent tester spots that the agent gave the wrong answer: the correct response should be the customer-facing description, not internal backend data. They now use the UI to create the “golden path” test case. To capture it:
- In the right-hand panel, click the Eval tab.
- Create a new eval set (e.g., product_agent_interactive_tests).
- Click the “Add current session” button.
- Select the newly created test case and click “Edit”. In the final turn, manually change the agent’s expected response from the incorrect SKU information to the correct text: “Wireless headphones with advanced noise cancellation technology for immersive audio.” (the quick check after this list confirms the text matches the tool output exactly).
- Save this corrected case.
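Since the golden response is pasted in by hand, it is worth a quick sanity check that it matches the catalog text returned by the tool exactly; depending on the threshold you pick later, even small wording differences can drag the match score below the pass mark. A short check against the tool itself (run from the project directory) is enough:
from simple_agent.agent import get_product_details

# The hand-edited golden answer should equal the catalog description word for word.
golden = "Wireless headphones with advanced noise cancellation technology for immersive audio."
assert get_product_details("headphones") == golden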
At this point, the agent tester notifies the agent developer, “I’ve created a test case”.
The agent developer opens the ADK Web UI and follows these steps:
- Select the test case(s) you want to run.
- Click Run Evaluation.
- In the dialog, configure your passing thresholds using the sliders.
- Click Start.
- The results appear instantly, as illustrated below.
The test immediately FAILS: the UI shows that the agent’s live response does not match the expected response defined by the tester.
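To get an intuition for why the score lands below the threshold, here is a rough stand-in for the idea (an illustration only, not ADK’s actual scorer; ADK computes its own response-match metric when you click Start): the live SKU answer simply has almost nothing in common with the golden description.
from difflib import SequenceMatcher

expected = "Wireless headphones with advanced noise cancellation technology for immersive audio."
actual = "SKU: G-HDPN-002, Inventory: 950 units"

# A crude character-level similarity ratio; it comes out very low because the
# two responses barely overlap, so any sensible threshold marks this a failure.
print(round(SequenceMatcher(None, expected.lower(), actual.lower()).ratio(), 2))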
To understand the root cause of the failure, the agent developer turns to the Trace tab, an invaluable resource that provides a detailed, step-by-step log of the agent’s entire reasoning process, as shown below.
Clicking any step in the trace reveals the raw request sent to the LLM (Large Language Model), the model’s complete response, and a visual graph of the agent’s logic. The developer uses the trace to see that the agent chose lookup_product_information instead of get_product_details. This feature is the ultimate “show how it works” view for an AI agent and is indispensable for debugging complex behavior.
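As a simplified illustration of what to look for in that raw model response (the field names below follow the general Gemini function-calling shape and are trimmed down, not copied from the UI), the model’s turn contains a function call that names the tool it picked, and that is where the wrong choice jumps out:
# Simplified sketch of the function-call part visible in the traced response.
model_function_call = {
    "name": "lookup_product_information",   # wrong tool for a "tell me about..." query
    "args": {"product_name": "headphones"},
}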
With the reason for the failure now clear, the agent developer knows exactly what to fix. They update the agent’s instructions to be specific and unambiguous, as shown below.
# Define the Root Agent
root_agent = LlmAgent(
    name="ProductResearchAgent",
    model="gemini-2.5-flash",
    description="An agent that provides details and prices for various products.",
    instruction="""You are a helpful customer support agent specializing in product information.
    Your goal is to answer user queries about product details or prices.
    1. For general, customer-facing descriptions (like 'tell me about...'), ALWAYS use the `get_product_details` tool.
    2. For internal data like SKU or inventory, use the `lookup_product_information` tool.
    3. If the user is asking for the price, use the `get_product_price` tool.
    """,
    tools=[
        get_product_details_tool,
        get_product_price_tool,
        get_product_information_tool,
    ],
)
As a side note, another way to improve the agent in this case is to clarify the tool descriptions. In fact, the tools’ docstrings are vague enough to confuse the model on their own. After the same debugging exercise, the agent developer could sharpen these descriptions instead, and the model would behave as expected.
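One possible wording for those sharper docstrings (a sketch of ours, not final code; the function bodies stay exactly as before) is to state explicitly which tool is customer-facing and which is internal:
def get_product_details(product_name: str) -> str:
    """Returns the customer-facing marketing description of a product.

    Use this for general "tell me about..." style questions.
    """
    ...

def lookup_product_information(product_name: str) -> str:
    """Returns internal backend data (SKU and inventory count) for a product.

    Use this only when the user explicitly asks for SKU or stock levels.
    """
    ...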
With the code updated and hot-reloaded, the agent tester returns to the Web UI and runs the same interaction again. This time, guided by the clearer instructions, the agent correctly calls get_product_details and produces the expected response. The tester captures the new interaction as a new test case and notifies the agent developer, who runs the evaluation again. The test PASSES, and the regression is fixed.
The ADK Web UI is the essential first step in building production-ready agents. It provides a fast, interactive inner loop that allows you to test and debug your agents.
Are you ready to try it yourself? Follow the steps in this guide to set up and test your first ADK agent!
Manual testing, however, doesn’t scale, and relying on it alone is not recommended.
In Part 2 of this series, we will take the test cases we just built and automate them. We’ll start with the adk eval CLI as our foundation for simple, headless CI/CD runs. Then, following the branches of the chart, we’ll explore how to use pytest when you need more programmatic control, and how to integrate external platforms like Vertex AI and its GenAI Evaluation service when you need more advanced, qualitative metrics.
Thanks for reading! We hope this technical walkthrough was valuable. If you have questions or feedback on ADK agent evaluation, or suggestions for future Vertex AI topics, connect with us on LinkedIn or X/Twitter. Share your experiences in the Agents community on Discuss.
Happy building.
Source Credit: https://medium.com/google-cloud/evaluating-agents-with-adk-part-1-the-development-loop-with-the-adk-web-ui-7822b592498a