
Check out these examples of running rubric-based evaluation for instruction-following, multimodal, and text quality. We have also worked with our research team to implement rubric-based autoraters for text-to-image and text-to-video.
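For a flavor of what this looks like in code, here is a minimal sketch of a pointwise quality evaluation using the Gen AI evaluation SDK's prebuilt judge-model metrics for instruction following and text quality. The project ID and dataset rows are placeholders, and the rubric-based autoraters mentioned above may expose a different interface:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# Placeholder project and location; replace with your own.
vertexai.init(project="your-project-id", location="us-central1")

# A toy dataset: each row pairs a prompt with a model response to judge.
eval_dataset = pd.DataFrame({
    "prompt": ["Summarize the attached report in three bullet points."],
    "response": ["- Revenue grew 12%.\n- Costs fell 3%.\n- Margin improved."],
})

# Prebuilt pointwise metrics ship with rubric-style judge prompt templates.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
        MetricPromptTemplateExamples.Pointwise.TEXT_QUALITY,
    ],
)
result = eval_task.evaluate()
print(result.summary_metrics)
```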
4. Agent evaluation
We are at the beginning of the agentic era, where agents reason, plan, and use tools to accomplish complex tasks. However, evaluating these agents presents a unique challenge. It’s no longer sufficient to just assess the final response; we need to validate the entire decision-making process. “Did the agent choose the right tool?”, “Did it follow a logical sequence of steps?”, “Did it effectively store and use information to provide personalized answers?”. These are some of the critical questions that determine an agent’s reliability.
To address some of these challenges, the Gen AI evaluation service in Vertex AI introduces capabilities specifically for agent evaluation. You can evaluate not only the agent’s final output but also gain insights into its “trajectory”—the sequence of actions and tool calls it makes. With specialized metrics for trajectory, you can assess your agent’s reasoning path. Whether you’re building with Agent Development Kit, LangGraph, CrewAI, or other frameworks, and hosting them locally or on Vertex AI Agent Engine, you can analyze if the agent’s actions were logical and if the right tools were used at the right time. All results are integrated with Vertex AI Experiments, providing a robust system to track, compare, and visualize performance, enabling you to build more reliable and effective AI agents.
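As a concrete illustration, the sketch below evaluates an agent's trajectory against a reference trajectory using the SDK's computation-based trajectory metrics. The tool names, arguments, and experiment name are placeholders:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Each trajectory is the ordered list of tool calls the agent made (predicted)
# or should have made (reference); the tools here are illustrative.
predicted = [[
    {"tool_name": "search_flights", "tool_input": {"destination": "SFO"}},
    {"tool_name": "book_flight", "tool_input": {"flight_id": "UA123"}},
]]
reference = [[
    {"tool_name": "search_flights", "tool_input": {"destination": "SFO"}},
    {"tool_name": "book_flight", "tool_input": {"flight_id": "UA123"}},
]]

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "predicted_trajectory": predicted,
        "reference_trajectory": reference,
    }),
    metrics=[
        "trajectory_exact_match",  # every step matches the reference, in order
        "trajectory_precision",    # fraction of predicted steps found in reference
        "trajectory_recall",       # fraction of reference steps the agent took
    ],
    experiment="agent-trajectory-eval",  # logs runs to Vertex AI Experiments
)
result = eval_task.evaluate()
print(result.summary_metrics)
```

Because the dataset only needs predicted and reference trajectories, the same evaluation works regardless of whether the agent was built with Agent Development Kit, LangGraph, CrewAI, or another framework.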
Here you can find detailed documentation with several examples of agent evaluation with the Gen AI evaluation service on Vertex AI.
Finally, we recognize that evaluation remains a research frontier. We believe that collaborative efforts are key to addressing current challenges. Therefore, we are actively working with companies like Weights & Biases, Arize, and Maxim AI. Together, we aim to find solutions for open challenges such as the cold-start data problem, multi-agent evaluation, and real-world agent simulation for validation.
Get started today
Ready to build reliable, production-ready LLM applications on Vertex AI? The Gen AI evaluation service in Vertex AI addresses the features users have requested most, providing a powerful, comprehensive suite for evaluating your AI applications. By enabling you to scale evaluations, build trust in your autorater, and assess multimodal and agentic use cases, we want to foster confidence and efficiency, ensuring your LLM-based applications perform as expected in production.
Check the comprehensive documentation and code examples for the Gen AI evaluation service.
Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/how-to-evaluate-your-gen-ai-at-every-stage/