

Generating content with Large Language Models (LLMs) is easy. Determining whether the generated content is good is hard. That’s why evaluating LLM outputs with metrics is crucial. Previously, I talked about DeepEval and Promptfoo as some of the tools you can use for LLM evaluation. I also talked about the RAG triad metrics specifically for evaluating Retrieval Augmented Generation (RAG) applications.
In the next few posts, I want to talk about a Google Cloud specific evaluation service: the Gen AI evaluation service in Vertex AI. It lets you evaluate any generative model or application against a set of predefined criteria or your own custom criteria.
It supports a number of metrics out of the box. Let’s explore the classes of metrics that it supports before diving into details.
The first class of metrics is computation-based metrics, so called because they can be computed with a mathematical formula. These include statistical metrics such as ROUGE and BLEU, and deterministic metrics such as tool use (whether the LLM called a tool correctly).
These metrics offer the advantage of determinism: the same input consistently yields the same score.
The disadvantage is that these metrics require a ground truth or reference to score against, which you might not have for your use case. A more fundamental disadvantage is that metrics such as ROUGE and BLEU tend not to capture the nuances of language (more on this later).
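To make this concrete, here’s a minimal sketch of running computation-based metrics with the Vertex AI Python SDK. The project ID and dataset contents are placeholders, and the exact module path (`vertexai.evaluation` vs. `vertexai.preview.evaluation`) depends on your SDK version.

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask  # vertexai.preview.evaluation on older SDK versions

vertexai.init(project="your-project-id", location="us-central1")

# Each row pairs a model response with the reference (ground truth) it is scored against.
eval_dataset = pd.DataFrame({
    "response": ["Paris is the capital of France."],
    "reference": ["The capital of France is Paris."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_l_sum"],  # computation-based metrics, referenced by name
)

result = eval_task.evaluate()
print(result.summary_metrics)  # aggregate scores across the dataset
print(result.metrics_table)    # per-row scores
```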
Computation-based metrics are a good starting point, but at some point you’ll need model-based metrics to fully evaluate your LLM outputs.
The idea of model-based metrics is to use a judge model to assess a candidate model. For most use cases the judge model is Gemini-based, but the service also supports other models, such as MetricX and COMET, for translation use cases.
At this point, you might be wondering: How reliable is it to use one LLM to judge another LLM?
I had the same question, but according to the Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena paper, LLMs reach roughly 80% agreement with human evaluations, which is the same level of agreement humans have with each other.
LLMs tend to be good judges when given good prompts and scoring rubrics. The Gen AI evaluation service provides out-of-the-box prompt templates for model-based metrics such as fluency, groundedness, and verbosity. You can also define your own metrics with your own prompts.
Model-based metrics can further be measured in a pointwise (scoring a single model) or pairwise (comparing two models) fashion.
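As a rough sketch of how this looks in code (assuming the same SDK as above), the snippet below uses built-in pointwise metrics by name and defines a hypothetical custom metric with its own prompt template; the rubric text, metric name, and dataset values are made up for illustration.

```python
import pandas as pd
from vertexai.evaluation import EvalTask, PointwiseMetric

# A hypothetical custom pointwise metric: the judge model scores each response
# against the rubric below. {response} is filled in from the dataset.
conciseness = PointwiseMetric(
    metric="conciseness",
    metric_prompt_template=(
        "Rate the response from 1 (very verbose) to 5 (very concise) "
        "and briefly explain the score.\n\nResponse: {response}"
    ),
)

eval_dataset = pd.DataFrame({
    "prompt": ["Why do LLM outputs need evaluation?"],
    "response": ["Because generating text is easy, but judging its quality is hard."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["fluency", "verbosity", conciseness],  # built-in metrics by name, plus the custom one
)
print(eval_task.evaluate().summary_metrics)
```

Pairwise metrics work the same way, except the dataset also needs the baseline model’s responses so the judge can compare the two candidates side by side.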
It’s very common nowadays for LLM apps to use tools, where the model calls a function to fetch external data. The Gen AI evaluation service provides tool-use metrics that assess whether a tool (function) call is valid, whether the tool name matches, and whether the parameter names and values match what you expect.
These tool-use metrics fall under computation-based metrics, as they are deterministic and computed directly rather than relying on another LLM to evaluate.
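Here’s a rough sketch of how a tool-use evaluation might be wired up with the same SDK. The JSON schema for the predicted and reference tool calls below is an assumption for illustration (as is the get_weather tool), so check the service docs for the exact expected format.

```python
import json
import pandas as pd
from vertexai.evaluation import EvalTask

# Predicted vs. reference tool call. The schema ("content" + "tool_calls" with
# "name" and "arguments") is assumed here; verify against the current docs.
predicted = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
})
reference = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}],
})

eval_dataset = pd.DataFrame({"response": [predicted], "reference": [reference]})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["tool_call_valid", "tool_name_match",
             "tool_parameter_key_match", "tool_parameter_kv_match"],
)
print(eval_task.evaluate().metrics_table)
```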
We’re seeing more and more agentic applications, where an agent uses an LLM and multiple tools to plan and orchestrate tasks. Since they rely on LLMs, they need to be evaluated as well. When evaluating agents, one has to assess not only the final response, but also the trajectory (the sequence of tool calls) the agent took to reach it.
The Gen AI evaluation service supports both. Agent metrics are mostly computation-based, but they can also include model-based elements for evaluating the final response.
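A minimal sketch of a trajectory evaluation, again assuming the same SDK: the column names and the trajectory schema ("tool_name" / "tool_input") follow public samples but may differ in your SDK version, and the flight-booking tools themselves are made up.

```python
import pandas as pd
from vertexai.evaluation import EvalTask

# A trajectory is the ordered list of tool calls the agent made (predicted)
# or should have made (reference). Tool names and inputs are hypothetical.
predicted_trajectory = [
    {"tool_name": "search_flights", "tool_input": {"origin": "SFO", "destination": "JFK"}},
    {"tool_name": "book_flight", "tool_input": {"flight_id": "UA123"}},
]
reference_trajectory = [
    {"tool_name": "search_flights", "tool_input": {"origin": "SFO", "destination": "JFK"}},
    {"tool_name": "book_flight", "tool_input": {"flight_id": "UA123"}},
]

eval_dataset = pd.DataFrame({
    "predicted_trajectory": [predicted_trajectory],  # one row; each cell holds a full trajectory
    "reference_trajectory": [reference_trajectory],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["trajectory_exact_match", "trajectory_in_order_match", "trajectory_precision"],
)
print(eval_task.evaluate().summary_metrics)
```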
So far, all the metrics I mentioned are for text-based outputs. However, LLMs nowadays also produce multimodal outputs (images, videos), and it’s important to evaluate those as well.
The Gen AI evaluation service supports image and video output evaluation with Gecko, a rubric-based, interpretable autorater that offers a more nuanced, customizable, and transparent way to assess the performance of image and video generation models. You can read Evaluate your gen media models with multimodal evaluation on Vertex AI for more details.
As of today, this is the full list of metrics supported out of the box:
Computation-based:
exact_match
bleu
rouge
rouge_1
rouge_2
rouge_l
rouge_l_sum
Model-based:
comet
metricx
coherence
fluency
safety
groundedness
instruction_following
verbosity
text_quality
summarization_quality
question_answering_quality
multi_turn_chat_quality
multi_turn_safety
pairwise_coherence
pairwise_fluency
pairwise_safety
pairwise_groundedness
pairwise_instruction_following
pairwise_verbosity
pairwise_text_quality
pairwise_summarization_quality
pairwise_question_answering_quality
pairwise_multi_turn_chat_quality
pairwise_multi_turn_safety
Tool-use:
tool_call_valid
tool_name_match
tool_parameter_key_match
tool_parameter_kv_match
Agent:
trajectory_exact_match
trajectory_in_order_match
trajectory_any_order_match
trajectory_precision
trajectory_recall
trajectory_single_tool_use
This post provided an overview of the Gen AI evaluation service in Vertex AI and detailed the various metric types it supports. Here are some official references for further information:
You can also check out my curated tutorial and samples:
In subsequent posts, we will delve into practical implementation details for some of these metrics.
Source Credit: https://medium.com/google-cloud/gen-ai-evaluation-service-an-overview-f7d6048cf6e3?source=rss—-e52cf94d98af—4