
In my Gen AI Evaluation Service — An Overview post, I introduced Vertex AI’s Gen AI evaluation service and talked about the various classes of metrics it supports. In today’s post, I want to dive into computation-based metrics, what they provide, and discuss their limitations.
Computation-based metrics are metrics that can be calculated using a mathematical formula. They’re deterministic — the same input produces the same score, unlike model-based metrics where you might get slightly different scores for the same input.
They require a reference (ground truth) to compare against when calculating the score. That can be a drawback if you don’t have such ground truth. Also, they tend to be quite rigid and do not capture the nuances of language. We’ll talk more about this later.
These are the computation-based metrics supported by Gen AI evaluation service out of the box:
exact_match
bleu
rouge
rouge_1
rouge_2
rouge_l
rouge_l_sum
Note: There are metrics related to tool use and agents, which can also be considered computation-based. We’ll cover them separately in a future blog post.
The EXACT_MATCH metric computes whether a generated text matches a reference text exactly: 0 means no match, 1 means a match.
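Conceptually, exact match is just a strict string comparison. A minimal sketch of the idea (my own illustration, not the service’s implementation):

# Illustrative sketch only, not the evaluation service's implementation.
def exact_match_score(response: str, reference: str) -> float:
    """Return 1.0 if the response matches the reference exactly, 0.0 otherwise."""
    return 1.0 if response == reference else 0.0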
The BLEU (BiLingual Evaluation Understudy) metric evaluates the similarity between generated text and reference text. It measures precision (accuracy) and returns a float in the range 0-1, where 0 means poor similarity and 1 means a perfect match to the reference.
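To build an intuition for how BLEU behaves, you can also compute it locally. Here’s a minimal sketch using the NLTK library; the library choice is my own, and the service computes BLEU for you when you request the bleu metric:

# Local BLEU sketch for intuition only (uses NLTK, not the evaluation service).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# BLEU compares n-gram overlap between the candidate and the reference.
# Smoothing avoids zero scores when higher-order n-grams don't overlap.
reference = "I am good".split()
candidate = "I'm good".split()
score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(score)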
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric evaluates text summaries. It measures recall (completeness) and returns a float in the range 0-1, where 0 means poor similarity and 1 means strong similarity to the reference.
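You can do the same for ROUGE with the open-source rouge-score package. Again, this is just a local sketch for intuition; the evaluation service calculates these scores for you:

# Local ROUGE sketch for intuition only (uses the rouge-score package).
from rouge_score import rouge_scorer

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

# score() takes the reference (target) first, then the generated text (prediction).
scores = scorer.score("The cat sat on the mat.", "The cat lay on the mat.")
print(scores["rougeL"].fmeasure)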
Additionally, you can define your own custom metric with the CustomMetric class.
Let’s see how we can run an evaluation with the standard metrics.
First, define an evaluation dataset with responses from the model and the reference you want to compare against:
import pandas

responses = [
    "Hello, how are you?",
    "I'm good",
    "The cat lay on the mat."
]
references = [
    "Hello, how are you?",
    "I am good",
    "The cat sat on the mat."
]

eval_dataset = pandas.DataFrame(
    {
        "response": responses,
        "reference": references,
    }
)
Then, define the metrics. Here, we’ll go with EXACT_MATCH, BLEU, and the standard ROUGE metrics:
metrics = [
    Metric.EXACT_MATCH,
    Metric.BLEU,
    Metric.ROUGE,
]
Create an evaluation task, run it, and then print the results:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=get_experiment_name(__file__),
)

eval_result = eval_task.evaluate()
print_eval_result(eval_result)
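get_experiment_name and print_eval_result are small utility helpers in the sample, not part of the SDK. If you want to reproduce them, something along these lines works (a sketch, assuming the EvalResult object exposes summary_metrics and metrics_table as in the Vertex AI SDK):

# Sketch of the helper utilities used above (not part of the Vertex AI SDK).
import pathlib

def get_experiment_name(file: str) -> str:
    """Derive a Vertex AI experiment name from the script's file name."""
    return pathlib.Path(file).stem.replace("_", "-")

def print_eval_result(eval_result) -> None:
    """Print the summary metrics and the per-row metrics table."""
    print("==Summary metrics==")
    for name, value in eval_result.summary_metrics.items():
        print(f"{name}: {value}")
    print("==Metrics table==")
    print(eval_result.metrics_table)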
Run the evaluation:
python computation.py

==Summary metrics==
row_count: 3
exact_match/mean: 0.3333333333333333
exact_match/std: 0.5773502691896258
bleu/mean: 0.5596273066666667
bleu/std: 0.40962295108770713
rouge/mean: 0.8333333333333334
rouge/std: 0.16666665000000253
==Metrics table==
response reference exact_match/score bleu/score rouge/score
0 Hello, how are you? Hello, how are you? 1.0 1.000000 1.000000
1 I'm good I am good 0.0 0.189959 0.666667
2 The cat lay on the mat. The cat sat on the mat. 0.0 0.488923 0.833333
Note that I’m simply printing the results with a utility function, but you can also view the results in Vertex AI’s Experiments section.
You can see computation.py for the full sample.
Let’s now talk about why computation-based metrics fall short in capturing semantic meaning.
These are the results we had:
==Metrics table==
response reference exact_match/score bleu/score rouge/score
0 Hello, how are you? Hello, how are you? 1.0 1.000000 1.000000
1 I'm good I am good 0.0 0.189959 0.666667
2 The cat lay on the mat. The cat sat on the mat. 0.0 0.488923 0.833333
As you can see, when there’s a perfect match, all the scores are good. However, when there’s a slight change in wording, the scores drop significantly. As humans, we understand that “I’m good” and “I am good” have the same meaning, yet the BLEU score is already down to 0.19 and ROUGE is not much better at 0.67.
Because of the rigid nature of these metrics, you probably need to use model-based metrics in your evaluations. We’ll cover this in a future blog post.
You can also define your own metrics and run them as part of your evaluation. Note that these custom metrics run on the client side without going through the Gen AI evaluation service, but they can still be useful as part of your overall evaluation.
For example, let’s say you want to use “word count” as a custom metric.
First, you define your metric with the CustomMetric class:
custom_word_count_metric = CustomMetric(
    name="word_count",
    metric_function=word_count,
)
Then, you define the actual function that will do the word counting and return the result:
def word_count(instance: dict[str, str]) -> dict[str, float]:
    """Count the number of words in the response."""
    response = instance["response"]
    score = len(response.split(" "))
    return {
        "word_count": score,
    }
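The custom metric is then passed to an EvalTask just like the built-in metrics. Reusing the dataset and helpers from before, it looks something like this (a sketch mirroring the earlier snippet):

# Sketch: wiring the custom metric into an EvalTask, mirroring the earlier example.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_word_count_metric],
    experiment=get_experiment_name(__file__),
)

eval_result = eval_task.evaluate()
print_eval_result(eval_result)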
Run the evaluation:
python computation_custom.py

==Metrics table==
response word_count/score
0 Hello, how are you? 4
1 I'm good 2
2 The cat lay on the mat. 6
Take a look at computation_custom.py for the full sample.
This wraps up our discussion of computation-based metrics. While they’re a useful baseline in your evaluations, they fall short in capturing the nuances of language. In the next post, we’ll take a look at model-based metrics in more detail.
Source Credit: https://medium.com/google-cloud/gen-ai-evaluation-service-computation-based-metrics-1d124c13b924