

In our previous blog, we introduced the Model Garden, covering model deployment and testing predictions. This time, we will delve into evaluating responses from two different models to aid in decision-making.
Ever wondered which large language model (LLM) is best for your project? With so many powerful options like Anthropic Claude, Gemini, Gemma and Meta Llama available through Google Cloud’s Model Garden, choosing the right one can feel like finding a needle in a haystack! In this post, we will show you how to build a simple, interactive tool using Streamlit to compare LLM responses side-by-side, making your decision-making a breeze.
The goal is to develop a tool that showcases the relative performance of various partner model APIs accessible through Model Garden. This will be built using the Streamlit framework.
Streamlit is an open-source Python library that allows you to quickly create and share beautiful, custom web apps for machine learning and data science. It transforms data scripts into shareable web apps in minutes, all in pure Python. This means you don’t need to be a web development expert to create powerful, interactive dashboards and tools — perfect for quickly testing and comparing your LLMs!
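If you have never used Streamlit, the snippet below gives a feel for how little code a side-by-side layout takes. It is a minimal, illustrative stand-in rather than the lab application itself; the file name app.py and the placeholder responses are assumptions for illustration only.
# app.py - a minimal, illustrative Streamlit layout (run with: streamlit run app.py)
import streamlit as st

st.title("LLM comparison playground")
prompt = st.text_area("Enter a prompt")

if st.button("Compare"):
    # Two columns let you show each model's answer side-by-side.
    col_a, col_b = st.columns(2)
    col_a.subheader("Model A")
    col_a.write(f"(Model A's response to: {prompt})")
    col_b.subheader("Model B")
    col_b.write(f"(Model B's response to: {prompt})")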
To develop this tool, we will follow these two steps:
- Model Deployment: Deploy Anthropic Claude 3.7 Sonnet and Meta Llama 3.3.
- Application Execution: Run the application code in a Vertex AI Workbench Notebook.
To deploy the Claude model, follow these steps:
1. Navigate to Model Garden.
2. Search for “Claude 3.7 Sonnet.”
3. Enable the Claude 3.7 Sonnet API using the following information:
- Business name: Algowave
- Business Website: https://www.algowave.com
- Contact email address: support@algowave.com
- Where is your Business headquartered: United States of America
- Industry: Telecommunications
- Who are your intended users for Claude models: Internal employees
- What are your intended use cases for Claude models: Complex reasoning
- Additional requirements per Anthropic’s Acceptable Use Policy: No
- Usage fee: Claude 3.7 Sonnet Cache Hit Input Batch Tokens
4. Select the “Open Notebook” button from the model card to launch the Colab notebook. Within the notebook, proceed through the cells, making sure you select Claude 3.7 Sonnet (latest version) as your model and “us-east5” as your region. The notebook provides examples of both unary (non-streaming) and streaming calls for text generation and vision.
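If you prefer to experiment outside the notebook, the Anthropic SDK’s Vertex client can be called directly. The sketch below shows a streaming call, assuming the anthropic package is installed with Vertex support (pip install "anthropic[vertex]") and that your-project-id is replaced with your own project; it illustrates the pattern rather than reproducing the notebook’s exact code.
from anthropic import AnthropicVertex

# Claude 3.7 Sonnet is enabled in us-east5 in this lab.
client = AnthropicVertex(project_id="your-project-id", region="us-east5")

# Streaming call: tokens are printed as they arrive.
with client.messages.stream(
    model="claude-3-7-sonnet@20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the benefits of streaming responses."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)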
To deploy the Llama model, perform the following steps:
1. Navigate to Model Garden and search for “Llama 3.3”. Agree to the terms and conditions.
2. Select “Llama 3.3 API service” and enable it.
3. Go to Llama 3.3 and click on “Deploy options”.
4. Select “Vertex AI deployment” and provide the following details:
- Resource ID: Llama-3-3-70B-Instruct
- Deployment Settings: Advanced
- Region: us-central1 (Iowa)
- Machine spec: vLLM 128K context (4 NVIDIA_H100_80GB; a3-highgpu-4g)
- Deployment Policy: Leave empty
- Endpoint access: Public (Shared endpoint)
5. Click the “Deploy” button.
Model deployment and endpoint creation typically take 15–20 minutes. In the meantime, feel free to take a short walk or brew some chai or coffee.
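Once the endpoint shows as active, you can optionally sanity-check it from Python before building the app. The sketch below uses the Vertex AI SDK; the numeric endpoint ID is a placeholder you can copy from the Vertex AI &gt; Online prediction page, and the instance keys (prompt, max_tokens, temperature) assume the default vLLM serving container, so adjust them if your deployment differs.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Replace 1234567890 with the endpoint ID created by the Llama 3.3 deployment.
endpoint = aiplatform.Endpoint("1234567890")

response = endpoint.predict(instances=[{
    "prompt": "Write one sentence about chai.",
    "max_tokens": 64,
    "temperature": 0.5,
}])
print(response.predictions)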
Here’s how to build the tool that calls the model APIs:
1. Set up your environment:
- Ensure a Vertex AI Workbench instance is created.
- Open JupyterLab.
- Launch the Terminal.
2. Clone the application code:
Execute the following commands in the terminal:
cd ~
git clone https://github.com/jwd-cloud/model-garden-lab-app.git
cd model-garden-lab-app
3. Configure Model Parameters:
- For Claude:
– Open `model-garden-lab-app/claude.py`.
– In the `TODO` section, update the code with the following parameters:
model="claude-3–7-sonnet@20250219",
system=system_msg,
messages=messages,
max_tokens=2048,
- For Llama:
– Open `model-garden-lab-app/llama.py`.
– In the `TODO` section, update the code with the following parameters:
model="meta/llama-3.3–70b-instruct-maas",
messages=messages,
stream=True,
temperature=0.5,
max_tokens=2048,
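For context, the Claude parameters above are passed to the Anthropic client’s messages.create call (the same Vertex client shown in the Claude deployment section), while the Llama parameters go to an OpenAI-compatible chat completions call against the Llama 3.3 API service. The sketch below is one way such a call can look, not necessarily the exact code in llama.py; the base_url format follows Google’s OpenAI-compatibility documentation and should be treated as an assumption to verify against the repository.
import google.auth
import google.auth.transport.requests
import openai

# Obtain a short-lived access token for the Vertex AI OpenAI-compatible endpoint.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

region = "us-central1"
client = openai.OpenAI(
    base_url=f"https://{region}-aiplatform.googleapis.com/v1beta1/projects/{project_id}/locations/{region}/endpoints/openapi",
    api_key=credentials.token,
)

messages = [{"role": "user", "content": "Explain retrieval-augmented generation in two sentences."}]
response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct-maas",
    messages=messages,
    stream=True,          # stream tokens back as they are generated
    temperature=0.5,
    max_tokens=2048,
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)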
4. Update API Executor:
- Open `model-garden-lab-app/mg_apis.py`.
- Update the `executor.submit` expressions as shown below:
# TODO Claude Sonnet 3.7
executor.submit(
    get_claude_response,
    models,
    prompt,
    empties
),
# TODO Meta Llama
executor.submit(
    get_llama_response,
    models,
    prompt,
    empties
)
These lines tell our application to submit requests to both the Claude and Llama models simultaneously, allowing us to compare their responses side-by-side.
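To make the fan-out pattern concrete, here is a self-contained sketch of the same idea. The two functions are simple stand-ins for get_claude_response and get_llama_response (the real implementations live in claude.py and llama.py and write into Streamlit placeholders), and the sample values for models, prompt, and empties are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the functions defined in claude.py and llama.py.
def get_claude_response(models, prompt, empties):
    return f"Claude answer to: {prompt}"

def get_llama_response(models, prompt, empties):
    return f"Llama answer to: {prompt}"

models, prompt, empties = ["claude", "llama"], "Explain vector databases.", [None, None]

with ThreadPoolExecutor() as executor:
    futures = [
        executor.submit(get_claude_response, models, prompt, empties),
        executor.submit(get_llama_response, models, prompt, empties),
    ]
    # Both requests run concurrently; result() blocks until each one finishes.
    results = [future.result() for future in futures]

print(results)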
5. Set up and Run the Application:
- Execute these commands in the terminal to configure and launch the application:
# Create a new virtual environment named '.venv' in the current directory.
python3 -m venv .venv
# Activate the newly created virtual environment.
source .venv/bin/activate
# Install all Python packages listed in 'requirements.txt' into the activated virtual environment.
pip install -r requirements.txt
# Set an environment variable 'GOOGLE_CLOUD_PROJECT' to the value of your currently configured Google Cloud project ID.
export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value project)
# Set the default region the application uses for Vertex AI calls.
export GOOGLE_CLOUD_REGION="us-central1"
# Claude 3.7 Sonnet is enabled in us-east5, so the application uses a separate region for the Claude calls.
export GOOGLE_CLOUD_CLAUDE_REGION="us-east5"
# Create a Google Cloud Compute Engine firewall rule. It allows incoming TCP traffic on port 8088 from any IP address (0.0.0.0/0). This is crucial if your Streamlit app, running on port 8088, needs to be accessible from outside the virtual machine or server it's running on.
gcloud compute firewall-rules create allow-port-8088 \
  --allow=tcp:8088 \
  --source-ranges=0.0.0.0/0 \
  --description="Allow TCP 8088 to this machine from any source"
# Run the Streamlit application 'mg_apis.py'.
streamlit run mg_apis.py
6. Access the Application:
- Click the External URL link printed in the terminal to open your application.
Congratulations! You have built an interactive tool to easily compare LLM responses and make informed decisions about which model best suits your needs!
Now, consider how you can further evaluate the model for your specific use case. How can you perform model evaluation based on metrics in Vertex AI? Stay tuned for next week’s blog post to discover the answer.
To delve deeper and explore the advanced features of Vertex AI, the official Google Cloud documentation is your best resource.
Happy learning!
Source Credit: https://medium.com/google-cloud/building-an-llm-evaluation-tool-with-streamlit-and-model-garden-acc89358511c?source=rss—-e52cf94d98af—4