We know that every AI model has its own unique strengths for coding, reasoning, or creative tasks. Imagine being able to switch between them seamlessly from your favorite development tools. The reality often involves juggling authentication methods, endpoints, and configuration for each new model you want to try.
The good news? Platforms like Google Cloud’s Vertex AI Model Garden are making this vision a reality. They offer fully managed access to a diverse collection of models, including powerful open models like Qwen3 Coder, all through standardized, OpenAI-compatible endpoints. This means you can leverage the same familiar API patterns across different models, dramatically simplifying integration.
How do you connect your local development tools to these endpoints? That’s where LiteLLM comes in. This open-source tool runs a lightweight local proxy that acts as a universal translator between your development environment and the Vertex AI APIs. By routing your AI requests through LiteLLM, you gain the ability to seamlessly switch between different models in the Model Garden.
This step-by-step guide will show how to make this happen on macOS, but the principles can be easily adapted to other environments like Windows or Linux.
Part 1: Getting the Foundation Right
First, prepare your local machine for Google Cloud. On macOS, install the gcloud CLI using Homebrew: brew install --cask google-cloud-sdk
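As a quick sanity check, gcloud --version should print the installed components, and you can optionally set a default project so later commands can omit the Project ID:
gcloud --version
gcloud config set project your-gcp-project-id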
Next, authenticate your local environment. This command will open a browser for you to log in: gcloud auth application-default login
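To confirm that Application Default Credentials were stored correctly, the following should print an access token rather than an error:
gcloud auth application-default print-access-token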
With authentication complete, you need to grant your user account the necessary permissions to call the Vertex AI endpoint. Replace your-gcp-project-id with your actual Project ID and your-email@example.com with the email you used to authenticate:
gcloud projects add-iam-policy-binding your-gcp-project-id --member="user:your-email@example.com" --role="roles/aiplatform.user"
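If you want to double-check that the role was granted, this standard gcloud IAM listing pattern prints the roles bound to your user; you should see roles/aiplatform.user in the output:
gcloud projects get-iam-policy your-gcp-project-id \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:your-email@example.com" \
  --format="table(bindings.role)"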
Next, enable the Vertex AI API for your project by running: gcloud services enable aiplatform.googleapis.com
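A quick way to confirm the API is now enabled:
gcloud services list --enabled | grep aiplatform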
Finally, you must enable the specific model you intend to use. For this guide, go to the Qwen3 Coder model card and click the “Enable” button.
Locally, let’s install the uv package manager and Redis. Redis isn’t required for a basic workflow, but we’re using it here for performance and to persist the cache across sessions. Install both with Homebrew, then start the Redis service:
brew install uv redis
brew services start redis
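If Redis started correctly, a ping should answer with PONG:
redis-cli ping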
Part 2: Configuring and Launching the LiteLLM Proxy
This is where it all comes together. We’ll create the config.yaml file that tells LiteLLM how to connect to our Vertex AI model.
2.1 Creating the config.yaml
This file defines the models the proxy serves and controls its behavior. First, create the directory for LiteLLM’s configuration if it doesn’t already exist, then create the file:
mkdir -p ~/.config/litellm/
touch ~/.config/litellm/config.yaml
Copy and paste this into your ~/.config/litellm/config.yaml file. Just remember to replace your-gcp-project-id with your actual GCP Project ID.
model_list:
  - model_name: qwen3-coder
    litellm_params:
      model: vertex_ai/qwen/qwen3-coder-480b-a35b-instruct-maas
      vertex_project: your-gcp-project-id
      vertex_location: us-south1
  # You can add more models (after enabling them)
  # - model_name: qwen3-235b
  #   litellm_params:
  #     model: vertex_ai/qwen/qwen3-235b-a22b-instruct-2507-maas
  #     vertex_project: your-gcp-project-id
  #     vertex_location: us-south1

litellm_settings:
  drop_params: true
  cache: true
  cache_params:
    type: "redis"
    host: "localhost"
    port: 6379
  prompt_cache:
    type: "redis"
    host: "localhost"
    port: 6379

general_settings:
  master_key: sk-1234
2.2 What’s in the Config?
The model_list section defines all the models your proxy can handle.
- model_name: This is the friendly name you’ll use in your client app (like Cline).
- litellm_params.model: This is the specific identifier for the model provider. The vertex_ai/ prefix tells LiteLLM to use its Vertex AI integration.
- vertex_project & vertex_location: These point to the right Google Cloud project and region.
The litellm_settings section configures core proxy behaviors.
- drop_params: When set to true, it automatically removes any parameters that are not supported by the target model. You can read more in the official documentation.
- cache: Setting this to true enables response caching.
- cache_params: This tells LiteLLM to use Redis to store responses. If you send the exact same request again, it will return the cached response instantly.
The general_settings section has global settings for the proxy.
- master_key: This is a simple API key to authenticate with your local proxy. While sk-1234 is fine for local testing, you should use a securely generated key in any shared or production environment.
The prompt_cache is a feature for reducing cost and latency. It caches the static parts of your prompts (like long system messages) so they don’t have to be re-sent with every request. This is different from the general response cache, which caches the entire completion.
2.3 Running the Proxy for a Test
With the config file ready, let’s fire up the proxy for a test run.
Run this from any directory in your terminal:
uvx --with google-cloud-aiplatform 'litellm[proxy]' --config ~/.config/litellm/config.yaml --detailed_debug
This command uses uvx to run the litellm proxy in an isolated environment, telling it to include the google-cloud-aiplatform and litellm[proxy] packages. The --config flag points to your configuration file, and --detailed_debug provides verbose logs for easier troubleshooting.
You’ll see startup logs in your terminal. Once you see Application startup complete., the proxy is running and listening on http://0.0.0.0:4000.
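Before pointing an editor at it, you can send a quick smoke test from a second terminal. This assumes you kept the default port (4000) and the example master_key (sk-1234) from the config above:
curl http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}]
      }'
If you send the exact same request a second time, the response should come back from the Redis cache, and redis-cli --scan will show the keys LiteLLM has written.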
For a full list of all available options, you can refer to the official Configuration Settings documentation.
Part 3: Integrating with Cline in VS Code
With the proxy running, the last piece is telling your AI coding assistant to use it. While this guide uses Cline for the example, this approach will work with any tool that supports OpenAI-compatible endpoints, such as Roo Code or Continue.dev.
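Before configuring a client, it can help to confirm which model names the proxy exposes. Any OpenAI-compatible tool uses the same endpoint (again assuming the default port and example master key):
curl -s http://0.0.0.0:4000/v1/models -H "Authorization: Bearer sk-1234"
The id values returned here are what you’ll enter as the model name in your client.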
3.1 Configuring the Cline VS Code Extension
- Install Cline: In VS Code, go to Extensions (Cmd+Shift+X), search for "Cline", and install it.
- Open Cline Settings: Click the gear icon (⚙️) in the Cline sidebar.
- Configure the Provider:
  - API Provider: Choose LiteLLM.
  - Base URL: Enter http://0.0.0.0:4000.
  - API Key: Paste the master_key from your config.yaml (e.g., sk-1234).
  - Model: Enter the model_name from your config (qwen3-coder).
Now, Cline will send all its AI requests through your local proxy.
Part 4: Making It Persistent on macOS
Running the proxy in a terminal is fine for testing, but you probably don’t want to do that every day. Let’s make it a background service. On macOS, we’ll use launchd, the standard way to manage daemons and agents. For Windows users, an equivalent is NSSM (the Non-Sucking Service Manager), which can be used to run the same uvx command as a Windows service.
4.1 Creating a launchd User Agent
Create a new property list (.plist) file in your user’s LaunchAgents directory:
touch ~/Library/LaunchAgents/ai.litellm.proxy.plist
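Also create the log directory the agent will write to, since launchd won’t create missing parent directories for its log paths:
mkdir -p ~/Library/Logs/litellm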
Open this file and paste in the following XML. You’ll need to replace /path/to/your/uvx with the absolute path on your system (find it with which uvx) and /path/to/your/home with your home directory path (e.g., /Users/yourusername).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>ai.litellm.proxy</string>
    <key>ProgramArguments</key>
    <array>
        <string>/path/to/your/uvx</string>
        <string>--with</string>
        <string>google-cloud-aiplatform</string>
        <string>litellm[proxy]</string>
        <string>--config</string>
        <string>/path/to/your/home/.config/litellm/config.yaml</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/path/to/your/home/Library/Logs/litellm/litellm.stdout.log</string>
    <key>StandardErrorPath</key>
    <string>/path/to/your/home/Library/Logs/litellm/litellm.stderr.log</string>
    <key>WorkingDirectory</key>
    <string>/path/to/your/home</string>
</dict>
</plist>
4.2 Deploy and Manage the Agent
The RunAtLoad key in the .plist file ensures this service will automatically start every time you log in to your Mac. Before you load it for the first time, make sure you’ve stopped the proxy if it’s still running from your earlier test (Ctrl+C in the terminal).
Here are the commands to manage it:
- Load: launchctl load ~/Library/LaunchAgents/ai.litellm.proxy.plist
- Unload: launchctl unload ~/Library/LaunchAgents/ai.litellm.proxy.plist
- Restart: To apply changes from your config.yaml, unload and then load the agent again. You can chain the commands:
launchctl unload ~/Library/LaunchAgents/ai.litellm.proxy.plist && \
launchctl load ~/Library/LaunchAgents/ai.litellm.proxy.plist
- Verify: launchctl list | grep litellm
- Troubleshoot: Check the log files in ~/Library/Logs/litellm/. You can tail them to see live output:
tail -f ~/Library/Logs/litellm/litellm.stdout.log \
  ~/Library/Logs/litellm/litellm.stderr.log
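Once the agent is loaded, you can also confirm the proxy is answering without digging through logs. Recent LiteLLM versions expose a liveness endpoint; if yours doesn’t, the /v1/models request from Part 3 works just as well:
curl -s http://0.0.0.0:4000/health/liveliness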
From Local Playground to Production Architecture
You now have instant access to the rich ecosystem of models in the Vertex AI Model Garden, and you can switch between specialized models like Qwen3 Coder for development tasks and others for creative work.
This local setup is just the beginning. You can evolve it into a production-ready system by replacing your local Redis with a managed service like Google Cloud Memorystore for Redis for a shared, persistent cache. Then, containerize your LiteLLM proxy and deploy it with Cloud Run to create a centralized, scalable AI gateway for your entire team.
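As a rough sketch of what that migration could look like (the instance name, region, VPC connector, Redis IP, and container image tag below are placeholders to adapt, and you’d still need to bake your config.yaml into the image or supply it another way):
# Create a managed Redis instance for the shared cache.
gcloud redis instances create litellm-cache --size=1 --region=us-central1

# Deploy the LiteLLM proxy on Cloud Run, reaching Memorystore over a
# Serverless VPC Access connector.
gcloud run deploy litellm-gateway \
  --image=ghcr.io/berriai/litellm:main-latest \
  --region=us-central1 \
  --vpc-connector=your-vpc-connector \
  --set-env-vars=REDIS_HOST=10.0.0.3,REDIS_PORT=6379  # use your Memorystore IP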
You’ve built a flexible architecture with a clear path from experimentation to production. The tools are in place, the path is clear, and the AI ecosystem is waiting. Share what you’ve built with me on LinkedIn, X, or Bluesky!
