When running mission-critical inference workloads on Vertex AI, relying on a single region is a risk. Whether due to capacity constraints or unexpected regional outages, you need a strategy to fail over seamlessly.
The ideal architecture uses a Global Load Balancer to route traffic to the nearest healthy region. However, wiring the load balancer directly to Vertex AI endpoints presents a unique challenge that standard load balancing features cannot solve.
The Problem: The “Double Rewrite” Dilemma
Vertex AI Endpoints are strictly regional, and their resource IDs are globally unique.
- Region A Endpoint: …/locations/us-central1/endpoints/12345…
- Region B Endpoint: …/locations/us-east4/endpoints/67890…
A standard Global Load Balancer can route traffic based on health, but it cannot dynamically rewrite the URL path to swap ID 12345 for 67890 based on which backend region it selects. While it can modify Host headers, it lacks the logic to handle dynamic path restructuring.
The Solution: The “Smart Router” Pattern
To achieve seamless failover, introduce an intelligent middle layer: deploy a lightweight Cloud Run service in each region to act as a “Smart Router.”
The Traffic Flow:
- Client sends a generic request to the Global Load Balancer (https://predict.example.com/predict).
- Global LB routes the request to the closest healthy Cloud Run instance.
- Cloud Run (The Brain):
  - Authenticates the request.
  - Rewrites the Host header (e.g., us-central1-aiplatform…).
  - Rewrites the URL path with the specific Endpoint ID for that region.
  - Proxies the traffic privately to Vertex AI via Private Service Connect (PSC).
Failover: If the local Vertex endpoint fails (e.g., HTTP 503), the Cloud Run instance catches the error and immediately retries against the remote region’s PSC endpoint.
Implementation Guide
A complete working example with a dummy model is available at https://github.com/bernieongewe/vertex-ai-multi-regional-inference.
Configure Private Service Connect (PSC) with Global Access
Create a private connection from your VPC to Vertex AI. You must enable Global Access so the Cloud Run router in us-central1 can reach the endpoint in us-east4 during a failover event.
# Repeat for both regions
gcloud compute forwarding-rules create rule-psc-${REGION} \
  --region=${REGION} --network=default \
  --address=ip-psc-${REGION} \
  --target-service-attachment=${VERTEX_SERVICE_ATTACHMENT} \
  --allow-psc-global-access  # Critical for cross-region failover
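The forwarding rule references a reserved internal address (ip-psc-${REGION}). If you have not created it yet, here is a minimal sketch, assuming the default subnet (omit --addresses to let GCP auto-allocate an IP from the subnet):

# Reserve the internal IP used by the forwarding rule above
gcloud compute addresses create ip-psc-${REGION} \
  --region=${REGION} \
  --subnet=default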
Deploy the “Smart Router” (Cloud Run)
This Python service implements the logic the Load Balancer cannot: it handles the protocol, authentication, and path rewriting.
main.py
import os
import requests
import google.auth
from google.auth.transport.requests import Request
from flask import Flask, request, jsonify
import urllib3

# Suppress SSL warnings (expected because we call a numeric IP directly)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

app = Flask(__name__)

def get_token():
    credentials, _ = google.auth.default()
    credentials.refresh(Request())
    return credentials.token

def call_vertex(ip, region, endpoint_id, payload):
    # 1. REWRITE URL: Inject the correct Endpoint ID for this region
    # Note: Use the PRODUCER Project ID (where the endpoint lives)
    project_id = os.environ.get("VERTEX_PROJECT_ID")
    url = f"https://{ip}/v1/projects/{project_id}/locations/{region}/endpoints/{endpoint_id}:predict"

    # 2. REWRITE HEADERS: Set Host and Authorization
    headers = {
        "Content-Type": "application/json",
        "Host": f"{region}-aiplatform.googleapis.com",
        "Authorization": f"Bearer {get_token()}",
    }

    # 3. PROTOCOL: Must use HTTPS with verify=False (IP doesn't match cert)
    return requests.post(url, json=payload, headers=headers, timeout=5, verify=False)

@app.route('/', defaults={'path': ''}, methods=['POST'])
@app.route('/<path:path>', methods=['POST'])
def router(path):
    payload = request.get_json()

    # Try the local region first
    p_ip, p_reg, p_id = os.environ.get("PRIMARY_IP"), os.environ.get("PRIMARY_REGION"), os.environ.get("PRIMARY_ID")
    try:
        resp = call_vertex(p_ip, p_reg, p_id, payload)
        if resp.status_code == 200:
            return jsonify(resp.json()), 200
        # A non-200 response (e.g., HTTP 503) falls through to the backup region
    except Exception as e:
        print(f"Primary failed: {e}")

    # Failover to the backup region
    b_ip, b_reg, b_id = os.environ.get("BACKUP_IP"), os.environ.get("BACKUP_REGION"), os.environ.get("BACKUP_ID")
    print(f"Failing over to {b_reg}...")
    try:
        resp = call_vertex(b_ip, b_reg, b_id, payload)
        return jsonify(resp.json()), resp.status_code
    except Exception:
        return jsonify({"error": "All regions failed"}), 500
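The service above assumes roughly these dependencies; a minimal requirements.txt sketch (gunicorn is the conventional Cloud Run entrypoint for Flask apps, but any WSGI server works):

requirements.txt
flask
requests
google-auth
gunicorn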
Network Configuration for Cloud Run
Cloud Run services cannot reach private VPC addresses by default. To reach your internal PSC IPs, you must attach them to your VPC using Direct VPC Egress.
gcloud run deploy vertex-router-central \
  --region=us-central1 \
  --network=default \
  --subnet=default \
  --vpc-egress=private-ranges-only \
  ...
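The trailing ... elides the remaining flags; among them you must pass the environment variables the router reads. A fuller sketch of the deploy command (the IP values and producer project ID are illustrative, and the endpoint IDs reuse the earlier example):

# --source builds the container from the directory containing main.py
gcloud run deploy vertex-router-central \
  --source=. \
  --region=us-central1 \
  --network=default --subnet=default \
  --vpc-egress=private-ranges-only \
  --set-env-vars=VERTEX_PROJECT_ID=my-producer-project,PRIMARY_IP=10.0.0.5,PRIMARY_REGION=us-central1,PRIMARY_ID=12345,BACKUP_IP=10.1.0.5,BACKUP_REGION=us-east4,BACKUP_ID=67890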
The Load Balancer Setup
Use Serverless Network Endpoint Groups (NEGs) to route traffic to Cloud Run.
Critical Caveat: When creating the Backend Service, do not specify --protocol or --port-name. Serverless NEGs are incompatible with named ports and will throw an error if either flag is set.
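A minimal sketch of the NEG and backend service wiring (resource names are illustrative; repeat the NEG and add-backend steps for each region):

# Serverless NEG pointing at the regional Cloud Run router
gcloud compute network-endpoint-groups create neg-router-central \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=vertex-router-central

# Backend Service: note the absence of --protocol and --port-name
gcloud compute backend-services create vertex-router-backend \
  --global --load-balancing-scheme=EXTERNAL_MANAGED

gcloud compute backend-services add-backend vertex-router-backend \
  --global \
  --network-endpoint-group=neg-router-central \
  --network-endpoint-group-region=us-central1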
Inference Via The Load Balancer
Once your architecture is deployed, sending prediction requests requires a slight shift from standard Vertex AI workflows. Because the Global Load Balancer (GLB) serves as a generic entry point (e.g., https://predict.example.com/predict) and relies on the "Smart Router" to inject the specific regional Endpoint IDs, you cannot use the standard Vertex AI SDK endpoint.predict() method. The SDK is designed to automatically construct the full, specific path to a resource (e.g., …/locations/us-central1/endpoints/12345…), which bypasses the generic routing logic we've built. Instead, you must send requests using a standard HTTP client like curl or Python’s requests library, targeting the GLB's IP or domain directly. This allows the load balancer to receive the generic request and hand it off to the Cloud Run proxy for dynamic path rewriting.
# Example Curl Request to the Load Balancer
curl -k -X POST "https://<LOAD_BALANCER_IP>/predict" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{ "instances": [[...]] }'
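If you prefer Python’s requests library to curl, here is a minimal sketch of the same call (the load balancer address is a placeholder, and the instances payload depends on your model’s input shape):

# Example Python request to the Load Balancer
import subprocess
import requests

# Fetch an identity token the same way the curl example does
token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

resp = requests.post(
    "https://<LOAD_BALANCER_IP>/predict",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    json={"instances": [[0.1, 0.2, 0.3]]},  # illustrative payload
    verify=False,  # equivalent of curl -k when targeting a raw IP
    timeout=10,
)
print(resp.status_code, resp.json())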
Other Caveats
If you are building this, watch out for these specific traps:
- Protocol Mismatch: Even on private internal IPs, Vertex AI endpoints require HTTPS, not HTTP. Your code must handle the SSL handshake (ignoring validation errors since the IP won’t match the certificate).
- Authentication is Mandatory: Being “private” doesn’t mean “unsecured.” You must inject a valid Google IAM Bearer token in the request header.
- Project Context: If your Endpoints live in a different project (e.g., a shared VPC host project), ensure your Cloud Run code uses the Producer Project ID to construct the URL, not the Consumer Project ID.
- Organization Policies: If you receive a 403 Forbidden accessing Cloud Run, your Org Policy likely blocks allUsers. You must either authenticate your client request or use Identity-Aware Proxy (IAP); a sketch of an invoker binding follows below.
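For that last caveat, granting roles/run.invoker to a specific identity instead of allUsers keeps the service private while still admitting authenticated clients. A minimal sketch (the service account name is illustrative):

gcloud run services add-iam-policy-binding vertex-router-central \
  --region=us-central1 \
  --member="serviceAccount:client-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/run.invoker"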
