

Google Cloud’s Model Garden streamlines the process of deploying various open-source models — including those from Anthropic, Meta, and Hugging Face — into production-ready, scalable APIs.
However, challenges arise when models need special preprocessing, have unconventional output formats, or require unique logic not found in standard serving containers. These issues can significantly impede project progress.
The solution is to gain more control over the inference pipeline. This is where Google Cloud’s Vertex AI shines, offering a powerful combination of pre-built Hugging Face containers and the flexibility of custom handlers. By writing a simple Python script, you can dictate exactly how your model loads, processes requests, and formats predictions.
In this guide, I’ll walk you through the entire process, from development to deployment. You will learn how to:
- Understand and build a custom inference handler for a Hugging Face model.
- Test your model and handler locally to speed up debugging.
- Package your model and custom code for deployment.
- Deploy the model to a scalable Vertex AI Endpoint with GPU acceleration.
- Get live predictions from your newly created API.
We will use PLLuM, a powerful Polish language model, as our practical example, but the techniques you learn here are applicable to countless other PyTorch-based models.
This model is a great example for a few reasons:
- It’s increasingly popular in Poland.
- It’s not directly accessible through Model Garden.
- It won’t work in a standard Hugging Face container, because it uses a custom tokenizer that requires extra encoding/decoding logic before calling its generation function.
To follow along, you’ll need a Google Cloud project and the right tools and permissions. You can run the code in your favorite Python development environment (e.g. locally, a Cloud Workstation, or Google Colab).
- Google Cloud Project: Ensure you have a Google Cloud project with the Vertex AI and Artifact Registry APIs enabled.
- Cloud Storage Bucket: Create a new Cloud Storage bucket to store your model files. This will act as the staging area for Vertex AI.
- Permissions: Make sure your account has the following IAM roles:
– Vertex AI User (roles/aiplatform.user)
– Artifact Registry Reader (roles/artifactregistry.reader)
– Storage Object Admin (roles/storage.objectAdmin)
- Docker (optional but recommended): To test your model container locally before deploying to the cloud, you will need to have Docker installed and running.
- Required Libraries: Install the necessary Python libraries:
pip install --upgrade --user --quiet 'torch' 'torchvision' 'torchaudio'
pip install --upgrade --user --quiet 'transformers' 'accelerate>=0.26.0'
pip install --upgrade --user --quiet 'google-cloud-aiplatform[prediction]' 'crcmod' 'etils'
Once your environment is configured, import libraries and initialize the Vertex AI SDK. You should put this code in two files: test_local.py (for local testing) and deploy.py (for cloud deployment):
import json
import torch
import vertexai
from etils import epath
from google.cloud import aiplatform
from google.cloud.aiplatform import Endpoint, Model
from google.cloud.aiplatform.prediction import LocalModel
# Set your project, location, and bucket details
PROJECT_ID = "your-gcp-project-id"
LOCATION = "your-gcp-location" # example: "us-central1"
BUCKET_URI = "gs://your-gcs-bucket-name"
# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)
When you deploy a model on Vertex AI using a pre-built container, the container makes assumptions about how to load the model and process predictions. For many standard models, this works perfectly.
However, a custom handler gives you control over this process. It’s a Python script, named handler.py, that you provide alongside your model files. The Vertex AI serving container will automatically find and use this script.
Your handler.py needs to implement an EndpointHandler class, which must define two key methods:
- __init__: This method is called once when the model is loaded. Its job is to load your model and any other necessary assets (like a tokenizer) from the model directory into memory.
- __call__: This method is called for every prediction request. It contains the core inference logic:
– Pre-processing: Preparing the raw input data (e.g., tokenizing a prompt).
– Prediction: Running the processed input through the model.
– Post-processing: Formatting the model’s output into a user-friendly response.
By implementing this simple class, you can serve virtually any model, no matter how custom its requirements are.
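In skeleton form, the contract looks like this (a bare outline for orientation only; the full PLLuM implementation follows below):
from typing import Any, Dict, List

class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # Called once at startup: load the model, tokenizer, and any other
        # assets from model_dir into memory.
        ...

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        # Called for every request: pre-process data["instances"], run the
        # model, and return a JSON-serializable response.
        return {"predictions": []}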
Let’s build a handler for the CYFRAGOVPL/PLLuM-12B-chat model. Our goal is to create a simple text-generation endpoint.
First, we need to make sure we have the correct imports. Create a file named handler.py and copy the following lines into it:
from typing import Any, Dict, List
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import base64
from io import BytesIO
import logging
import sys
It’s good practice to set up logging to stdout so the logs can be accessed through Google Cloud’s observability services:
# Configure logging to output to stdout
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger('huggingface_inference_toolkit')
Then we define the __init__ method, which is responsible for loading the model and its tokenizer:
class EndpointHandler:
    def __init__(
        self,
        model_dir: str = '/opt/huggingface/model',
        **kwargs: Any,
    ) -> None:
        self.processor = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            torch_dtype=torch.bfloat16,
            device_map="auto",  # automatically places model layers on available devices
        ).eval()
Lastly, let’s define the inference logic that goes inside our handler’s __call__ method. This involves taking a prompt, tokenizing it, generating a response with the model, and decoding the output:
    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        logger.info("Processing new request")
        predictions = []
        for instance in data["instances"]:
            logger.info(f"Processing instance: {instance.get('prompt', '')[:100]}...")
            if "prompt" not in instance:
                error_msg = "Missing prompt in the request body"
                logger.info(error_msg)
                return {"error": error_msg}
            inputs = self.processor(
                instance["prompt"], return_tensors="pt", return_token_type_ids=False
            ).to(self.model.device)
            input_len = inputs["input_ids"].shape[-1]
            logger.info(f"Input processed, length: {input_len}")
            with torch.inference_mode():
                # Per-instance generation settings, falling back to sensible defaults.
                generation_kwargs = instance.get(
                    "generation_kwargs", {
                        "max_new_tokens": 100,
                        "do_sample": False,
                        "top_k": 50,
                        "top_p": 0.9,
                        "temperature": 0.7
                    }
                )
                logger.info(f"Generation kwargs: {generation_kwargs}")
                generation = self.model.generate(**inputs, **generation_kwargs)
                generation = generation[0][input_len:]
                response = self.processor.decode(generation, skip_special_tokens=True)
                logger.info(f"Generated response: {response[:100]}...")
                predictions.append(response)
        logger.info(f"Successfully processed {len(predictions)} instances")
        return {"predictions": predictions}
Note that the __call__ method receives a dictionary and should be written to handle multiple instances, which allows users to send several prompts in a single request.
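For illustration, a request carrying two prompts and the shape of the corresponding response look like this (the values are placeholders):
# Example request body passed to __call__ - two prompts in a single call.
example_request = {
    "instances": [
        {"prompt": "Napisz krótki wiersz o wiośnie."},
        {"prompt": "Napisz haiku o morzu.", "generation_kwargs": {"max_new_tokens": 30}},
    ]
}
# The handler returns one prediction per instance, in the same order:
# {"predictions": ["<poem about spring>", "<haiku about the sea>"]}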
Click here to download the full handler.py code.
Vertex AI needs all your model artifacts — the model weights, configuration, and our new handler.py — to be in one location on Google Cloud Storage.
- Create a local directory that contains the model files and your handler (one way to assemble it is sketched after the upload command below).
- Upload the entire directory to your GCS bucket.
gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -r /path/to/your/local/model_directory/* gs://your-gcs-bucket-name/model/
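If you don’t already have the weights locally, one way to assemble the directory from the first step is to pull them from the Hugging Face Hub and drop your handler next to them. The following is only a sketch that assumes the optional huggingface_hub package is installed; the local path is illustrative:
import shutil
from huggingface_hub import snapshot_download

local_model_dir = "/path/to/your/local/model_directory"  # illustrative path

# Download the PLLuM weights, config, and tokenizer files from the Hugging Face Hub.
snapshot_download(repo_id="CYFRAGOVPL/PLLuM-12B-chat", local_dir=local_model_dir)

# Place the custom handler next to the model artifacts.
shutil.copy("handler.py", local_model_dir)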
Deploying a model to a GPU-accelerated endpoint can take 15–20 minutes. To avoid waiting that long just to find a bug in your code, you can use the Vertex AI SDK’s LocalModel feature to simulate the cloud environment on your local machine.
This spins up the official Hugging Face serving container using Docker and loads your model and handler from a local directory, allowing for rapid testing.
1. We define a helper function by adding the following lines to the test_local.py file:
def get_cuda_device_names():
    """A function to get the list of NVIDIA GPUs."""
    if not torch.cuda.is_available():
        return None
    return [str(i) for i in range(torch.cuda.device_count())]
2. Create LocalModel instance:
local_pllum_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311",
    serving_container_ports=[5000],
)
3. Create a LocalEndpoint instance:
model_uri = epath.Path(BUCKET_URI) / "model"
local_pllum_endpoint = local_pllum_model.deploy_to_local_endpoint(
    artifact_uri=str(model_uri), gpu_device_ids=get_cuda_device_names()
)
local_pllum_endpoint.serve()
4. Generate predictions:
# EN:"Write a short poem about spring."
prompt = "Napisz krótki wiersz o wiośnie." # @param {type: "string"}prediction_request = {
"instances": [
{
"prompt": prompt,
"generation_kwargs": {"max_new_tokens": 50, "do_sample": True},
}
]
}
vertex_prediction_request = json.dumps(prediction_request)
vertex_prediction_response = local_pllum_endpoint.predict(
    request=vertex_prediction_request, headers={"Content-Type": "application/json"}
)
print(vertex_prediction_response.json()["predictions"])
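If the local prediction fails, the serving container’s logs are usually the quickest way to find the cause; the local endpoint object can print them for you (available in recent google-cloud-aiplatform releases):
# Print the serving container's logs for debugging.
local_pllum_endpoint.print_container_logs(show_all=True)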
Click here to download the full test_local.py code.
If the local prediction succeeds, you can be much more confident that your cloud deployment will work correctly.
With our model and handler tested and uploaded to GCS, we’re ready for the final two steps.
Step 1: Register the Model in the Vertex AI Model Registry
First, we register the model, telling Vertex AI where to find the artifacts and which container to use. Add the following code to the deploy.py file:
model_uri = epath.Path(BUCKET_URI) / "model"
model = Model.upload(
    display_name="cyfragovpl--pllum-12b-it",
    artifact_uri=str(model_uri),
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311",
    serving_container_ports=[8080],
)
model.wait()
Step 2: Deploy the Model to an Endpoint
Next, we deploy the registered model to an endpoint. This is where Vertex AI provisions the physical hardware (like an NVIDIA L4 GPU) and makes your model available to receive prediction requests.
deployed_model = model.deploy(
    endpoint=Endpoint.create(display_name="cyfragovpl--pllum-12b-it-endpoint"),
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
This step will take about 15–25 minutes. Once complete, you will have a fully managed, scalable HTTP endpoint for your model.
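If you come back to this endpoint from a new session or a different script, there is no need to redeploy; you can reattach to it by its resource ID (a short sketch; the ID below is a placeholder you can copy from the Cloud Console or from the endpoint’s resource_name):
# Reattach to an existing endpoint in a later session (the ID is a placeholder).
existing_endpoint = Endpoint("1234567890123456789")
prediction = existing_endpoint.predict(instances=[{"prompt": "Cześć!"}])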
Now for the fun part. You can send requests to your endpoint using the Vertex AI SDK, a simple cURL command, or any HTTP client.
Using the Vertex AI Python SDK is the most straightforward way:
# EN:"Write a short poem about spring."
prompt = "Napisz krótki wiersz o wiośnie." # @param {type: "string"}
prediction_request = {
"instances": [
{
"prompt": prompt,
"generation_kwargs": {"max_new_tokens": 50, "do_sample": True},
}
]
}prediction = deployed_model.predict(instances=prediction_request["instances"])
print(prediction)
Click here to download the full deploy.py code.
The output will be a prediction object containing the generated text from the PLLuM model, served live from your own custom API endpoint 🎉
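If you prefer a plain HTTP client over the SDK, the same endpoint is reachable through the Vertex AI REST API. Below is a minimal sketch using the requests library and Application Default Credentials; ENDPOINT_ID is a placeholder for your endpoint’s numeric ID, and PROJECT_ID, LOCATION, and prediction_request are the values defined earlier:
import google.auth
import google.auth.transport.requests
import requests

ENDPOINT_ID = "1234567890123456789"  # placeholder: your endpoint's numeric ID

# Obtain an access token from Application Default Credentials.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:predict"
)
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=prediction_request,  # the same {"instances": [...]} body as above
)
print(response.json()["predictions"])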
You have successfully taken an open-source Hugging Face model with custom requirements and transformed it into a robust, scalable API on Google Cloud. You now have the power to productionize a vast range of models by creating a simple custom handler that tailors the inference process to your exact needs.
Thank you for reading! I hope this guide helps you bring your own creative AI projects to life on Google Cloud. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.
I’m always eager to connect with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky. Your feedback is incredibly valuable, so please don’t hesitate to leave a comment with your thoughts, questions, or your own experiences deploying models on Vertex AI!
Source Credit: https://medium.com/google-cloud/step-by-step-serving-pytorch-models-with-a-custom-handler-on-vertex-ai-5ada1d01c534?source=rss—-e52cf94d98af—4