
Sometimes a small model will do the trick, and you don’t always need a GPU to run it. There are good reasons to run “utility” tasks directly on CPUs: maybe a GPU simply isn’t available, maybe you want to keep your data local, or maybe you just want to keep your architecture simple.
This is where ONNX Runtime GenAI comes in. It enables you to run models where you want: GPU if you have it, CPU if you don’t. All without changing a line of code. In this post, I’ll show how it all works, including hosting your model on Google Cloud Run. You can find all the sample code in the onnx-inference repository.
Where it started: portable ML models
It started back in September 2017. Industry leaders realized that the fragmentation of ML tooling was hurting everyone, so ONNX was launched as a shared standard. The pitch was simple: a “universal translator” for models. You train in your preferred framework, export to a common format, and run the model efficiently on various hardware targets.
The industry didn’t waste time. By the end of that year, more companies were on board, and in 2018, Microsoft open-sourced ONNX Runtime, the engine designed to actually run these models efficiently on any hardware. It graduated to a Linux Foundation AI project in 2019, cementing its place as an open standard.
Where it’s going: portable LLMs
When the LLM boom hit around 2023, ONNX faced a new challenge. Traditional models are stateless: input goes in, prediction comes out. LLMs are different. They’re chatty. They have memory. They generate text token by token, needing to manage a “KV cache” to remember the conversation context.
Standard ONNX runtimes weren’t built for that loop.
So in 2024, the community released onnxruntime-genai. It wraps the core runtime with all the specific logic LLMs need: tokenization, generation loops, search strategies like beam search, and state management.
Fast forward to 2026, and we now have a library of pre-quantized ONNX models on Hugging Face. You can pull a model off the shelf and run it without any training or format conversion.
Using the library
The benefit of onnxruntime-genai is that it handles generation for you. If you tried to do this with raw ONNX Runtime, you’d be writing manual loops to feed output tokens back as inputs.
Here’s what it looks like now:
import onnxruntime_genai as og

# Load a model (this path works for CPU, GPU, or mobile automatically)
model = og.Model('path/to/model')
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configure how you want to search
params = og.GeneratorParams(model)
params.set_search_options(max_length=256, batch_size=1)

# The generation loop
prompt = "Explain ONNX in one sentence:"
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

while not generator.is_done():
    generator.generate_next_token()
    # Decode and print each token as it is produced
    token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(token), end='', flush=True)
It’s doing a lot of heavy lifting here: handling the KV cache, applying your search strategy (greedy, top-p, etc.), and routing operations to the best available hardware (CUDA, CoreML, or CPU).
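If you want sampling rather than greedy decoding, the same set_search_options call accepts the usual knobs. A minimal sketch with illustrative values (tune them for your model):
# Nucleus sampling instead of greedy decoding (values are illustrative)
params.set_search_options(
    do_sample=True,
    top_p=0.9,
    top_k=50,
    temperature=0.7,
    max_length=256
)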
Hardware, models, and quantization
A few things have changed since the early days of LLMs. Processors have gotten faster, and models are incredibly capable at much smaller sizes.
Then there’s quantization. We’re no longer bound to running models at full 32-bit precision. Techniques like INT4 quantization crunch weights down significantly with surprisingly little impact on accuracy.
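To put rough numbers on it: a 600M-parameter model stored in FP32 needs about 2.4 GB for weights alone (4 bytes per parameter), while the same weights at 4 bits land around 0.3 GB, plus a little overhead for quantization scales. That gap is what makes CPU-only and serverless deployments realistic.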
I’ve tested quite a few models with onnx-inference. Note that these small models recognize structure, but are not well-suited for knowledge-intensive tasks.
For very simple tasks, you can get away with a tiny model like SmolLM2-135M. It’s great for basic completion or classification.
For more complex tasks, you’ll want a larger model. Qwen3-0.6B gives you much more capability for a few hundred million more parameters.
As you consider models beyond 500M parameters on CPU, you’ll need to balance the context window, max tokens, and the latency you can tolerate.
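If you want to experiment with one of these locally, you can pull a pre-quantized checkpoint with huggingface_hub. A minimal sketch, assuming the repo ships the files the GenAI runtime expects (see the troubleshooting notes later in the post if genai_config.json is missing):
from huggingface_hub import snapshot_download
import onnxruntime_genai as og

# Download a pre-quantized ONNX checkpoint from the Hub
model_dir = snapshot_download("onnx-community/SmolLM2-135M-Instruct-ONNX")

# Loading the folder directly only works if it contains a genai_config.json
model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)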
Building a portable server
Wrapping the inference logic in a lightweight server makes it universally accessible. It becomes easy to call from any service regardless of language. The code in my repo provides a FastAPI server that does exactly this.
The main class is OnnxTextGenerator, which handles the inference logic:
from inference import OnnxTextGenerator

# Auto-detects hardware
generator = OnnxTextGenerator()

# Simple run
result = generator.generate(
    prompt="Explain quantum computing like I'm five:",
    max_new_tokens=100,
    temperature=0.7
)
print(result['generated_text'])
For real-time apps, you can’t wait for the whole answer. You can stream it instead:
for chunk, metadata in generator.stream_generate(
    prompt="Write a haiku about Docker:",
    max_new_tokens=50,
    temperature=0.8
):
    print(chunk, end='', flush=True)
Each function is exposed as an endpoint in FastAPI:
@app.post("/generate")
async def generate(request: GenerateRequest):
result = generator.generate(
prompt=request.prompt,
max_new_tokens=request.max_new_tokens,
temperature=request.temperature
)
return {
"generated_text": result["generated_text"],
"tokens_generated": result["tokens_generated"],
"finish_reason": result["finish_reason"]
}
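The streaming generator pairs naturally with FastAPI’s StreamingResponse. Here’s a sketch of what a streaming endpoint could look like; the /generate/stream path and the exact wiring are my assumptions, not necessarily what the repo exposes:
from fastapi.responses import StreamingResponse

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    def token_stream():
        # Yield text chunks as the model produces them
        for chunk, _metadata in generator.stream_generate(
            prompt=request.prompt,
            max_new_tokens=request.max_new_tokens,
            temperature=request.temperature
        ):
            yield chunk
    return StreamingResponse(token_stream(), media_type="text/plain")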
I wanted something that “just worked,” so the initializer automatically hunts for the best hardware execution provider: CUDA (NVIDIA GPUs) → CoreML (Apple Silicon) → CPU (Universal fallback).
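The repo has its own detection logic, but the general idea looks something like the sketch below, which uses the base onnxruntime package; pick_execution_provider is a hypothetical helper, not the actual code:
import onnxruntime as ort

def pick_execution_provider() -> str:
    """Return the best available ONNX Runtime execution provider (hypothetical helper)."""
    available = ort.get_available_providers()
    for candidate in ("CUDAExecutionProvider", "CoreMLExecutionProvider"):
        if candidate in available:
            return candidate
    return "CPUExecutionProvider"  # universal fallback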
Containerization strategy
For smaller models like the ones I listed earlier, a convenient pattern is to bake the model directly into the image. That way the model is available the moment the container starts, with no download step at deployment time.
Here’s a simplified view of the Dockerfile I used:
FROM python:3.12-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
ARG MODEL_ID=onnx-community/SmolLM2-135M-Instruct-ONNX
RUN hf download ${MODEL_ID} --local-dir /app/model
COPY . /app
WORKDIR /app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
When you run this container, the model is already on disk: it starts instantly and works without internet access.
Going serverless on Google Cloud Run
Cloud Run is perfect for apps using small models. It scales to zero when you aren’t using it, so you stop paying for resources. Since we’re using CPU, we don’t have to provision GPU instances either.
Deploying from source
We can deploy directly from the source code. The following command builds the container using Google Cloud Build and deploys it to Cloud Run in one step.
We’re intentionally setting a few specific flags here. We allocate 2 CPUs because inference is compute-bound, and 4Gi of memory to comfortably hold a small model and the KV cache. We also set concurrency to 4, which allows the instance to handle a few simultaneous requests without thrashing the cache.
gcloud run deploy onnx-inference \
  --allow-unauthenticated \
  --concurrency 4 \
  --cpu 2 \
  --labels dev-tutorial=onnx-inference \
  --memory 4Gi \
  --region us-central1 \
  --source .
Testing it out
Once the deployment finishes, we need to grab the secure URL of the new service. We can then test it with a simple curl command.
SERVICE_URL=$(gcloud run services describe onnx-inference \
  --region us-central1 \
  --format 'value(status.url)')

curl -X POST "$SERVICE_URL/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is efficient AI important?", "max_new_tokens": 50}'
Troubleshooting
A few things might trip you up:
- Missing genai_config.json: Not all Hugging Face ONNX models include a configuration for the GenAI library. My library tries to infer the config if it’s missing, but it’s better to use models that have it.
- Execution providers: The sample currently includes CUDA, CoreML, and CPU, but it would be straightforward to add other providers such as TensorRT or OpenVINO.
- Parameters: As you scale up max_new_tokens, the KV cache grows and the attention mechanism performs more computations. Keep an eye on memory usage and latency.
Wrapping up
Smaller models have come a long way. With ONNX Runtime GenAI and modest quantization, you can run capable LLMs in places that seemed impossible a few years ago.
This opens up a whole new class of applications: completely private local assistants, intelligent edge devices, and serverless APIs that cost next to nothing to maintain.
If you want to try this out, it takes about 5 minutes to get started. You can grab the code from the onnx-inference repository on GitHub, check out the official ONNX Runtime GenAI documentation for deeper details, or browse the Hugging Face ONNX Community to find your next model.
Share what you’ve built with me on LinkedIn, X, or Bluesky!
