Baseten is one of a growing number of AI infrastructure providers, helping other startups run their models and experiments at speed and scale. Given the importance of those two factors to its customers, Baseten has just passed a significant milestone.
By leveraging the latest Google Cloud A4 virtual machines (VMs), based on NVIDIA Blackwell GPUs, and Google Cloud’s Dynamic Workload Scheduler (DWS), Baseten has achieved 225% better cost-performance for high-throughput inference and 25% better cost-performance for latency-sensitive inference.
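To make the headline number concrete, here is a minimal sketch of how "225% better" reads when cost-performance is taken as throughput per dollar. The baseline figures below are hypothetical placeholders for illustration only; the post does not publish the underlying measurements.

```python
# Interpreting "X% better cost-performance" as tokens generated per dollar.
# Baseline numbers are hypothetical; the blog post does not publish them.

def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Cost-performance for high-throughput inference: tokens per dollar spent."""
    return tokens_per_second * 3600 / dollars_per_hour

baseline = tokens_per_dollar(tokens_per_second=10_000, dollars_per_hour=98.0)

# "225% better" means a 2.25x improvement on top of the baseline,
# i.e. 3.25x the baseline throughput per dollar overall.
improved = baseline * (1 + 2.25)

print(f"baseline: {baseline:>10,.0f} tokens/$")
print(f"improved: {improved:>10,.0f} tokens/$ ({improved / baseline:.2f}x)")
```

Read the other way, the same token volume costs roughly 31% (1/3.25) of what it did before; by the same logic, the 25% figure for latency-sensitive inference works out to 1.25x the baseline throughput per dollar.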
Why it matters: This breakthrough in performance and efficiency enables companies to move powerful agentic AI and reasoning models out of the lab and into production affordably. For technical leaders, it provides a blueprint for building next-generation AI products, such as real-time voice AI, search, and agentic workflows, at a scale and cost-efficiency that were previously unattainable.
The big picture: Inference is the cornerstone of enterprise AI. As models for multi-step reasoning and decision-making demand ever-greater compute, serving them efficiently has become the primary bottleneck. Enter Baseten, a six-year-old Series C company that partners with Google Cloud and NVIDIA to provide enterprises with a scalable inference platform for their proprietary models as well as open models like Gemma, DeepSeek, and Llama, with an emphasis on performance and cost efficiency. Its success hinges on a dual strategy: maximizing the potential of cutting-edge hardware and orchestrating it with a highly optimized, open software stack.
We wanted to share more about how Baseten architected its stack — and what this new level of cost-efficiency can unlock for your inference applications.
Source Credit: https://cloud.google.com/blog/products/ai-machine-learning/how-baseten-achieves-better-cost-performance-for-ai-inference/
