How Google Cloud is democratizing AI inference for data teams — no MLOps expertise required
Introduction
If you’ve ever wanted to run machine learning models on your BigQuery data but dreaded the infrastructure setup, model deployment pipelines, and ongoing maintenance — this one’s for you.
Google Cloud just launched BigQuery managed inference for open models (in Preview), and it fundamentally changes how data professionals can leverage AI. The premise is simple but powerful: deploy and run any open model from Hugging Face or Vertex AI Model Garden using nothing but SQL.
No Kubernetes clusters. No endpoint management. No surprise bills from forgotten deployments.
In this guide, I’ll walk you through everything you need to know to get started, including real-world examples you can adapt for your own use cases.

The Problem This Solves
Let’s be honest about the traditional workflow for running ML inference on data warehouse data:
- Export data from BigQuery to a storage bucket
- Set up a compute environment (VM, Kubernetes, or managed service)
- Download and configure the model
- Write inference scripts
- Manage scaling, monitoring, and cost optimization
- Load results back into BigQuery
This process requires multiple tools, different skill sets, and significant operational overhead. For many data teams, this friction means AI capabilities remain out of reach — even when the models themselves are freely available.
BigQuery’s managed inference collapses this entire workflow into two SQL statements.
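To make that concrete, here is a minimal sketch of the new two-statement shape, with placeholder project, dataset, and table names (fully worked, runnable versions follow in the practical guide below):

```sql
-- Statement 1: deploy an open model from Hugging Face.
CREATE OR REPLACE MODEL `your_project.your_dataset.my_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2');

-- Statement 2: run inference directly in SQL.
SELECT *
FROM AI.GENERATE_EMBEDDING(
  MODEL `your_project.your_dataset.my_model`,
  (SELECT review_text AS content FROM `your_project.your_dataset.reviews`)
);
```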
[Diagram: the current multi-step workflow vs. the same task with this new feature.]

Key Features at a Glance
Before diving into the practical guide, here’s what makes this capability stand out:
| Feature | Benefit |
| --- | --- |
| Single SQL deployment | Create models with one CREATE MODEL statement |
| Auto-scaling | Resources scale based on workload demand |
| Automatic resource cleanup | Idle endpoints are automatically undeployed |
| Customizable compute | Choose machine types, GPU configs, and replica counts |
| Unified billing | All costs visible in your BigQuery billing |
| Native SQL inference | Use familiar AI.GENERATE_TEXT and AI.GENERATE_EMBEDDING functions |
Practical Guide: Getting Started
Prerequisites
Before you begin, ensure you have:
- A Google Cloud project with BigQuery enabled
- Appropriate IAM permissions (BigQuery Admin or equivalent)
- A default BigQuery connection (create one first if your project doesn’t have one)
Example 1: Text Embeddings with Hugging Face Models
Let’s start with a common use case — generating text embeddings for semantic search or clustering. We’ll use the popular sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face.
Step 1: Create the Model
```sql
CREATE OR REPLACE MODEL `your_project.your_dataset.embedding_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (
    hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
  );
```
That’s it. BigQuery will automatically:
- Pull the model from Hugging Face
- Provision appropriate compute resources
- Deploy the model to a Vertex AI endpoint
- Make it available for inference
Note: Initial deployment typically takes 3–10 minutes depending on model size.
Step 2: Generate Embeddings
Now let’s generate embeddings for product descriptions:
```sql
SELECT
  product_id,
  product_name,
  ml_generate_embedding_result
FROM
  AI.GENERATE_EMBEDDING(
    MODEL `your_project.your_dataset.embedding_model`,
    (
      SELECT
        product_id,
        product_name,
        product_description AS content
      FROM `your_project.your_dataset.products`
      WHERE product_description IS NOT NULL
      LIMIT 1000
    )
  );
```
Step 3: Store Embeddings for Vector Search
You can store these embeddings directly in a BigQuery table for later use:
```sql
CREATE OR REPLACE TABLE `your_project.your_dataset.product_embeddings` AS
SELECT
  product_id,
  product_name,
  ml_generate_embedding_result AS embedding
FROM
  AI.GENERATE_EMBEDDING(
    MODEL `your_project.your_dataset.embedding_model`,
    (
      SELECT
        product_id,
        product_name,
        product_description AS content
      FROM `your_project.your_dataset.products`
      WHERE product_description IS NOT NULL
    )
  );
```
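Once stored, these embeddings can feed BigQuery’s VECTOR_SEARCH function for semantic lookups. Here’s a minimal sketch, assuming the product_embeddings table above; the top_k value, COSINE distance, and the example query string are illustrative choices, not recommendations:

```sql
-- Find the five products most similar to an ad-hoc text query.
SELECT
  base.product_id,
  base.product_name,
  distance
FROM
  VECTOR_SEARCH(
    TABLE `your_project.your_dataset.product_embeddings`,
    'embedding',
    (
      -- Embed the search text with the same model used for the catalog.
      SELECT ml_generate_embedding_result AS embedding
      FROM AI.GENERATE_EMBEDDING(
        MODEL `your_project.your_dataset.embedding_model`,
        (SELECT 'waterproof hiking boots' AS content)
      )
    ),
    query_column_to_search => 'embedding',
    top_k => 5,
    distance_type => 'COSINE'
  );
```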
Example 2: Text Generation with Gemma
Now let’s use Google’s Gemma model for text generation tasks like summarization or content creation.
Step 1: Create the Model
```sql
CREATE OR REPLACE MODEL `your_project.your_dataset.gemma_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (
    model_garden_model_name = 'publishers/google/models/gemma3@gemma-3-1b-it'
  );
```
Step 2: Summarize Customer Reviews
```sql
SELECT
  review_id,
  original_review,
  ml_generate_text_result AS summary
FROM
  AI.GENERATE_TEXT(
    MODEL `your_project.your_dataset.gemma_model`,
    (
      SELECT
        review_id,
        review_text AS original_review,
        CONCAT(
          'Summarize this customer review in 2 sentences: ',
          review_text
        ) AS prompt
      FROM `your_project.your_dataset.customer_reviews`
      WHERE LENGTH(review_text) > 200
      LIMIT 100
    )
  );
```
Step 3: Sentiment Classification
```sql
SELECT
  review_id,
  review_text,
  ml_generate_text_result AS sentiment
FROM
  AI.GENERATE_TEXT(
    MODEL `your_project.your_dataset.gemma_model`,
    (
      SELECT
        review_id,
        review_text,
        CONCAT(
          'Classify the sentiment of this review as POSITIVE, NEGATIVE, or NEUTRAL. ',
          'Only respond with one word. Review: ',
          review_text
        ) AS prompt
      FROM `your_project.your_dataset.customer_reviews`
      LIMIT 500
    )
  );
```
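Because the model is instructed to answer with a single word, the raw output can be aggregated directly. If you persist the labels (say, into a hypothetical review_sentiment table via CREATE OR REPLACE TABLE … AS with the query above), a plain GROUP BY gives you the distribution. This sketch treats the sentiment column as a string, mirroring how Example 3 below treats raw_tags; in practice you may need to normalize stray whitespace or casing:

```sql
-- Summarize the label distribution from the stored classifications.
-- `review_sentiment` is a hypothetical table built from the Step 3 query.
SELECT
  UPPER(TRIM(sentiment)) AS label,
  COUNT(*) AS review_count
FROM `your_project.your_dataset.review_sentiment`
GROUP BY label
ORDER BY review_count DESC;
```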
Example 3: Building a Content Tagging Pipeline
Here’s a more comprehensive example — automatically tagging support tickets with categories:
```sql
-- Step 1: Create the model (if not already created)
CREATE OR REPLACE MODEL `your_project.your_dataset.tagging_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (
    model_garden_model_name = 'publishers/google/models/gemma3@gemma-3-1b-it'
  );

-- Step 2: Tag tickets and store results
CREATE OR REPLACE TABLE `your_project.your_dataset.tagged_tickets` AS
WITH tagged AS (
  SELECT
    ticket_id,
    ticket_subject,
    ticket_body,
    ml_generate_text_result AS raw_tags
  FROM
    AI.GENERATE_TEXT(
      MODEL `your_project.your_dataset.tagging_model`,
      (
        SELECT
          ticket_id,
          ticket_subject,
          ticket_body,
          CONCAT(
            'Categorize this support ticket into one or more of these categories: ',
            'BILLING, TECHNICAL, ACCOUNT, SHIPPING, RETURNS, GENERAL. ',
            'Return only the category names separated by commas. ',
            'Subject: ', ticket_subject, ' ',
            'Body: ', ticket_body
          ) AS prompt
        FROM `your_project.your_dataset.support_tickets`
        WHERE status = 'NEW'
      )
    )
)
SELECT
  ticket_id,
  ticket_subject,
  ticket_body,
  raw_tags,
  SPLIT(TRIM(raw_tags), ',') AS tag_array
FROM tagged;
```
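Since tag_array is a repeated field, you can flatten it to one row per ticket-tag pair for routing or reporting. A small follow-up sketch (the TRIM handles the spaces left after splitting on commas):

```sql
-- One row per (ticket, tag), then count tickets per category.
SELECT
  TRIM(tag) AS category,
  COUNT(DISTINCT ticket_id) AS ticket_count
FROM `your_project.your_dataset.tagged_tickets`,
  UNNEST(tag_array) AS tag
GROUP BY category
ORDER BY ticket_count DESC;
```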
Cost Management: Best Practices
One of the biggest concerns with ML infrastructure is runaway costs. BigQuery’s managed inference includes built-in safeguards, but here’s how to optimize further:
1. Configure Idle Timeout
Set an automatic undeployment window to prevent paying for idle resources:
```sql
CREATE OR REPLACE MODEL `your_project.your_dataset.cost_optimized_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (
    hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2',
    endpoint_idle_ttl = INTERVAL 2 HOUR  -- Undeploy after 2 hours of inactivity
  );
```
2. Manual Deployment Control
For scheduled batch jobs, manually control when the endpoint is active:
```sql
-- Before your batch job: Deploy the model
ALTER MODEL `your_project.your_dataset.my_model`
  SET OPTIONS (deploy_model = TRUE);

-- Run your inference queries...

-- After your batch job: Undeploy to stop costs
ALTER MODEL `your_project.your_dataset.my_model`
  SET OPTIONS (deploy_model = FALSE);
```
3. Right-Size Your Compute
For production workloads, specify machine types and replicas:
```sql
CREATE OR REPLACE MODEL `your_project.your_dataset.production_model`
  REMOTE WITH CONNECTION DEFAULT
  OPTIONS (
    hugging_face_model_id = 'sentence-transformers/all-MiniLM-L6-v2',
    machine_type = 'n1-standard-4',
    min_replica_count = 1,
    max_replica_count = 3,
    endpoint_idle_ttl = INTERVAL 4 HOUR
  );
```
4. Clean Up When Done
Don’t forget to drop models you no longer need:
```sql
DROP MODEL IF EXISTS `your_project.your_dataset.my_model`;
```
This automatically cleans up all associated Vertex AI resources.
Real-World Use Cases
E-commerce — Generate embeddings for product catalogs to power “similar items” recommendations, automatically summarize lengthy customer reviews, and improve search relevance by understanding query intent.
Customer Support — Classify and route incoming tickets to the right teams, generate draft responses, and monitor sentiment trends over time.
Content & Media — Auto-tag and categorize content, detect duplicates using semantic similarity, and generate summaries for long-form articles.
Data Quality — Detect PII in text fields, standardize inconsistent entries, and flag anomalies in your data.
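As one illustration, the PII check in the data-quality bucket is just another prompt over the pattern from Example 2. A hedged sketch that reuses the gemma_model created earlier; the user_notes table and its columns are hypothetical placeholders:

```sql
-- Flag rows whose free-text field appears to contain PII.
-- `user_notes` and its columns are hypothetical; adapt to your schema.
SELECT
  note_id,
  note_text,
  ml_generate_text_result AS pii_flag
FROM
  AI.GENERATE_TEXT(
    MODEL `your_project.your_dataset.gemma_model`,
    (
      SELECT
        note_id,
        note_text,
        CONCAT(
          'Does the following text contain personally identifiable ',
          'information (names, emails, phone numbers, addresses)? ',
          'Answer only YES or NO. Text: ',
          note_text
        ) AS prompt
      FROM `your_project.your_dataset.user_notes`
      LIMIT 100
    )
  );
```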
Limitations and Considerations
While this feature is powerful, keep these points in mind:
- Preview status: The feature is in Preview, so APIs may change before GA
- Cold start latency: Initial inference after deployment can take longer; subsequent calls are faster
- Model compatibility: Not all Hugging Face models are supported; check documentation for compatible model types
- Regional availability: Ensure your BigQuery dataset and connection are in supported regions
- Quota limits: Vertex AI quotas apply; request increases for production workloads
What This Means for Data Teams
This launch represents a significant shift in how organizations can approach AI adoption:
For Data Analysts: You can now experiment with ML models without leaving your SQL environment or waiting for engineering resources.
For Data Engineers: Building ML-powered data pipelines becomes dramatically simpler — no separate ML infrastructure to maintain.
For ML Engineers: Focus on model selection and prompt engineering rather than deployment logistics.
For Organizations: Lower barrier to AI adoption means more teams can leverage these capabilities, accelerating AI initiatives across the business.
Getting Started Checklist
Here’s the quick-start checklist:
- Ensure you have a Google Cloud project with BigQuery enabled
- Verify you have the necessary IAM permissions
- Set up a default BigQuery connection if one doesn’t already exist
- Start with a small test dataset (use LIMIT clauses)
- Experiment with different models from Hugging Face
- Configure endpoint_idle_ttl to manage costs
- Scale up once you’ve validated your approach
Resources
- Official Documentation: Creating automatically-deployed open models
- Tutorial: Generate text with the Gemma model
- Tutorial: Generate text embeddings with open models
- Hugging Face Model Hub
- Vertex AI Model Garden
Conclusion
BigQuery’s managed inference for open models removes one of the last major barriers between data teams and AI capabilities. By bringing the entire ML inference workflow into SQL, Google Cloud has made it possible for anyone who can write a query to leverage state-of-the-art models.
The implications are significant: faster experimentation, reduced operational overhead, and democratized access to AI across organizations.
If you’ve been waiting for the right moment to start incorporating ML into your data workflows, that moment is now.
