
- Lakehouse Trend: A central theme is the adoption of the lakehouse architecture, which combines the flexibility of data lakes with the power and governance of data warehouses. BigQuery is positioned as the core of this modern lakehouse.
- Apache Iceberg Integration: Google Cloud is betting heavily on Apache Iceberg as an open standard for table formats. BigQuery extends its Iceberg support beyond simple reads (via BigLake) by introducing managed BigQuery tables for Apache Iceberg. These aim to offer the best of both worlds: the openness and interoperability of Iceberg combined with the performance, scalability, simplified management (automatic optimization, high-performance streaming, soft deletes, auto-tiering), integrated security, and enterprise features of native BigQuery tables; a minimal DDL sketch follows this list.
- Unified Metastore: The BigQuery Metastore becomes central, supporting Hive APIs and the Iceberg catalog. It enables seamless interoperability between different engines (BigQuery, Spark, Flink, Presto/Trino) accessing the same data (on GCS or BigQuery storage) via a single schema, eliminating duplication and ensuring consistency. Spotify, for example, uses this approach to unify access to its GCS data from BigQuery and Dataflow/Spark, drastically reducing duplication and costs. CME Group is also adopting this vision for its unified platform.
- BigLake: Continues to enable querying data in open formats (Iceberg, Parquet, Avro…) on GCS and other clouds as if it were stored in native BigQuery tables.
- Gemini as an AI Assistant: Gemini is deeply integrated into BigQuery to assist users at every step:
  - Data Preparation: BigQuery Data Preparation (now GA and included in BigQuery pricing) uses Gemini to detect issues (e.g., inconsistent schemas), suggest cleanups, and generate SQL or no-code visual pipelines to transform data.
  - Development (SQL, Python, Spark): Assistance for generating, completing, explaining, and translating SQL and Python code (including for BigQuery DataFrames and Serverless Spark in BigQuery Studio). Google reports high acceptance rates for the generated code.
  - Exploration and Analysis: Data Canvas allows visual, interactive data exploration via natural language prompts. Data Insights automatically generates relevant questions and SQL queries on tables and datasets. A conversational agent (API in preview) lets business users query data in natural language (via Looker or custom applications).
  - AI Query Engine (experimental): A major innovation that combines SQL and natural language prompts in a single query to jointly analyze structured and unstructured data at scale, integrating LLM calls directly into the BigQuery execution plan.
  - AI-Assisted Governance: Automatic metadata generation, anomaly detection, and visualization of relationships between entities (Knowledge Engine).
- Data Agents: Google is developing a family of specialized agents (Data Engineering, Data Science, Data Governance, Conversational Analytics) to automate complex tasks, orchestrate workflows, and act proactively based on user-defined objectives, all founded on unified governance. Data Engineering Agents (experimental) aim to automate the creation, modification, troubleshooting, and optimization of data pipelines via natural language or CLI/API.
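Here is that DDL sketch: a hedged, minimal example of creating a managed BigQuery table for Apache Iceberg from Python. The dataset, connection, and bucket names are placeholders, and the exact options may evolve while the feature matures, so treat this as an illustration rather than a reference.

```python
# Minimal sketch: create a managed BigQuery table for Apache Iceberg.
# Dataset, connection, and bucket names below are placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses the default project and credentials

ddl = """
CREATE TABLE mydataset.iceberg_events (
  event_id  STRING,
  event_ts  TIMESTAMP,
  payload   JSON
)
WITH CONNECTION `us-central1.gcs_connection`      -- cloud-resource connection to GCS
OPTIONS (
  file_format  = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri  = 'gs://my-iceberg-bucket/events'  -- where Iceberg data/metadata live
)
"""
client.query(ddl).result()  # runs the DDL; the table then behaves like any BigQuery table
```

Because the data and metadata stay in Iceberg format on GCS, other engines registered against the same metastore can read the very same files.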
BigQuery ML
BQML continues to evolve for predictive and generative ML directly in SQL:
- Support for various models (Gemini, Claude, Llama, Hugging Face) via remote endpoints.
- Support for multimodal data.
- Row-wise functions and structured output (JSON or columns) for LLMs, simplifying parsing.
- Vector Search (GA, with a ScaNN-based index for better performance/cost, Vertex AI Search integration, and support for partitioned tables); a sketch follows this list.
- Contribution Analysis (GA) to identify the causes of metric variations.
- TimesFM (preview): Foundation model for zero-shot time series forecasting in SQL.
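The vector search sketch referenced above, assuming a hypothetical `mydataset.product_embeddings` table (with `product_id`, `content`, and an `embedding` ARRAY<FLOAT64> column) and an already-created remote embedding model `mydataset.embedding_model`:

```python
# Hedged sketch: semantic search with BigQuery VECTOR_SEARCH.
# Table, column, and model names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT base.product_id AS product_id, base.content AS content, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.product_embeddings, 'embedding',       -- table and vector column to search
  (
    SELECT ml_generate_embedding_result AS embedding      -- embed the query text on the fly
    FROM ML.GENERATE_EMBEDDING(
      MODEL `mydataset.embedding_model`,
      (SELECT 'waterproof hiking boots' AS content)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
)
ORDER BY distance
"""
for row in client.query(sql):
    print(row.product_id, row.distance)
```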
BQML is used by Mattel to classify customer feedback (text) at scale, replacing manual and costly processes. Home Depot uses it for both predictive (e.g., lead scoring) and generative (entity extraction) use cases.
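A Mattel-style feedback classification fits in a few lines of BQML. The sketch below assumes a Vertex AI connection (`us.vertex_conn`), a hypothetical feedback table, and an illustrative Gemini endpoint name that should be checked against the currently available models:

```python
# Hedged sketch: text classification with a remote Gemini model in BQML.
from google.cloud import bigquery

client = bigquery.Client()

# 1) Register a remote model backed by a Gemini endpoint (names are illustrative).
client.query("""
CREATE OR REPLACE MODEL `mydataset.gemini_model`
REMOTE WITH CONNECTION `us.vertex_conn`
OPTIONS (endpoint = 'gemini-1.5-flash-002')
""").result()

# 2) Classify every row of a hypothetical feedback table directly in SQL.
sql = """
SELECT
  feedback,
  ml_generate_text_llm_result AS sentiment
FROM ML.GENERATE_TEXT(
  MODEL `mydataset.gemini_model`,
  (
    SELECT CONCAT(
      'Classify this customer feedback as POSITIVE, NEGATIVE or NEUTRAL: ',
      feedback) AS prompt
    FROM mydataset.customer_feedback
  ),
  STRUCT(0.0 AS temperature, TRUE AS flatten_json_output)
)
"""
for row in client.query(sql):
    print(row.sentiment, "|", row.feedback[:60])
```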
Advanced Analytics
- Apache Spark Integration: Serverless Spark is now integrated into BigQuery Studio (preview), allowing PySpark execution in notebooks without cluster management, with unified access to BigQuery data and metadata; see the PySpark sketch after this list. Access to BigQuery data from Spark via the Storage Read API is included in the Spark cost. Gemini also assists with Spark code. Trivago attests to the simplification this integration brings.
- BigQuery DataFrames (BigFrames): Open-source library (v2.0 announced) offering a pandas/scikit-learn-like API that transpiles Python code into BigQuery SQL/BQML, allowing terabytes of data to be processed from a notebook without moving the data (a short sketch follows this list). Now supports complex types (ARRAY, STRUCT, JSON), partial ordering for performance, managed Python UDFs (preview), dbt integration, and soon multimodal data and the AI Query Engine. Deutsche Telekom uses it to modernize its ML platform.
- Graph Analytics (Preview): BigQuery introduces the ability to model and query relational data as graphs using the GQL language (an ISO standard), directly on existing data, without ETL into a dedicated graph database. This enables the discovery of hidden relationships (fraud, recommendations, social networks, drug discovery). Integrates with vector search and Spanner Graph (for transactional cases), with visualization in notebooks and via partners; a GQL-style sketch follows this list. Bio Cortex uses it for drug discovery by analyzing immense graphs of chemical reactions.
- Multimodal Data: Beyond Object Tables, Object Refs (preview) are introduced as a new column type in native BigQuery tables. They allow storing secure references (URI, version, authorizer) to unstructured objects (images, audio, video, documents) in GCS, directly alongside structured data. This simplifies multimodal analysis via BQML, Python UDFs, and the AI Query Engine, while benefiting from BigQuery governance (column/row-level access control).
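The PySpark sketch referenced above: reading a hypothetical BigQuery table through the open-source Spark-BigQuery connector, which is the kind of Storage Read API access the serverless integration provides.

```python
# Hedged PySpark sketch: read a BigQuery table through the Spark-BigQuery connector.
# In BigQuery Studio's serverless Spark notebooks a session is usually pre-provisioned;
# building one explicitly keeps the example self-contained.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-spark-sketch").getOrCreate()

events = (
    spark.read.format("bigquery")
    .option("table", "my-project.mydataset.events")  # hypothetical table
    .load()
)

# Aggregate in Spark; the scan itself goes through the BigQuery Storage Read API.
daily_counts = events.groupBy("event_date").count()
daily_counts.show()
```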
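The BigFrames sketch referenced above, against a hypothetical sales table; the pandas-style operations are transpiled to BigQuery SQL and executed server-side, and only the small final result is pulled back:

```python
# Hedged sketch with BigQuery DataFrames (pip install bigframes).
import bigframes.pandas as bpd

# Lazily reference a hypothetical table; no data is downloaded here.
df = bpd.read_gbq("my-project.mydataset.sales")

# Pandas-style transformations are compiled to BigQuery SQL and run in BigQuery.
top_regions = (
    df[df["amount"] > 0]
    .groupby("region")["amount"]
    .sum()
    .sort_values(ascending=False)
)

# Only this small result is materialized locally.
print(top_regions.head(10).to_pandas())
```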
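And the GQL-style graph sketch, assuming a hypothetical property graph `mydataset.fin_graph` has already been defined over existing account and transfer tables; the syntax follows the ISO GQL pattern-matching style used by Spanner Graph, and the BigQuery preview may differ in details.

```python
# Hedged sketch: query existing relational data as a graph with GQL-style MATCH.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT from_account, to_account, amount
FROM GRAPH_TABLE(
  mydataset.fin_graph                                    -- hypothetical property graph
  MATCH (a:Account)-[t:Transfers]->(b:Account)           -- pattern over nodes and edges
  WHERE t.amount > 10000
  RETURN a.id AS from_account, b.id AS to_account, t.amount AS amount
)
"""
for row in client.query(sql):
    print(row.from_account, "->", row.to_account, row.amount)
```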
Geospatial Data
- Enhanced integration with Google Earth Engine (raster data) and Google Maps Platform.
- New ST_REGIONSTATS function to extract statistics from raster images (Earth Engine or COG on GCS) for defined geographic areas, directly in SQL.
- Access to Earth Engine datasets via Analytics Hub.
- Places Insights (preview): Access to rich Google Maps POI data via Data Clean Rooms in BigQuery for location analysis (e.g., retail site selection).
- Imagery Insights (preview): Access to insights derived from Street View via AI (e.g., asset inventory for utilities, telcos).
- Roads Management Insights (preview): Road data for traffic analysis, safety, etc.
- Simplified Streaming: The Pub/Sub to BigQuery architecture is highlighted as a simple, scalable way to ingest and process real-time data. BigQuery subscriptions (including for Iceberg tables) enable no-code ingestion (a sketch follows this list). Pub/Sub introduces Import Topics (GA) to consolidate streams from cross-cloud Kafka (MSK, Event Hubs, Confluent Cloud) and Single Message Transforms (JavaScript UDFs, coming soon) for lightweight native transformations.
- Continuous Queries (GA): Allow SQL queries to run continuously over data as it arrives in BigQuery. Ideal for real-time analytics, continuous ML model training, and activation via export to destinations like Pub/Sub, Bigtable, and Spanner (reverse ETL); see the sketch after this list. Supports calls to Vertex AI models (Gemini). Flipkart uses Pub/Sub and BigQuery massively for its very large-scale real-time pipelines (e-commerce, personalization).
- Bigtable and BigQuery: Strong synergy for real-time AI use cases. Bigtable serves as a low-latency feature store/cache for applications, fed by batch processing or Continuous Queries in BigQuery. Bigtable also adds SQL support (GA), Distributed Counters (GA), and Continuous Materialized Views (preview) for performant operational analytics. Zeotap uses this combination for its Customer Data Platform.
- Orchestration and Pipelines: BigQuery Pipelines (GA, based on Dataform) allows visual orchestration of SQL steps, Data Prep, and BigQuery notebooks (Spark or Python), with Git integration for CI/CD. Cloud Composer (managed Airflow) and Workflows remain key options.
- Unified Governance in BigQuery: Dataplex capabilities are integrated into the BigQuery experience. The BigQuery Universal Catalog centralizes metadata (technical, business, runtime) for all data types (structured, unstructured, AI) and engines. It powers semantic search, data insights, and AI agents.
- Quality and Security: Automatic anomaly detection (coming soon), data quality rules (via Data Prep or Dataplex), fine-grained access control (IAM, policy tags, row/column-level security, including on Object Refs), masking, lineage (including cross-engine and to Vertex AI), auditing, security controls (CMEK, VPC-SC), and disaster recovery (DR) for BigQuery (GA, included in Enterprise Plus). A row-level security sketch follows this list.
- Data Fabric / Data Mesh: BigQuery is presented as the technology enabling the construction of a data fabric (centralized and unified data management) that can support a data mesh organization (decentralized domain-based data ownership). Virgin Media O2 shares its experience building such a hybrid architecture.
- FinOps: The importance of cost tracking (FinOps) is emphasized, especially during migrations. BigQuery Spend Commit (GA) offers unified financial commitments across different engines (SQL, soon Spark, Composer).
- Why Migrate: Companies (Ford, Quest Diagnostics, PayPal, Intesa Sanpaolo, VMO2) migrate to BigQuery for scalability, performance, cost reduction, operational simplification (serverless), access to AI/ML capabilities, and to break free from on-premises silos or legacy cloud architectures.
- BigQuery Migration Service (BMS): Key free tool offering assessment (including TCO via “light” assessments), Gemini-assisted SQL translation (GA for batch/API, preview for pre-processing), data migration (new connectors for Cloudera, Snowflake in preview, compressed Teradata), validation, and soon source lineage and ETL migration (Informatica, DataStage).
- Best Practices: A phased approach is preferred over a “big bang”; modernization vs. lift-and-shift is weighed case by case; governance and quality matter from the start; and teams should plan for CI/CD, FinOps, change management, communication, inventory, freeze periods and parallel runs, and decommissioning.
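The streaming-ingestion sketch referenced in the Simplified Streaming bullet: creating a BigQuery subscription with the Pub/Sub client library, with hypothetical project, topic, and table names.

```python
# Hedged sketch: a Pub/Sub BigQuery subscription writes each message straight into a table.
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project, "clickstream")                  # existing topic
subscription_path = subscriber.subscription_path(project, "clickstream-bq")

bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.mydataset.clickstream",  # destination table
    use_topic_schema=True,                      # map the topic's schema onto table columns
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
```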
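The continuous-query sketch referenced above is plain SQL; what makes it continuous is submitting it with continuous-query mode enabled (for example from BigQuery Studio) so it keeps running over rows as they arrive. Names are hypothetical, and the exact Pub/Sub export requirements should be checked against current documentation.

```python
# Hedged sketch: a continuous query doing reverse ETL from BigQuery to Pub/Sub.
# The statement below is regular SQL; submit it with continuous-query mode enabled
# so it runs indefinitely and exports each new matching row as it lands.
continuous_sql = """
EXPORT DATA
OPTIONS (
  format = 'CLOUD_PUBSUB',
  uri = 'https://pubsub.googleapis.com/projects/my-project/topics/high-value-orders'
)
AS (
  SELECT TO_JSON_STRING(STRUCT(order_id, customer_id, amount)) AS message
  FROM mydataset.orders
  WHERE amount > 1000
)
"""
print(continuous_sql)  # shown here only as a reference statement to paste into BigQuery Studio
```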
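And the row-level security sketch referenced in the Quality and Security bullet; the table, policy name, and group are hypothetical.

```python
# Hedged sketch: row-level security so a group only sees its own region's rows.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE ROW ACCESS POLICY emea_only
ON mydataset.sales
GRANT TO ('group:emea-analysts@example.com')
FILTER USING (region = 'EMEA')
""").result()
```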
In summary, Google Cloud positions BigQuery as a comprehensive, open, intelligent, and governed data-to-AI platform, capable of managing all types of data and workloads (SQL, Spark, ML, AI, streaming, batch) at scale, while simplifying the user experience through automation and Gemini assistance.
The deep integration with the open-source ecosystem (Iceberg, Spark) and other GCP services (Pub/Sub, Bigtable, Vertex AI, Dataplex) aims to offer a flexible and future-proof solution for the AI era. Numerous customer testimonials (Spotify, Mattel, Ford, PayPal, Flipkart, etc.) illustrate the adoption and success of this approach across various sectors.
Source Credit: https://medium.com/google-cloud/google-cloud-next25-bigquery-as-a-unified-data-and-ai-platform-5b2d59b23205