
A unified Spark and BigQuery experience
Building on the power of serverless Spark, we’ve reimagined how you work with Spark and BigQuery, giving you the flexibility to use the right engine for the right job, on a unified platform with a single notebook interface and a single copy of data.
With the general availability of serverless Apache Spark in BigQuery, we’re bringing Apache Spark directly into the BigQuery unified data platform. This means you can now develop, run, and deploy Spark code interactively in BigQuery Studio, offering an alternative, scalable OSS processing framework alongside BigQuery’s renowned SQL engine.
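To make that concrete, here is a minimal PySpark sketch of the kind of code you might iterate on cell by cell in a BigQuery Studio notebook. It is only an illustration: BigQuery Studio typically provisions a serverless Spark session for Spark notebooks, and the generic session builder below is a stand-in for that.

```python
from pyspark.sql import SparkSession, functions as F

# In a BigQuery Studio Spark notebook a serverless session is typically
# provisioned for you; the generic builder here is a stand-in.
spark = SparkSession.builder.appName("bq-studio-sketch").getOrCreate()

# Build a small DataFrame interactively and refine it cell by cell.
orders = spark.createDataFrame(
    [("US", 120.0), ("DE", 80.5), ("US", 42.0)],
    ["country", "amount"],
)

revenue = (
    orders.groupBy("country")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy(F.desc("total_amount"))
)

revenue.show()
```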
“We rely on machine learning for connecting our customers with the greatest travel experiences at the best prices. With Google Serverless for Apache Spark, our platform engineers save countless hours configuring, optimizing, and monitoring Spark clusters, while our data scientists can now spend their time on true value-added work like building new business logic. We can seamlessly interoperate between engines and use BigQuery, Spark and Vertex AI capabilities for our AI/ML workflows. The unified developer experience across Spark and BigQuery, with built-in support for popular OSS libraries like PyTorch, Tensorflow, Transforms etc., greatly reduces toil and allows us to iterate quickly.” – Andrés Sopeña Pérez, Head of Content Engineering, trivago
Key capabilities and benefits of Spark in BigQuery
Apart from all the features and benefits of Google Cloud Serverless for Apache Spark outlined above, Spark in BigQuery offers deep unification:
- Unified developer experience in BigQuery Studio:
  - Develop SQL and Spark code side-by-side in BigQuery Studio notebooks.
  - Leverage Gemini-based PySpark Code Generation (Preview), which uses the intelligent context of your data to prevent hallucinations in generated code.
  - Use Spark Connect for remote connectivity to serverless Spark sessions (see the Spark Connect sketch after this list).
  - Because Spark permissions are unified with default BigQuery roles, you can get started without needing additional permissions.
- Unified data access and engine interoperability:
  - Powered by the BigLake metastore, Spark and BigQuery operate on a single copy of your data, whether it’s BigQuery managed tables or open formats like Apache Iceberg. No more juggling separate security policies or data governance models across engines. Refer to the documentation on using BigLake metastore with Spark; a representative catalog configuration follows this list.
  - Additionally, all data access to BigQuery, in both native and OSS formats, is unified via the BigQuery Storage Read API. Reads from serverless Spark jobs via the Storage API are now available at no additional cost (see the connector example below).
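On the Spark Connect point above: standard PySpark (3.4+) exposes a `remote()` builder for attaching a thin client to a Spark Connect endpoint. The sketch below shows that generic pattern only; the endpoint is a placeholder, and connecting to a serverless Spark session in practice goes through Google’s client tooling and authentication, which is not shown here.

```python
from pyspark.sql import SparkSession

# Spark Connect: attach a thin client to a remote Spark server.
# "sc://..." is a placeholder endpoint, not a real serverless session
# address; actual connectivity and credentials are handled by Google's
# client tooling.
spark = SparkSession.builder.remote("sc://example-spark-endpoint:15002").getOrCreate()

df = spark.range(10)   # DataFrame operations are defined client-side...
print(df.count())      # ...and executed on the remote Spark session.
```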
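For the BigLake metastore point, the linked documentation covers registering an Iceberg catalog backed by the metastore. The configuration below is a representative sketch, not an authoritative setup: the catalog implementation class and property names are assumptions drawn from that documentation and may differ by release, and the project, location, warehouse, and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Representative (not authoritative) Iceberg catalog configuration backed by
# BigLake metastore; class and property names are assumptions based on the
# referenced documentation, and all values are placeholders.
spark = (
    SparkSession.builder.appName("biglake-metastore-sketch")
    .config("spark.sql.catalog.bq_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config(
        "spark.sql.catalog.bq_catalog.catalog-impl",
        "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
    )
    .config("spark.sql.catalog.bq_catalog.gcp_project", "my-project")              # placeholder
    .config("spark.sql.catalog.bq_catalog.gcp_location", "us-central1")            # placeholder
    .config("spark.sql.catalog.bq_catalog.warehouse", "gs://my-bucket/warehouse")  # placeholder
    .getOrCreate()
)

# Both engines can then work against the same Iceberg tables through the metastore.
spark.sql("SELECT * FROM bq_catalog.my_dataset.my_iceberg_table LIMIT 10").show()
```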
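And for the Storage Read API point, the open-source spark-bigquery connector reads BigQuery tables through that API. A minimal sketch follows, assuming the connector is available to the session (serverless Spark environments generally bundle it, but treat that as an assumption here) and using placeholder table and column names.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # or a notebook-provisioned session

# Reads go through the BigQuery Storage Read API via the spark-bigquery
# connector (assumed to be on the classpath); the table name is a placeholder.
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.events")
    .load()
)

# Simple aggregation over a placeholder timestamp column.
daily = events.groupBy(F.to_date("event_timestamp").alias("day")).count()
daily.orderBy("day").show()
```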
Source Credit: https://cloud.google.com/blog/products/data-analytics/introducing-google-cloud-serverless-for-apache-spark-in-bigquery/