Dataproc features enable AI/ML-ready Apache Spark

Apache Spark is the cornerstone for large-scale data processing, model training, and inference for AI/ML workloads. Yet, the complexities of environment configuration, dependency management, and MLOps integration can slow you down. To accelerate your AI/ML journey, Dataproc now delivers powerful, ML-ready capabilities for Spark. Available on both Dataproc on Compute Engine clusters and Google Cloud Serverless for Apache Spark, these enhancements are engineered to streamline development and operations, reducing setup overhead and simplifying workflows. This allows data scientists and engineers to dedicate more time to building and deploying impactful models rather than wrestling with infrastructure.

Let’s explore what’s new and how to start using these innovations today.

AI/ML-capable runtimes

Getting a Spark environment ready for ML, especially with GPU acceleration, used to involve custom scripts and manual configuration. Dataproc now streamlines this with ML Runtimes: specialized Dataproc on Compute Engine image versions, available from 2.3 onward for Ubuntu-based images, that are designed to accelerate ML workloads. They ship with pre-packaged GPU drivers and libraries (NVIDIA driver, CUDA, cuDNN, NCCL) and common ML libraries such as PyTorch, XGBoost, tokenizers, and transformers, significantly cutting cluster provisioning and setup time.

Google Cloud Serverless for Apache Spark also offers runtimes with pre-installed ML libraries, including XGBoost, PyTorch, tokenizers, and transformers, bringing the same ease of use to a serverless environment.
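To confirm that an environment really has these libraries available, a quick check such as the following can help. It is a minimal sketch, not part of Dataproc: the function name check_libraries is hypothetical, and the list assumes the usual Python import names for the libraries mentioned above (torch for PyTorch, plus xgboost, tokenizers, and transformers).

```python
import importlib.util

# Assumed import names for the libraries the article lists; adjust as needed.
ML_LIBRARIES = ["torch", "xgboost", "tokenizers", "transformers"]


def check_libraries(names):
    """Return {module_name: bool}, True when the module can be located.

    Uses importlib.util.find_spec, which looks a module up without
    importing it, so the check stays fast even for heavy packages.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in check_libraries(ML_LIBRARIES).items():
        print(f"{name}: {'available' if present else 'missing'}")
```

Running this on a cluster node or inside a serverless session should report each library as available if the ML runtime is in use; anything reported as missing would need to be installed separately.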

“At Snap we use Spark on Dataproc for a variety of analytics and ML workloads including running GPU accelerated Spark Rapids, and model training and inference with PyTorch. The new Dataproc 2.3 ML runtime has been really helpful, reducing our cluster startup latency by 75% and eliminating toil for our ML Platform developers to build and manage environments.” – Prudhvi Vatala, Sr. Manager, Snap Inc.

It’s easy to create a Dataproc on Compute Engine cluster that specifies the ML image version and the GPU accelerators your workers need.
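As a rough illustration, the sketch below assembles a gcloud dataproc clusters create command with an ML image version and worker GPUs. The helper build_create_command is hypothetical, and the concrete values (cluster name, region, the image version string 2.3-ml-ubuntu, the nvidia-tesla-t4 accelerator, and the n1-standard-8 machine type) are assumptions for illustration rather than values taken from this article; check the Dataproc documentation for the image versions and accelerators available in your region. The script only prints the command and does not run it.

```python
import shlex


def build_create_command(cluster_name, region, image_version,
                         accelerator_type, accelerator_count=1,
                         num_workers=2, worker_machine_type="n1-standard-8"):
    """Build the argument list for `gcloud dataproc clusters create`.

    Returns a list of strings suitable for subprocess.run; nothing is executed.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        # ML image version (assumed value; verify against the Dataproc docs).
        f"--image-version={image_version}",
        f"--num-workers={num_workers}",
        f"--worker-machine-type={worker_machine_type}",
        # GPUs attached to each worker node.
        f"--worker-accelerator=type={accelerator_type},count={accelerator_count}",
    ]


if __name__ == "__main__":
    cmd = build_create_command(
        cluster_name="my-ml-cluster",        # hypothetical name
        region="us-central1",                # assumed region
        image_version="2.3-ml-ubuntu",       # assumed ML image version string
        accelerator_type="nvidia-tesla-t4",  # assumed GPU type
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run once the values have been verified, while the printed form can be reviewed or pasted into a terminal.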

Source Credit: https://cloud.google.com/blog/products/data-analytics/dataproc-features-enable-aiml-ready-apache-spark/