
Many data science teams rely on Apache Spark running on Dataproc managed clusters for powerful, large-scale data preparation. As these teams look to connect their data pipelines directly to machine learning models, there’s a clear opportunity to simplify the integration. But running inference on a Spark DataFrame using a model from Vertex AI typically requires custom development, making it complex to build a single, end-to-end workflow.
To solve this problem, we are developing a new open-source Python library designed to simplify AI/ML inference for Dataproc. This library connects your Apache Spark jobs to use popular ML frameworks and Vertex AI features, starting with model inference. Because the library is open-sourced, you will be able to use it directly in your application code with full transparency into its operation.
How it works
Dataproc ML is built to feel familiar to Spark users, following a SparkML-style builder pattern. You configure the model you want to use, and then call .transform() on your DataFrame. Let’s look at a few common inference use cases.
- Apply Gemini models to your Spark data
You can apply generative AI models, like Gemini, to columns in your Spark DataFrame. This is useful for tasks like classification, extraction, or summarization at scale. In this example, we take a DataFrame with “city” & “country” columns and use Gemini to create a new column by providing a simple prompt.
You can test in your local environment by installing from PyPi:
Source Credit: https://cloud.google.com/blog/products/data-analytics/gemini-and-vertex-ai-for-spark-with-dataproc-ml-library/