
In this post, I explain how to add pip libraries and Python modules to Dataproc Spark Serverless jobs.
How do you install extra Python pip libraries?
The solution is custom containers. The example Dockerfile in the official documentation builds a full image, which takes around 20 minutes. I prepared a concise version for you that builds the image in about 2 minutes. As you can see, it is based on the condaforge base image. It creates a Python 3.11 environment as an example of using a different Python version (Spark Serverless runtimes come with 3.12) and then runs pip commands to install the dependencies.
FROM condaforge/miniforge3:latest
# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
# Install utilities required by Spark scripts.
RUN apt update && apt install -y --no-install-recommends procps tini libjemalloc2 && rm -rf /var/lib/apt/lists/*
# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
# Configure the Conda installation shipped with the Miniforge base image (CONDA_DIR is set there).
ENV CONDA_HOME=${CONDA_DIR}
ENV CONDA_ENV_NAME=spark_env
ENV PYSPARK_PYTHON=${CONDA_HOME}/envs/${CONDA_ENV_NAME}/bin/python
ENV PATH=${CONDA_HOME}/envs/${CONDA_ENV_NAME}/bin:${CONDA_HOME}/bin:${PATH}
# Create a new Conda environment with Python 3.11.
RUN ${CONDA_HOME}/bin/mamba create -n ${CONDA_ENV_NAME} -y python=3.11 pandas numpy scikit-learn fastavro fastparquet gcsfs pyarrow
# Install pip packages into the new environment.
RUN ${PYSPARK_PYTHON} -m pip install "dynaconf~=3.2.10" \
    "python-dateutil~=2.9" \
    "confluent-kafka[avro]~=2.6.0"
# If you want to embed custom jars and Python modules (although it is better to provide them at runtime):
# # Add extra Python modules.
# ENV PYTHONPATH=/opt/python/packages
# RUN mkdir -p "${PYTHONPATH}"
# # Add extra jars.
# ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
# ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
# RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
# Uncomment below and replace EXTRA_JAR_NAME with the jar file name.
#COPY "EXTRA_JAR_NAME" "${SPARK_EXTRA_JARS_DIR}"
# Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
# You can build the image with Google Cloud Build.
# Run this command in the folder that contains the Dockerfile above.
gcloud builds submit \
--gcs-log-dir gs://BUCKET_NAME/cloud_build/logs \
--service-account projects/PROJECT_ID/serviceAccounts/SA-NAME@PROJECT_ID.iam.gserviceaccount.com \
--region=europe-west4 \
--tag europe-west4-docker.pkg.dev/PROJECT_ID/REGISTRY_NAME/IMAGE_NAME:TAG \
--project=PROJECT_ID
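Once the build finishes, you can optionally sanity-check the image before using it in a job. Assuming you have Docker locally and have authenticated to Artifact Registry (for example with gcloud auth configure-docker europe-west4-docker.pkg.dev), a quick pull-and-import test could look like this:
docker pull europe-west4-docker.pkg.dev/PROJECT_ID/REGISTRY_NAME/IMAGE_NAME:TAG
# "python" resolves to the spark_env interpreter because PATH puts the Conda env first
docker run --rm europe-west4-docker.pkg.dev/PROJECT_ID/REGISTRY_NAME/IMAGE_NAME:TAG \
  python -c "import confluent_kafka, dynaconf, pyarrow; print('imports ok')"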
How do you add your custom Python modules?
You may have extra code, not just a single PySpark script. How are you going to add all those modules? One option is to embed them into the image, but a better way is to use the --py-files option.
# First, zip all the Python modules that you want to add.
# The modules below are actually folders.
zip -q -r spark_project.zip my_extra_module1 my_extra_module2
# A short UUID is generated so that job names don't conflict.
SHORT_ID=$(uuidgen | cut -d'-' -f1 | tr '[:upper:]' '[:lower:]')
gcloud dataproc batches submit pyspark \
--project="PROJECT_ID" \
--region="REGION" \
--batch="job-name-${SHORT_ID}" \
--staging-bucket="PROJECT_ID-dataproc-utility-staging" \
--deps-bucket="gs://PROJECT_ID-dataproc-utility-temp" \
--version=2.3 \
--container-image="europe-west4-docker.pkg.dev/PROJECT_ID/REGISTRY_NAME/IMAGE_NAME:TAG" \
--service-account="PROJECT_ID-dp-sa@PROJECT_ID.iam.gserviceaccount.com" \
--properties="dataproc.sparkBqConnector.version=0.42.0" \
--py-files="spark_project.zip" \
--files="settings.toml" \
--labels="job_name=test-modules,env=dev" \
run.py \
-- \
--arg1="1" \
--arg2="A"
In the code above, our main Python file is run.py, the Python modules are inside spark_project.zip, and the settings file is settings.toml. Python can import modules that live inside a zip file, so uploading the zip directly is fine.
Spark Serverless uploads your files into a directory structure like the one below.
/..../[UUID]
|run.py
|spark_project.zip
|settings.toml
As you can see, they all sit side by side, so you can import your modules as if they were in the same location:
import my_extra_module1
from my_extra_module2 import important_function
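For completeness, here is a minimal sketch of what run.py could look like with the setup above. The argument parsing mirrors the --arg1/--arg2 flags from the submit command; the important_function signature is just a placeholder, so adapt it to your own modules.
# run.py -- a minimal sketch; the important_function signature is hypothetical
import argparse

from dynaconf import Dynaconf
from pyspark.sql import SparkSession

# spark_project.zip is on the Python path, so the zipped packages import normally
from my_extra_module2 import important_function


def main():
    # --arg1 and --arg2 arrive after the "--" separator of the gcloud command
    parser = argparse.ArgumentParser()
    parser.add_argument("--arg1", required=True)
    parser.add_argument("--arg2", required=True)
    args = parser.parse_args()

    # settings.toml was shipped with --files, so it sits next to run.py
    settings = Dynaconf(settings_files=["settings.toml"])

    spark = SparkSession.builder.appName("test-modules").getOrCreate()
    important_function(spark, args.arg1, args.arg2, settings)
    spark.stop()


if __name__ == "__main__":
    main()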
Enjoy!
Source Credit: https://medium.com/google-cloud/how-to-configure-your-dataproc-pyspark-serverless-jobs-5ab963a89dd5