Installing Python packages in Serverless Dataproc (GCP)


I want to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to use an initialization action to install Python packages in Serverless Dataproc? Please let me know.

There is 1 answer below.

You have two options:

  1. Using the gcloud command in a terminal:

You can create a custom container image that includes your dependencies (Python packages), push it to GCR (Google Container Registry), and pass its URI with the --container-image parameter, for example:

$ gcloud beta dataproc batches submit pyspark gs://bucket-name/python-file.py \
    --container-image=gcr.io/my-project-id/my-image:1.0.1 \
    --project=my-project-id --region=us-central1 \
    --jars=file:///usr/lib/spark/external/spark-avro.jar \
    --subnet=projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name

See the Dataproc documentation on how to create a custom container image for Dataproc Serverless for Spark.

  2. Using the DataprocCreateBatchOperator operator in Airflow:

Add the script below to your python-file: it installs the desired packages at runtime and then loads them from a path inside the Dataproc Serverless container. The file must be stored in a bucket; this example uses the Secret Manager package.

python-file.py

import sys
import pip
import importlib
from warnings import warn
from dataclasses import dataclass

def load_package(package, path):
    """Put the install path first on sys.path, then (re)import the package from it."""
    warn("Update path order. Watch out for importing errors!")
    if path not in sys.path:
        sys.path.insert(0, path)

    module = importlib.import_module(package)
    return importlib.reload(module)

@dataclass
class PackageInfo:
    import_path: str
    pip_id: str

packages = [PackageInfo("google.cloud.secretmanager", "google-cloud-secret-manager==2.4.0")]
path = '/tmp/python_packages'

# Install the packages into a local directory inside the container,
# then load them so the job can import them.
pip.main(['install', '-t', path, *[package.pip_id for package in packages]])

for package in packages:
    load_package(package.import_path, path=path)

...   
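Once the packages are installed and loaded, the rest of the job can use them through normal imports, since /tmp/python_packages is now on sys.path. A minimal usage sketch, assuming a hypothetical secret named my-secret already exists in the project:

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Hypothetical secret; replace with a real resource name.
name = "projects/my-project-id/secrets/my-secret/versions/latest"
response = client.access_secret_version(name=name)
print(response.payload.data.decode("UTF-8"))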

Finally, the operator runs python-file.py:

create_batch = DataprocCreateBatchOperator(
    task_id="batch_create",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://bucket-name/python-file.py",
            "args": ["value1", "value2"],
            "jar_file_uris": ["gs://bucket-name/jar-file.jar"],
        },
        "environment_config": {
            "execution_config": {
                "subnetwork_uri": "projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name"
            },
        },
    },
    batch_id="batch-create",
)
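Note that DataprocCreateBatchOperator comes from the Google provider package for Airflow (apache-airflow-providers-google). Below is a minimal sketch of the surrounding imports and DAG, with a hypothetical DAG id and placeholder project/region values; depending on the provider version, region and project_id may also be required arguments:

import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="dataproc_serverless_batch",  # hypothetical DAG id
    start_date=datetime.datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        project_id="my-project-id",
        region="us-central1",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://bucket-name/python-file.py",
            },
        },
        batch_id="batch-create",
    )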