Installing Python packages on an HDInsight on-demand cluster via Azure Data Factory (ADF) Spark activity


How do I install packages that are not available by default when a Spark script is run on an on-demand HDInsight cluster via Azure Data Factory (ADF)?

There is an old question here, but it was never answered: Custom script action in Azure Data Factory HDInsight Cluster

How do I do a pip install inside my PySpark script? Or is there another way?

My PySpark script runs on the on-demand HDInsight cluster via ADF, loading data from a CSV blob into Azure MySQL. (This is a proof-of-concept scenario, so I have to stick with HDInsight for now; no Databricks.)

1 Answer

Answered by DileeprajnarayanThumula

You can include a pip install command in your PySpark script:

!pip install azure-cosmos


  • The above command installs the required package on the on-demand HDInsight cluster before the rest of your PySpark script runs via ADF.
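Note that the `!pip` form only works inside a Jupyter notebook; in a plain PySpark script submitted by ADF you can shell out to the driver's interpreter instead. A minimal sketch (the runtime-install approach and the `pip_install` helper are illustrative, not from the answer; installing on the driver does not install the package on the executors):

```python
import subprocess
import sys

def pip_install(package, runner=subprocess.check_call):
    """Install `package` into the interpreter running this script.

    On an HDInsight driver this affects the driver's Python only;
    for a cluster-wide install, prefer a script action instead.
    """
    runner([sys.executable, "-m", "pip", "install", package])

# Dry run: capture the command that would be executed instead of running it.
captured = []
pip_install("azure-cosmos", runner=captured.append)
print(captured[0])
```

Call `pip_install("azure-cosmos")` with the default runner to actually perform the install; do it before any import of the package.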

You can also use the Custom Script Action feature in HDInsight to run a script that installs required Python packages before executing your Spark script.

See the MS documentation: Customize Azure HDInsight clusters by using script actions.

  • The Bash script URI (the location to access the file) has to be accessible from the HDInsight resource provider and the cluster.

Learn more about Example script action scripts, Permissions, Access control, and Script action during cluster creation.
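A script action is just a Bash script stored at a URI the cluster and the HDInsight resource provider can reach (for example, a blob container). A minimal sketch of preparing such a file locally before uploading it; the Anaconda pip path and the package name are assumptions based on the HDInsight docs, so verify them against your cluster image:

```shell
# Write a hypothetical script-action file that installs a Python package
# cluster-wide. Upload it to blob storage and reference its URI in the
# on-demand HDInsight linked service used by ADF.
cat > install-packages.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Path to the cluster's Anaconda pip (assumed from the HDInsight docs).
/usr/bin/anaconda/bin/pip install azure-cosmos
EOF
chmod +x install-packages.sh
echo "wrote install-packages.sh"
```

Because the script runs on every node (head and worker, as configured), the package is available to executors as well as the driver, unlike an in-script pip install.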

  • HDInsight Spark cluster has two built-in Python installations: Anaconda Python 2.7 and Anaconda Python 3.5.

Learn more about Python packages for the cluster.
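Because the cluster ships two Anaconda installations, it is worth logging at the top of your PySpark script which interpreter you actually got and whether a required package is importable, rather than failing deep inside a job. A small sketch (the `describe_environment` helper and the module names are illustrative):

```python
import importlib.util
import sys

def describe_environment(required_modules):
    """Return the interpreter version and which required modules are importable."""
    available = {
        name: importlib.util.find_spec(name) is not None
        for name in required_modules
    }
    return sys.version_info[:2], available

version, available = describe_environment(["json", "azure"])
print("Python %d.%d" % version)
for name, ok in available.items():
    print("%s: %s" % (name, "present" if ok else "MISSING"))
```

If a required module is missing, you can fail fast with a clear error message instead of a mid-job ImportError on an executor.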

When you are creating the HDInsight cluster, you can add script actions, which invoke custom scripts to customize the cluster. These scripts can install additional components and change configuration settings.

Learn more about script actions during cluster creation.