I am trying to connect to a Microsoft SQL Server instance over JDBC from PySpark on Dataproc.
I am getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.jdbc.
: java.lang.ClassNotFoundException: mssql-jdbc-12.4.0.jre11.jar
The main file (main.py):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()
# JDBC URL of the SQL Server instance and target database
connection_string = 'jdbc:sqlserver://1.2.3.4:1433;databaseName=my_db;'
properties = {'user': 'my_user', 'password': 'my_password'}
# Read the table over JDBC into a DataFrame
df = spark.read.jdbc(
    url=connection_string,
    table='my_table',
    properties=properties
)
The gcloud command:
gcloud dataproc batches submit pyspark \
    --batch my_batch main.py \
    --jars mssql-jdbc-12.4.0.jre11.jar \
    --properties driver=mssql-jdbc-12.4.0.jre11.jar
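
The ClassNotFoundException names the jar file itself, which suggests the jar file name is being used where a Java class name is expected. Below is a minimal sketch of one thing to check, assuming Spark's JDBC `driver` option takes the driver's class name (for the Microsoft driver, `com.microsoft.sqlserver.jdbc.SQLServerDriver`) while the jar is still supplied through `--jars`. The helper `build_jdbc_args` is hypothetical and not part of PySpark; it only assembles the arguments so they can be inspected without a Spark session.

```python
# Assumption: the JDBC "driver" option expects a Java class name, not a jar file name.
SQLSERVER_DRIVER_CLASS = "com.microsoft.sqlserver.jdbc.SQLServerDriver"


def build_jdbc_args(host, port, database, user, password):
    """Hypothetical helper: return (url, properties) for spark.read.jdbc."""
    url = f"jdbc:sqlserver://{host}:{port};databaseName={database};"
    properties = {
        "user": user,
        "password": password,
        # Class name of the driver, as opposed to the jar that contains it
        "driver": SQLSERVER_DRIVER_CLASS,
    }
    return url, properties


url, properties = build_jdbc_args("1.2.3.4", 1433, "my_db", "my_user", "my_password")

# In main.py these would then be passed to the existing read call:
# df = spark.read.jdbc(url=url, table="my_table", properties=properties)
```

The jar would still be passed with `--jars` as in the command above; whether the `--properties driver=...` flag is still needed alongside this has not been verified here.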