Context
I need to run an operation on a number of tables using PySpark. The operation accesses the Spark metastore (in Databricks) to read some metadata. Since I have many tables, I'm parallelizing the operation across the cluster workers with an RDD, as shown in the code below:
from pyspark import SparkContext
base_spark_context = SparkContext.getOrCreate()
rdd = base_spark_context.parallelize(tables_list)
# sync_table() runs on the workers, once per table name
rdd.map(lambda table_name: sync_table(table_name)).collect()
The function sync_table() runs queries against the metastore, similar to this line:
spark_client.session.sql("select 1")
Problem
The SQL call does not succeed. Instead, I get a metastore-related error. Traceback:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
(suppressed lines)
Caused by: java.lang.reflect.InvocationTargetException
(suppressed lines)
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@16c0663d, see the next exception for details.
(suppressed lines)
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /databricks/spark/work/app-20210413201900-0000/0/metastore_db.
Is there a limitation on accessing the Databricks metastore from within a worker when the operation is parallelized this way? Or is there another way to perform such an operation?
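For comparison, the same fan-out can be sketched on the driver with a thread pool, so that every call goes through the driver's existing session rather than a session created on a worker. This is only a sketch under assumptions: sync_table_stub and sync_tables_on_driver are hypothetical names, the stub stands in for the real sync_table(), and the max_workers value is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_table_stub(table_name):
    # Hypothetical stand-in for sync_table(); the real function would run
    # its metastore queries through the driver's SparkSession.
    return f"synced {table_name}"


def sync_tables_on_driver(tables_list, sync_fn, max_workers=8):
    # Threads run inside the driver process, so they share one session
    # instead of each worker trying to boot its own metastore client.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as tables_list.
        return list(pool.map(sync_fn, tables_list))


results = sync_tables_on_driver(["db.table_a", "db.table_b"], sync_table_stub)
```

With the real sync_table() passed as sync_fn, the SQL statements would be submitted from driver threads while Spark still distributes the work of each query across the cluster.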