Spark ETL and Spark Thrift Server

Some details:

  • Spark SQL (version 3.2.1)
  • Driver: Hive JDBC (version 2.3.9)

ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads

The BI tool connects via an ODBC driver.

After starting the Spark Thrift Server, I'm unable to run PySpark scripts with spark-submit, because both processes use the same metastore_db:

Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see the next exception for details.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
        ... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /tmp/metastore_db.

I need to be able to run PySpark (Spark ETL) jobs while the Spark Thrift Server is up for BI tool queries. Is there a workaround?



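In my case, the solution was to move the metastore out of the embedded Derby metastore_db and into a database server such as MySQL (which I used) or PostgreSQL. Embedded Derby lets only one process boot the database at a time, which is what the XSDB6 error reports; a database server accepts connections from the Thrift Server and from spark-submit jobs at the same time.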

You will have to configure $SPARK_HOME/conf/hive-site.xml and put your JDBC driver JAR in the $SPARK_HOME/jars directory.

hive-site.xml example for a MySQL connection:

  <!-- Hive Execution Parameters -->
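
A minimal sketch of the metastore connection properties such a file might contain. The host name (dbhost), database name (hive_metastore), user name, and password are placeholders I am assuming for illustration, not values from my setup, and the driver class assumes MySQL Connector/J 8.x:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- JDBC URL of the metastore database.
       dbhost and hive_metastore are placeholder names. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>

  <!-- Driver class for MySQL Connector/J 8.x.
       Older 5.x connectors use com.mysql.jdbc.Driver instead. -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>

  <!-- Placeholder credentials; replace with your own. -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>change_me</value>
  </property>
</configuration>
```

Both the Thrift Server and spark-submit jobs should pick this file up from $SPARK_HOME/conf, so they talk to the same external metastore instead of each trying to boot the Derby database.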

