Errors while installing Spline (Data Lineage Tool for Spark)


I am trying to install Apache Spline on Windows. My Spark version is 2.4.0 and my Scala version is 2.12.0. I am following the steps mentioned here: https://absaoss.github.io/spline/ I ran the docker-compose command and the UI is up:

    wget https://raw.githubusercontent.com/AbsaOSS/spline/release/0.5/docker-compose.yml
    docker-compose up

After that I tried to run the below command to start the pyspark shell

    pyspark \
      --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:0.5.3 \
      --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
      --conf "spark.spline.producer.url=http://localhost:9090/producer"

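As an aside, the same agent settings passed with `--conf` here can also be kept in Spark's `conf/spark-defaults.conf` (a standard Spark mechanism), so they apply to every session without repeating the flags. The values below are simply copied from the command above; `spark.jars.packages` is the config-file equivalent of `--packages`:

```
spark.jars.packages                za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:0.5.3
spark.sql.queryExecutionListeners  za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.producer.url          http://localhost:9090/producer
```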
This is giving me the following error:

    C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\pyspark\shell.py:45: UserWarning: Failed to initialize Spark session.
      warnings.warn("Failed to initialize Spark session.")
    Traceback (most recent call last):
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\pyspark\shell.py", line 41, in <module>
        spark = SparkSession._create_shell_session()
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\pyspark\sql\session.py", line 583, in _create_shell_session
        return SparkSession.builder.getOrCreate()
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\pyspark\sql\session.py", line 183, in getOrCreate
        session._jsparkSession.sessionState().conf().setConfString(key, value)
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
        return f(*a, **kw)
      File "C:\Users\AyanBiswas\Documents\softwares\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o31.sessionState.
    : java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V

I tried to check what might be the cause of this error, and most of the posts point to a Scala version mismatch. But I am using Scala 2.12.0, and the Spline package mentioned is also built for Scala 2.12. So, what am I missing?


Best answer:

I resolved the error by using Spark 2.4.2 with Scala 2.12.10. The reason is:

  • All Spark 2.x versions are pre-built with Scala 2.11
  • Only Spark 2.4.2 is pre-built with Scala 2.12

This is mentioned on the Spark download page:

Note that, Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.
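In other words, the `_2.12` suffix of the agent bundle must match the Scala binary version your Spark build was compiled against, or trait-initialization methods like `Logging.$init$` resolve differently and you get the `NoSuchMethodError` above. As an illustrative sketch (the helper names here are hypothetical, not part of Spline), the check amounts to comparing the artifact suffix with the first two components of Spark's Scala version:

```python
import re

def scala_suffix(artifact: str) -> str:
    """Extract the Scala binary-version suffix (e.g. '2.12') from a
    Maven artifact id like 'spark-2.4-spline-agent-bundle_2.12'."""
    m = re.search(r"_(\d+\.\d+)$", artifact)
    if m is None:
        raise ValueError(f"no Scala suffix in {artifact!r}")
    return m.group(1)

def is_compatible(bundle_artifact: str, spark_scala_version: str) -> bool:
    """A bundle built for one Scala binary version will not load on a
    Spark build that uses another (it fails at runtime instead)."""
    # Spark reports a full version such as '2.11.12'; only the first
    # two components (the binary version) matter for compatibility.
    binary = ".".join(spark_scala_version.split(".")[:2])
    return scala_suffix(bundle_artifact) == binary

# The bundle from the question vs. the Scala version that the stock
# Spark 2.4.0 binaries are actually built with:
bundle = "spark-2.4-spline-agent-bundle_2.12"
print(is_compatible(bundle, "2.11.12"))  # False -> the observed error
print(is_compatible(bundle, "2.12.10"))  # True  -> the Spark 2.4.2 build
```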

Another answer:

I would try updating your Scala and Spark versions to newer minor versions. Spline internally uses Spark 2.4.2 and Scala 2.12.10, so I would go for those. But I am not sure whether this is the cause of the problem.