I'm using Delta Lake with PySpark by submitting the command below:

spark-sql --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

System Specs:

  • Spark - 3.0.3
  • scala - 2.12.10
  • java - 1.8.0
  • hadoop - 2.7

I'm following these references: https://docs.delta.io/latest/quick-start.html and https://www.confessionsofadataguy.com/introduction-to-delta-lake-on-apache-spark-for-data-engineers/

When I run Spark without the Delta config, it works fine.

Error (truncated stack trace):

22/12/29 20:38:10 INFO Executor: Starting executor ID driver on host 192.168.0.100
22/12/29 20:38:10 INFO Executor: Fetching spark://192.168.0.100:50920/jars/org.antlr_antlr4-runtime-4.7.jar with timestamp 1672326488255
22/12/29 20:38:23 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
22/12/29 20:38:23 INFO Executor: Told to re-register on heartbeat
22/12/29 20:38:23 INFO BlockManager: BlockManager null re-registering with master
22/12/29 20:38:23 INFO BlockManagerMaster: Registering BlockManager null
22/12/29 20:38:23 ERROR Inbox: Ignoring error

    java.lang.NullPointerException
            at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
            at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
            at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
            at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
            at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
            at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
            at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    22/12/29 20:38:23 WARN Executor: Issue communicating with driver in heartbeater
    org.apache.spark.SparkException: Exception thrown in awaitResult:
            at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
            at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
            at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
            at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
            at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:66)
            at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:567)
            at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:934)
            at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:200)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
            at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.NullPointerException
            at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
            at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
            at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
            at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            ... 3 more
    22/12/29 20:38:31 ERROR Utils: Aborting task
    java.io.IOException: Failed to connect to /192.168.0.100:50920
            at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
            at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
            at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
            at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
            at scala.Option.getOrElse(Option.scala:189)
            at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
            at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49)
            at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:321)  
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: /192.168.0.100:50920
    Caused by: java.net.ConnectException: Connection timed out: no further information
            at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
            at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
            at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
            at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
            at java.lang.Thread.run(Thread.java:745)
    22/12/29 20:38:31 ERROR SparkContext: Error initializing SparkContext.
    java.io.IOException: Failed to connect to /192.168.0.100:50920
            at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
            at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
            at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
            at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
            at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
            at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
            at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
            at java.lang.Thread.run(Thread.java:745)
    22/12/29 20:38:31 INFO SparkUI: Stopped Spark web UI at http://192.168.0.100:4041
    22/12/29 20:38:31 ERROR Utils: Uncaught exception in thread main
    java.lang.NullPointerException
            at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:168)
            at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
            at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:734)
            at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2171)
    java.lang.NullPointerException
            at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:304)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
            at org.apache.spark.executor.Executor.stop(Executor.scala:304)
            at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:74)
            at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
            at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
            at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
            at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
    22/12/29 20:38:31 INFO ShutdownHookManager: Shutdown hook called

What am I missing?

Answer:
Try it like this:

# Make the local Spark installation discoverable from Python
!pip3 install findspark --user
import findspark
findspark.init()

# Pass the Delta package and driver options to the PySpark shell
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.12:2.2.0 --driver-memory 2g pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your application") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
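
Once the session comes up, a quick smoke test like the sketch below verifies that the Delta classes are actually on the classpath (the ./delta-smoke-test path is just an illustrative local directory):

# Minimal Delta round-trip: write a tiny table and read it back.
# "./delta-smoke-test" is only an example path, not part of the original setup.
spark.range(5).write.format("delta").mode("overwrite").save("./delta-smoke-test")
spark.read.format("delta").load("./delta-smoke-test").show()

If this fails with a ClassNotFoundException rather than the networking errors above, the Delta package was not resolved; if it succeeds, the session and catalog configuration are working.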