Spark Error: Could not initialize class org.apache.spark.rdd.RDDOperationScope$


I'm trying to print rows from a Spark DataFrame in Amazon SageMaker. The DataFrame is created by reading a table from a Redshift database. Evaluating the DataFrame by itself correctly shows the column names and types; however, trying to display the actual contents of the table results in an error.

I used the following code to connect to the Redshift database and read the table:

from pyspark.sql import SparkSession

# Spark session
spark = SparkSession.builder \
    .appName("Redshift Connection with PySpark") \
    .config("spark.driver.extraClassPath", "/path_name/Drivers/redshift-jdbc42-2.1.0.17.jar") \
    .getOrCreate()

# Redshift properties
redshift_url = "jdbc:redshift://{host}:{port}/{database}".format(
    host="redshift_host.redshift.amazonaws.com",
    port="1234",
    database="databasename"
)

redshift_properties = {
    "user": user,
    "password": pw,
    "driver": "com.amazon.redshift.jdbc42.Driver"
}

# Read query: spark.read.jdbc expects a table name, so a raw query
# must be wrapped as a parenthesized subquery with an alias
query = "(SELECT * FROM schema.table_name) AS t"
df = spark.read.jdbc(redshift_url, query, properties=redshift_properties)
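
For reference, an equivalent way to attach the driver is the spark.jars option, which also ships the jar to the executors; spark.driver.extraClassPath only affects the driver JVM, and only if it is set before that JVM starts, which is easy to miss in a notebook where a session may already be running. A minimal sketch, reusing the app name and jar path from above:

from pyspark.sql import SparkSession

# spark.jars distributes the JDBC driver jar to the driver and all executors
spark = SparkSession.builder \
    .appName("Redshift Connection with PySpark") \
    .config("spark.jars", "/path_name/Drivers/redshift-jdbc42-2.1.0.17.jar") \
    .getOrCreate()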

All column names and types are read correctly:

df
>>> DataFrame[column_1: string, column_2: timestamp, column_3: bigint]
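
Schema-only operations behave the same way, which is expected: the JDBC schema is fetched eagerly when the DataFrame is created, while show() is the first call that actually executes the query. A quick sanity check that touches no rows (the nullability flags below are illustrative):

# No Spark job runs here; only the eagerly fetched JDBC schema is printed
df.printSchema()
>>>
root
 |-- column_1: string (nullable = true)
 |-- column_2: timestamp (nullable = true)
 |-- column_3: bigint (nullable = true)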

However, trying to show the rows raises the following error:

df.show()

>>>
Py4JJavaError: An error occurred while calling o378.showString.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
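
RDDOperationScope$ is the Spark companion object that serializes RDD operation scopes with a Jackson ObjectMapper, so a "Could not initialize class" failure here commonly points at an incompatible jackson-databind on the classpath (for example, one bundled in or pulled in alongside an extra jar) rather than at the Redshift connection itself. A minimal diagnostic sketch under that assumption, using the session's Py4J gateway to see where Jackson is being loaded from:

# Spark version as seen from Python; the JVM's jars should match it
print(spark.version)

jvm = spark.sparkContext._jvm
# Ask the JVM which jar ObjectMapper was loaded from; anything other than
# Spark's own jackson-databind jar suggests a version clash
cls = jvm.java.lang.Class.forName("com.fasterxml.jackson.databind.ObjectMapper")
print(cls.getProtectionDomain().getCodeSource().getLocation())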