I'm trying to print rows from my Spark DataFrame in Amazon SageMaker. I created the DataFrame by reading a table from a Redshift database over JDBC. Printing the DataFrame object itself shows the column names and types correctly. However, trying to show the actual contents of the table results in an error.
I used the following code to connect to the Redshift database and read the table:
from pyspark.sql import SparkSession

# Spark session with the Redshift JDBC driver on the driver classpath
spark = SparkSession.builder \
    .appName("Redshift Connection with PySpark") \
    .config("spark.driver.extraClassPath", "/path_name/Drivers/redshift-jdbc42-2.1.0.17.jar") \
    .getOrCreate()

# Redshift connection properties (user and pw are defined elsewhere)
redshift_url = "jdbc:redshift://{host}:{port}/{database}".format(
    host="redshift_host.redshift.amazonaws.com",
    port="1234",
    database="databasename"
)
redshift_properties = {
    "user": user,
    "password": pw,
    "driver": "com.amazon.redshift.jdbc42.Driver"
}

# Read query
query = "SELECT * FROM schema.table_name"
df = spark.read.jdbc(redshift_url, query, properties=redshift_properties)
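For what it's worth, spark.read.jdbc() documents its second positional argument as a table name (or anything valid in a SQL FROM clause), so a query is usually passed in the parenthesized-subquery form instead. Since the schema resolves correctly above, the read does reach Redshift either way, but for completeness here is the same read in the subquery form (a sketch reusing the same connection details; I don't know whether this changes anything):

# Same read expressed as a parenthesized subquery with an alias,
# the form jdbc() expects when given a query rather than a table name
subquery = "(SELECT * FROM schema.table_name) AS src"
df = spark.read.jdbc(redshift_url, subquery, properties=redshift_properties)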
All column names and types are read correctly:
df
>>> DataFrame[column_1: string, column_2: timestamp, column_3: bigint]
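As I understand it, evaluating df on its own is lazy: Spark only runs a schema probe over JDBC and no job is executed, which is presumably why this part succeeds. show() is the first action here, i.e. the first point where a Spark job actually runs. A small sketch of that distinction (my assumption is that any other action would fail the same way, since they all execute the plan):

df.schema     # metadata only, no Spark job is launched
df.columns    # same, just the analyzed schema
df.take(1)    # an action: executes the plan, so it should hit the same error as show()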
However, trying to show the rows raises the following error:
df.show()
>>>
Py4JJavaError: An error occurred while calling o378.showString.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
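From what I can tell, "Could not initialize class org.apache.spark.rdd.RDDOperationScope$" is a static-initializer failure: RDDOperationScope builds a Jackson ObjectMapper when the class is first loaded, so a Jackson version clash on the classpath is a common cause. A small diagnostic sketch (an assumption about my setup, using py4j access to the JVM) to see which jackson-databind version the JVM actually loaded:

# Print the jackson-databind version loaded in the JVM; a mismatch with
# the Jackson version Spark expects can make RDDOperationScope$'s
# static initializer fail with exactly this NoClassDefFoundError.
jvm = spark.sparkContext._jvm
print(jvm.com.fasterxml.jackson.databind.ObjectMapper().version())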