After upgrading to Glue 3.0, I get the following error when handling RDD objects:

An error occurred while calling o926.javaToPython. You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

I've already added the config mentioned in the docs:

--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED

This is really a blocking issue that prevents the Glue jobs from running!

Note: locally I'm using PySpark 3.1.2, and the same data works with no problem.
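
For context, the local check is nothing special, roughly the following (a minimal sketch; the Parquet path is just a placeholder for the same data the Glue job reads):

from pyspark.sql import SparkSession

# Plain local PySpark 3.1.2 session, no rebase-mode settings at all.
spark = SparkSession.builder.appName("local-check").getOrCreate()

# Placeholder path standing in for the same Parquet data the Glue job reads.
df = spark.read.parquet("/path/to/same/data")
df.show(5)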


There are 3 best solutions below

Solution 1

I faced the same issue when following the AWS docs, since the general Glue recommendation is that we should not set the --conf parameter ourselves, as it is used internally. My solution was the following:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Set the rebase modes on the SparkConf before the SparkContext exists,
# so the context never has to be stopped and recreated.
conf = SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

The problem I had with Mauricio's answer was that sc.stop() actually stops the Spark context on Glue 3.0, which disrupted the stream of data I was ingesting from the data source (RDS in my case).
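
A quick sanity check that the values actually landed on the session, without stopping anything (just a sketch, using the same keys as above):

# Each of these should print CORRECTED if the SparkConf was picked up.
for key in (
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite",
):
    print(key, "=", spark.conf.get(key))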

Solution 2

Setting the option on the SparkContext did not work for me; I had to set it on the spark_session.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()  # version 3.1.1-amzn-0
conf = sc.getConf()

# Setting the option on the SparkConf at this point does NOT work,
# because the context already exists:
# conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")  # SPARK-31404

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Setting the option on the spark_session DOES work:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")  # SPARK-31404
print(spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite"))  # prints LEGACY

This may depend on how you are using sc and spark; I was querying with:

df = spark.read.format("xml").etc   
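
If you want to double-check that the write-side setting is really in effect, a small test like this works (a sketch; the output path is just a placeholder):

import datetime

# One row with a timestamp before 1900-01-01, the exact case the error is about.
test_df = spark.createDataFrame([(datetime.datetime(1880, 1, 1),)], ["ts"])

# With datetimeRebaseModeInWrite set to LEGACY this write goes through;
# without it, the pre-1900 timestamp makes the write fail.
test_df.write.mode("overwrite").parquet("/tmp/rebase-check")  # placeholder path
spark.read.parquet("/tmp/rebase-check").show()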
Solution 3

I solved it like this. The default Glue boilerplate first:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# args comes from getResolvedOptions(sys.argv, ['JOB_NAME']) in the standard Glue boilerplate
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

Then add the additional Spark configuration and recreate the context:

conf = sc.getConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Stop the default context and recreate it with the updated conf.
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session  # re-derive the session from the new context

... your code
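
"Your code" can then read the Parquet that previously failed through the new context, for example (the S3 path is just a placeholder):

# Hypothetical example: the Parquet read that was failing before the conf change.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/legacy-parquet/"]},  # placeholder
    format="parquet",
)
dyf.toDF().show(5)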