We use data-type-dependent logic in Spark 3.2. For the interval year data type, the DataFrame methods schema and dtypes don't work.
Without an interval year column, both methods work fine:
df1 = spark.range(1)
df1.printSchema()
# root
# |-- id: long (nullable = false)
print(df1.schema)
# StructType(List(StructField(id,LongType,false)))
print(df1.dtypes)
# [('id', 'bigint')]
But as soon as I add an interval year column, both schema and dtypes throw a parsing error:
from pyspark.sql import functions as F

df2 = df1.withColumn('col_interval_y', F.expr("INTERVAL '2021' YEAR"))
df2.printSchema()
# root
# |-- id: long (nullable = false)
# |-- col_interval_y: interval year (nullable = false)
print(df2.schema)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year
print(df2.dtypes)
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year
For our logic to work, we need to access the column data types of a DataFrame. How can we access the interval year type in Spark 3.2? (Spark 3.5 doesn't throw these errors, but we cannot upgrade to it yet.)
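To make the requirement concrete, here is a hypothetical sketch of the kind of data-type-dependent logic meant above (the helper name and usage are illustrative only); it fails on df2 because it calls dtypes:

def columns_of_type(df, type_name):
    # Select the names of columns whose Spark SQL type string matches type_name
    return [name for name, dtype in df.dtypes if dtype == type_name]

print(columns_of_type(df1, 'bigint'))
# ['id']
columns_of_type(df2, 'interval year')
# ValueError: Unable to parse datatype from schema. Could not parse datatype: interval year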
I have found that it's possible to use the underlying _jdf. The following recreates the result of dtypes:
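A minimal sketch of how that can look, assuming the JVM StructType returned by _jdf.schema() is reachable through py4j and that simpleString() yields the type names shown in the comments (the helper name jvm_dtypes is mine):

def jvm_dtypes(df):
    # Read the schema on the JVM side, bypassing the Python-side datatype parser
    jfields = df._jdf.schema().fields()  # Java array of StructField
    return [
        (jfields[i].name(), jfields[i].dataType().simpleString())
        for i in range(len(jfields))
    ]

print(jvm_dtypes(df2))
# [('id', 'bigint'), ('col_interval_y', 'interval year')]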