Spark (2.3) not able to identify new columns in Parquet table added via Hive ALTER TABLE command

I have a Hive Parquet table which I am creating from Spark 2.3 using the df.write.saveAsTable API. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements). However, the next time I read the Parquet table into a Spark DataFrame, the new column that was added via the Hive ALTER TABLE command does not show up in the df.printSchema output.
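For illustration, this is roughly the flow (mydb.events and new_col below are placeholder names, not my real ones):

df.write.format("parquet").saveAsTable("mydb.events")   // step 1: table created from Spark 2.3

// step 2: a separate Hive process later runs something like:
//   ALTER TABLE mydb.events ADD COLUMNS (new_col STRING);

spark.table("mydb.events").printSchema()   // step 3: new_col is missing from the schema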

Based on my initial analysis, it seems there is some conflict and Spark is using its own copy of the schema instead of reading the one in the Hive metastore. Hence, I tried the options below:

- Changing the Spark setting spark.sql.hive.convertMetastoreParquet=false
- Refreshing the Spark catalog: spark.catalog.refreshTable("table_name")
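For reference, this is roughly how I applied them in spark-shell (the table name is again a placeholder):

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.catalog.refreshTable("mydb.events")
spark.table("mydb.events").printSchema()   // the new column is still not listed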

However, neither of these options solved the problem.

Any suggestions or alternatives would be super helpful.

There are 2 best solutions below

To fix this, you have to run the same ALTER command you used in Hive from spark-shell as well:

spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

This sounds like the bug described in SPARK-21841. The JIRA description also contains an idea for a possible workaround:

...Interestingly enough it appears that if you create the table differently like:

spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")

Run your alter table on mydb.t1, then:

val t1 = spark.table("mydb.t1")

Then it works properly...
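Putting the quoted workaround together, the sequence would look roughly like this (mydb.t1, mydb.test_table and ip_address are the names from the JIRA example; col_b is a placeholder for whatever column you add):

spark.sql("create table mydb.t1 as select ip_address from mydb.test_table limit 1")
spark.sql("alter table mydb.t1 add columns (col_b string)")
val t1 = spark.table("mydb.t1")
t1.printSchema()   // col_b shows up when the table was created this way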