Spark (2.3) not able to identify new columns in Parquet table added via Hive ALTER TABLE command

I have a Hive Parquet table which I am creating from Spark 2.3 using the df.write.saveAsTable API. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements). However, the next time I read the Parquet table into a Spark DataFrame, the new column that was added via the Hive ALTER TABLE command does not show up in the df.printSchema output.
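For illustration, this is roughly the flow (mydb.events and new_col below are placeholder names, not my real ones):

df.write.format("parquet").saveAsTable("mydb.events")   // step 1: table created from Spark 2.3

// step 2: a separate Hive process later runs something like:
//   ALTER TABLE mydb.events ADD COLUMNS (new_col STRING);

spark.table("mydb.events").printSchema()   // step 3: new_col is missing from the schema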

Based on my initial analysis, it seems there is some conflict and Spark is using its own copy of the schema instead of reading the one in the Hive metastore. Hence, I tried the options below:

- Changing the Spark setting spark.sql.hive.convertMetastoreParquet=false
- Refreshing the Spark catalog: spark.catalog.refreshTable("table_name")
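For reference, this is roughly how I applied them in spark-shell (the table name is again a placeholder):

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.catalog.refreshTable("mydb.events")
spark.table("mydb.events").printSchema()   // the new column is still not listed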

However, neither of these options solved the problem.

Any suggestions or alternatives would be super helpful.

There are 2 best solutions below

To fix this, you have to run the same ALTER command you used in Hive from spark-shell as well:

spark.sql("alter table TABLE_NAME add COLUMNS (col_A string)")

This sounds like the bug described in SPARK-21841. The JIRA description also contains an idea for a possible workaround:

...Interestingly enough it appears that if you create the table differently like:

spark.sql("create table mydb.t1 select ip_address from mydb.test_table limit 1")

Run your alter table on mydb.t1, then:

val t1 = spark.table("mydb.t1")

Then it works properly...
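Putting the quoted workaround together, the sequence would look roughly like this (mydb.t1, mydb.test_table and ip_address are the names from the JIRA example; col_b is a placeholder for whatever column you add):

spark.sql("create table mydb.t1 as select ip_address from mydb.test_table limit 1")
spark.sql("alter table mydb.t1 add columns (col_b string)")
val t1 = spark.table("mydb.t1")
t1.printSchema()   // col_b shows up when the table was created this way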