Data in hive table is changed after running a compaction in pyspark

130 Views Asked by Liran Eliyahu At 16 July 2023 at 13:30

This follows up on a previously asked question; adding the link.

In short: I wrote a file compactor in Spark. It works by reading all files under a directory into a DataFrame, coalescing the DataFrame to the desired number of files, writing them back into their directory, and then compressing them with Snappy.
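For readers who want to see the shape of such a compactor, here is a minimal sketch of the steps described above. It is not the asker's actual code: the function names, the `staging_path` parameter, the 128 MiB target size, and the default `parquet` format are all assumptions made for illustration (the question does not state the file format). The sketch writes to a separate staging directory instead of the directory being read, since overwriting a path while reading from it is risky in Spark.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output files needed so each is roughly target_file_bytes (assumed target)."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def compact_partition(spark, partition_path: str, staging_path: str,
                      num_files: int, fmt: str = "parquet") -> None:
    """Read one partition directory, coalesce it, and write it to a staging directory.

    `spark` is an existing SparkSession; `fmt` is an assumption and must match
    the storage format the Hive table was declared with.
    """
    df = spark.read.format(fmt).load(partition_path)
    (df.coalesce(num_files)              # merge into at most num_files partitions
       .write
       .format(fmt)
       .option("compression", "snappy")  # compress while writing
       .mode("overwrite")
       .save(staging_path))
```

After the write succeeds, the staging directory would still have to be swapped into place of the original partition directory; that step depends on the file system and is left out here.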

The problem: the directories I'm compacting are actually partitions of a table in Apache Hive. After writing the files back into their directory and running a basic select query over the partition in Hive, the data appears to have been altered. For example:

This table:

Column A	Column B
1	null
null	1

Turns into:

Column A	Column B
1	null
1	null

Can someone please help me understand why the data is being altered and how I can fix it?
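One way to narrow this down is to compare the rows before and after compaction as multisets, so that a change like the one in the tables above is detected regardless of row order. The sketch below is only an illustration: `rows_match` and `partition_unchanged` are hypothetical helper names, and collecting to the driver is only practical for small partitions (for large ones, `DataFrame.exceptAll` performs a comparable check inside Spark).

```python
from collections import Counter


def rows_match(before_rows, after_rows) -> bool:
    """True when both row collections hold the same rows with the same multiplicities."""
    return Counter(map(tuple, before_rows)) == Counter(map(tuple, after_rows))


def partition_unchanged(before_df, after_df) -> bool:
    """Compare two Spark DataFrames by collecting them to the driver (small partitions only)."""
    return rows_match(before_df.collect(), after_df.collect())


# The two tables from the question, as (Column A, Column B) tuples.
before = [(1, None), (None, 1)]
after = [(1, None), (1, None)]
print(rows_match(before, after))  # False: the second row changed
```

Running the same comparison on the DataFrame read by Spark and on the result of the Hive query would show whether the rows change during the Spark rewrite or only when Hive reads the new files.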


There are 2 best solutions below

Ariel Grosh On 16 July 2023 at 13:39

It sounds like your problem is in the coalesce step: when it merges the data it may replace some of it, which can lead to inconsistencies.

Indrajit Swain On 16 July 2023 at 15:36

Compaction of a partitioned Hive table in PySpark can be performed using the code below.
Make sure you mention the partition columns.

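db_name = "your_db_name"  # placeholder, like table_name

table_name = "your_table_name"

partition_columns = ["partition_col1", "partition_col2"]

# Re-sync the partitions found on disk with the Hive metastore
spark.sql(f"MSCK REPAIR TABLE {db_name}.{table_name}")

# Optional: recompute table statistics
spark.sql(f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS")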

Computing the statistics is completely optional.
