This is a follow-up to a previously asked question (link).
In short: I wrote a file compactor in Spark. It works by reading all the files under a directory into a DataFrame, calling `coalesce` on the DataFrame (with the number of files wanted), writing the result back into the same directory, and compressing it with Snappy.
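For reference, the flow described above can be sketched roughly as follows in PySpark. This is only a sketch under assumptions: the function name `compact_directory`, its parameters, and the Parquet default are hypothetical, and the output goes to a separate directory (to be swapped in afterwards) rather than over the files that are still being read.

```python
def compact_directory(spark, src_dir, out_dir, num_files, fmt="parquet"):
    """Rewrite the files under src_dir as at most num_files Snappy-compressed files.

    spark is an existing pyspark.sql.SparkSession. All names here are
    hypothetical; the file format (Parquet) is an assumption.
    """
    # Read every file under the directory into a single DataFrame.
    df = spark.read.format(fmt).load(src_dir)

    # DataFrame.coalesce(n) lowers the partition count, so at most
    # num_files output files are written.
    (
        df.coalesce(num_files)
        .write
        .mode("overwrite")
        .option("compression", "snappy")
        .format(fmt)
        .save(out_dir)
    )
    return out_dir
```

A call would look like `compact_directory(spark, src, tmp, 4)`, where `src` and `tmp` are the partition directory and a scratch directory of your choosing.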
The problem: the directories I'm compacting are actually partitions of an Apache Hive table. After writing the files back into their directory and running a basic SELECT query over the partition in Hive, the data appears to have been altered. For example:
This table:
| Column A | Column B |
|---|---|
| 1 | null |
| null | 1 |
Turns into:
| Column A | Column B |
|---|---|
| 1 | null |
| 1 | null |
Can someone please help me understand why the data is being altered and how I can fix it?
Sounds like your problem is in the `coalesce` directive: when it unifies the data it replaces it, which can lead to inconsistencies.
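It may help to check which `coalesce` is actually in play, since Spark has two: `DataFrame.coalesce(n)` reduces the number of partitions, while the SQL function `coalesce(a, b, ...)` returns the first non-null argument per row. The plain-Python toy model below is not Spark code and does not mirror Spark's internals; the function names are made up for illustration. It only contrasts the two behaviours on the rows from the question.

```python
def coalesce_partitions(partitions, n):
    """Toy model of DataFrame.coalesce(n): merge partitions, keep rows as-is."""
    n = max(1, min(n, len(partitions)))
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # rows are moved, never modified
    return merged


def coalesce_values(*values):
    """Toy model of SQL coalesce(a, b, ...): first non-null argument."""
    for value in values:
        if value is not None:
            return value
    return None


rows = [(1, None), (None, 1)]
partitions = [[rows[0]], [rows[1]]]

# Partition-level coalesce: two partitions become one, rows are unchanged.
flat = [row for part in coalesce_partitions(partitions, 1) for row in part]
print(flat)     # [(1, None), (None, 1)]

# Value-level coalesce: each row collapses to its first non-null value.
per_row = [coalesce_values(a, b) for a, b in rows]
print(per_row)  # [1, 1]
```

If the compactor only calls `coalesce` on the DataFrame itself, the first behaviour is the relevant one.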