Apache Hudi Size Requirements


Does Apache Hudi store changes within a file, or only whole files? In other words, does it:

  1. Record a deletion of the previous table and an addition of a whole new, modified table? OR
  2. Does it record insert/update/deletes on the row level?

Suppose you have a 100MiB table and you change a single row representing 1KiB of data, and you make 100 such changes. Will it take up approximately 100 * 1KiB of space or 100 * 100MiB?

This may depend on the engine, so an answer that differs by engine is acceptable.


1 Answer


When using the COW (copy-on-write) table type, you rewrite the whole Parquet file for each modification. So in the case of a 100 MiB table stored as a single Parquet file, 100 modifications mean rewriting 100 MiB a hundred times. However, there is a cleaning process that removes old versions of that file.
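
A minimal PySpark sketch of such a COW write (the table name, column names, path, and cleaner value are assumptions for illustration, not the asker's setup):

```python
# Assumes a SparkSession with the Hudi Spark bundle on the classpath and a
# DataFrame `updates_df` holding the changed rows (columns `id` and `ts`
# are illustrative assumptions).
hudi_cow_options = {
    "hoodie.table.name": "my_table",                        # assumed table name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # COW layout
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Cleaner: retain only the file versions of the last 2 commits, so the
    # older 100 MiB rewrites are deleted instead of accumulating.
    "hoodie.cleaner.commits.retained": "2",
}

# Each such write rewrites the whole Parquet base file containing the row.
(updates_df.write
    .format("hudi")
    .options(**hudi_cow_options)
    .mode("append")
    .save("s3://bucket/path/my_table"))  # assumed target path
```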

When using the MOR (merge-on-read) table type, you keep the 100 MiB Parquet file and append 100 Avro log files of a few KiB each. The gotcha is that compaction merges the Avro logs and the Parquet base file into a new Parquet file from time to time.
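
The same sketch for a MOR table, with inline compaction enabled (again, values are illustrative assumptions):

```python
# Same assumptions as above, but each upsert now appends a small Avro log
# file next to the 100 MiB Parquet base file instead of rewriting it.
hudi_mor_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # MOR layout
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Compaction: after N delta commits, merge the Avro logs and the Parquet
    # base file into a new Parquet base file (the "gotcha" mentioned above).
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "10",
}

(updates_df.write
    .format("hudi")
    .options(**hudi_mor_options)
    .mode("append")
    .save("s3://bucket/path/my_table"))
```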

EDIT: while some people keep the whole Hudi timeline without cleaning or archiving (and as a result the whole Parquet file history), this leads to serious performance problems once the timeline has a large number of commits. The write amplification also leads to high storage costs, so Hudi is not designed for that use case.
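
To avoid keeping the whole history, cleaning and timeline archival are normally left enabled; a sketch of the relevant knobs (the values shown are illustrative, not recommendations):

```python
# Retention-related settings, to be merged into the write options above.
retention_options = {
    # Cleaner: how many commits' worth of old file versions to keep on storage.
    "hoodie.cleaner.commits.retained": "10",
    # Archival: once the active timeline exceeds keep.max commits, older
    # timeline entries are archived down to keep.min commits.
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
}
```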