I just started using delta lake so my mental model might be off - I'm asking this question to validate/refute it.
My understanding of delta lake is that it only stores incremental changes to data (the "deltas"). Kind of like git - every time you make a commit, you're not storing an entire snapshot of the codebase - a commit only contains the changes you made. Similarly, I would imagine that if I create a Delta table and then I attempt to "update" the table with everything it already contains (i.e. an "empty commit") then I would not expect to see any new data created as a result of that update.
However, this is not what I observe: such an update appears to duplicate the existing table. What's going on? That doesn't seem very "incremental" to me.
(I've replaced the actual UUID values in the filenames for readability.)
# create some sample data (using the Hudi quickstart data generator just to get JSON records)
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(200))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

# write it out as a Delta table
delta_base_path = "s3://my-delta-table/test"
(df.write.format("delta")
    .mode("overwrite")
    .save(delta_base_path))
!aws s3 ls --recursive s3://my-delta-table/test
2020-10-19 14:33:21 1622 test/_delta_log/00000000000000000000.json
2020-10-19 14:33:18 0 test/_delta_log_$folder$
2020-10-19 14:33:19 10790 test/part-00000-UUID1-c000.snappy.parquet
2020-10-19 14:33:19 10795 test/part-00001-UUID2-c000.snappy.parquet
Then I overwrite using the exact same dataset:
(df.write.format("delta")
    .mode("overwrite")
    .save(delta_base_path))
!aws s3 ls --recursive s3://my-delta-table/test
2020-10-19 14:53:02 1620 test/_delta_log/00000000000000000000.json
2020-10-19 14:53:08 872 test/_delta_log/00000000000000000001.json
2020-10-19 14:53:01 0 test/_delta_log_$folder$
2020-10-19 14:53:07 6906 test/part-00000-UUID1-c000.snappy.parquet
2020-10-19 14:53:01 6906 test/part-00000-UUID3-c000.snappy.parquet
2020-10-19 14:53:07 6956 test/part-00001-UUID2-c000.snappy.parquet
2020-10-19 14:53:01 6956 test/part-00001-UUID4-c000.snappy.parquet
Now there are TWO part-00000 files and TWO part-00001 files, of the exact same size. Why doesn't Delta dedupe the existing data?
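For what it's worth, reading the table back at this point still returns only the original 200 rows, so the duplication seems to be purely at the storage level:

# read the current version of the table back
spark.read.format("delta").load(delta_base_path).count()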
That's not a completely correct understanding - Delta won't check the existing data for duplicates automatically. If you want to store only new/updated data, you need to use the merge operation, which matches incoming rows against the existing data and then lets you decide what to do with the matches - overwrite them with the new data, or just ignore them. You can find more information on the Delta Lake site, or in the 9th chapter of the Learning Spark, 2nd edition book (it's freely available from Databricks).
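Something like this (just a sketch - I'm assuming uuid as the record key because that's what the Hudi data generator in your snippet produces; use whatever uniquely identifies your rows):

from delta.tables import DeltaTable

# load the existing Delta table as the merge target
target = DeltaTable.forPath(spark, delta_base_path)

# upsert the incoming dataframe into the table, matching on the key column
(target.alias("t")
    .merge(df.alias("s"), "t.uuid = s.uuid")
    .whenMatchedUpdateAll()     # rows that already exist get overwritten with the new data
    .whenNotMatchedInsertAll()  # rows that don't exist yet get inserted
    .execute())

If you want to leave existing rows untouched and only add new ones, drop the whenMatchedUpdateAll clause.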