I am using open-source Delta Lake, where I maintain historical data. Let's say the master table has key columns (key, ID, partition-col, ...).
I join this master data set with incoming records on the (key, ID, partition-col) columns.
What has been done already:
- Ran OPTIMIZE on the Delta Lake table (compaction)
Planning to implement next:
- Run OPTIMIZE with ZORDER BY on the Delta Lake table (see the sketch below)
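For reference, a minimal sketch of both operations using the Scala API from the open-source Delta library (assuming Delta Lake 2.0+, where the optimize() API and Z-ordering are available; the path and column names are placeholders):

import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "PATH")
// Compaction: what I have already run
deltaTable.optimize().executeCompaction()
// Z-ordering on the join columns: what I am planning to run
deltaTable.optimize().executeZOrderBy("key", "id", "key_partition")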
Here, I understand the concept of Z-ordering: it co-locates related data in the same files, which enables data skipping while reading. However, as I understand it, that is mainly useful when the Spark query has a WHERE condition, for example:
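A sketch of the kind of selective read I mean (the column value is illustrative):

import org.apache.spark.sql.functions.col

// Z-ordering clusters similar "key" values into the same files, so the
// file-level min/max statistics let Spark skip files that cannot match
val filteredDf = spark.read.format("delta").load("PATH").where(col("key") === "some-key")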
My use case:
val masterDf = spark.read.format("delta").load("PATH")
val incomingDF = spark.read.parquet("PATH")
val joinDF = incomingDF.join(masterDf, Seq("key", "id", "key_partition"))
Here I just want to understand: how will Z-ordering help with data skipping in this case, when Spark has to read the full master data set into memory to join it with the incoming data?
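The only workaround I can think of is to derive an explicit filter from the incoming batch before the join, roughly like this (just a sketch of my thinking; pushing down collected keys via isin is my own assumption, and only seems practical when the incoming key set is small):

import org.apache.spark.sql.functions.col

// Collect the distinct join keys from the (presumably much smaller) incoming batch...
val incomingKeys = incomingDF.select("key").distinct().collect().map(_.getString(0))

// ...and push them down as an explicit filter, so the min/max file statistics
// written by ZORDER BY can prune master files before the join
val prunedMaster = masterDf.where(col("key").isin(incomingKeys: _*))
val prunedJoinDF = incomingDF.join(prunedMaster, Seq("key", "id", "key_partition"))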
Any help is appreciated.