I am using open-source Delta Lake, where I maintain historical data. Let's say the master table has key columns (key, ID, partition-col, ...).
I join this master data set with incoming records on the (key, ID, partition-col) columns.
What has been done already:
- Ran OPTIMIZE on the Delta Lake table (compaction)
Planning to implement next:
- Run OPTIMIZE with ZORDER BY on the Delta Lake table (see the sketch below)
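For reference, a minimal sketch of both operations using the Scala API from the open-source Delta library (assuming Delta Lake 2.0+, where the optimize() API and Z-ordering are available; the path and column names are placeholders):

import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(spark, "PATH")
// Compaction: what I have already run
deltaTable.optimize().executeCompaction()
// Z-ordering on the join columns: what I am planning to run
deltaTable.optimize().executeZOrderBy("key", "id", "key_partition")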
Here, I understand the concept of Z-ordering: it co-locates related data in the same files, which enables data skipping while reading. However, as I understand it, that is mainly useful when the Spark query has a WHERE condition, for example:
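A sketch of the kind of selective read I mean (the column value is illustrative):

import org.apache.spark.sql.functions.col

// Z-ordering clusters similar "key" values into the same files, so the
// file-level min/max statistics let Spark skip files that cannot match
val filteredDf = spark.read.format("delta").load("PATH").where(col("key") === "some-key")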
My use case:
val masterDf = spark.read.format("delta").load("PATH")
val incomingDF = spark.read.parquet("PATH")
val joinDF = incomingDF.join(masterDf, Seq("key", "id", "key_partition"))
Here I just want to understand: how will Z-ordering help with data skipping in this case, when Spark has to read the full master data set into memory to join it with the incoming data?
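The only workaround I can think of is to derive an explicit filter from the incoming batch before the join, roughly like this (just a sketch of my thinking; pushing down collected keys via isin is my own assumption, and only seems practical when the incoming key set is small):

import org.apache.spark.sql.functions.col

// Collect the distinct join keys from the (presumably much smaller) incoming batch...
val incomingKeys = incomingDF.select("key").distinct().collect().map(_.getString(0))

// ...and push them down as an explicit filter, so the min/max file statistics
// written by ZORDER BY can prune master files before the join
val prunedMaster = masterDf.where(col("key").isin(incomingKeys: _*))
val prunedJoinDF = incomingDF.join(prunedMaster, Seq("key", "id", "key_partition"))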
Any help is appreciated.