Persist and unpersist in Spark while using the same variable during the entire ETL


How will persist() and unpersist() work if all steps of my ETL process use the same variable name? For example:

df = spark.read.json("some_path.json")  # a new DataFrame, created by reading JSON for instance
df.persist()
df = df.withColumn("new_col", some_transformation)
df = df.another_transformation()
df = df.probably_even_more_transformations()
df.unpersist()
  1. What exactly will be persisted and unpersisted?
  2. Are persist() and unpersist() methods actions or transformations?
  3. Which is better: spark.catalog.clearCache() or unpersist() after every persist()?

Thank you kindly in advance.

Best answer, from VonC:

In Spark, persist and unpersist are used to manage memory and disk usage when working with DataFrames.

Your scenario:

+-------------------+    +-------------------+    +-------------------------------+
|                   |    |                   |    |                               |
| Read JSON to df   | -> | df.persist()      | -> | Transformations on df         |
|                   |    | (Stores df in     |    | (df keeps getting reassigned) |
|                   |    | memory/disk)      |    |                               |
+-------------------+    +-------------------+    +-------------------------------+
                                   |                                 |
                                   V                                 V
                          +-------------------+         +--------------------------------+
                          |                   |         |                                |
                          | Data is persisted |         | Reassignments do not affect    |
                          | during all these  |         | persisted data until unpersist |
                          | transformations   |         | is called                      |
                          |                   |         |                                |
                          +-------------------+         +--------------------------------+
                                                                      |
                                                                      V
                                                            +--------------------+
                                                            |                    |
                                                            | df.unpersist()     |
                                                            | (Frees up storage) |
                                                            |                    |
                                                            +--------------------+

So:

  1. What will be persisted and unpersisted?

    The DataFrame df that exists at the moment persist() is called is the one marked for caching. Each later reassignment of df creates a new DataFrame; the originally persisted one remains in memory/disk regardless. Note that, because df has been reassigned, the final df.unpersist() in your snippet is invoked on a derived DataFrame rather than the one that was persisted, so it is safer to keep a separate reference to the original and unpersist that.
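    A minimal sketch of that idea in PySpark (the path, column name, and DataFrame names are placeholders, not part of the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

base = spark.read.json("input.json")         # hypothetical input path
base.persist()                               # mark the original DataFrame for caching

df = base.withColumn("flag", F.lit(True))    # each reassignment creates a new DataFrame
df = df.filter(F.col("flag"))

df.count()                                   # an action; it can reuse the cached base

base.unpersist()                             # frees the persisted blocks; calling
                                             # unpersist() on the reassigned df would
                                             # target the derived DataFrame instead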

  2. Are persist() and unpersist() methods actions or transformations?

    persist() and unpersist() are neither actions nor transformations in the usual Spark sense; they are storage-management methods. persist() tells Spark to keep the DataFrame in memory or on disk, and unpersist() frees that storage. They do not trigger computation like actions, and they do not create new DataFrames like transformations.
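    This is easy to check in PySpark, since persist() returns the very same DataFrame and launches no job (a small sketch, assuming an active session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

same = df.persist()      # no computation happens here
print(same is df)        # True: persist() returns self, not a new DataFrame
print(df.storageLevel)   # the level is registered, but nothing is cached yet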

  3. Which is better: spark.catalog.clearCache() or unpersist() after every persist()?

    • unpersist(): Specifically removes the persistence of the DataFrame it is called on. It is more controlled and precise when you know which DataFrame you no longer need.
    • spark.catalog.clearCache(): Clears all cached DataFrames in the Spark session. It is more of a global reset, useful when you want to make sure all cached data is freed, but it is less precise.

In your ETL process, it is more typical to call unpersist() when you are done with a specific DataFrame, to free its resources; clearCache() can be used at the end of the process, or whenever you want to ensure a clean slate regarding all cached data.
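A sketch contrasting the two (DataFrame names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(100).persist()
df2 = spark.range(200).persist()
df1.count(); df2.count()      # actions materialize both caches

df1.unpersist()               # precise: frees only df1's blocks; df2 stays cached

spark.catalog.clearCache()    # global: drops df2 and anything else still cached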

See also "Spark – Difference between Cache and Persist?" by Naveen Nelamali, Oct. 2023:

Both cache() and persist() are optimization techniques that store intermediate computations of RDDs, DataFrames, and Datasets so they can be reused in subsequent actions. The key difference is that cache() takes no arguments and uses a fixed default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames and Datasets), while persist() lets you specify the storage level.
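A quick way to check those defaults (the printed levels are what recent Spark 3.x reports):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
rdd.cache()
print(rdd.getStorageLevel())   # MEMORY_ONLY (PySpark RDDs are stored serialized)

df = spark.range(10)
df.cache()
print(df.storageLevel)         # MEMORY_AND_DISK (deserialized, 1x replicated)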

  • The persist() method can be used without arguments, defaulting to the MEMORY_AND_DISK storage level for DataFrames, or with a StorageLevel argument to specify a different level such as MEMORY_ONLY, MEMORY_AND_DISK_SER, etc.

  • The unpersist() method removes the persisted DataFrame or Dataset from memory and storage. It takes a boolean blocking argument that, when set to True, blocks until all blocks are deleted. (Both are shown in the sketch below.)
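A sketch combining both, with an explicit storage level and a blocking unpersist:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

df.persist(StorageLevel.MEMORY_ONLY)   # override the MEMORY_AND_DISK default
df.count()                             # action materializes the cache

df.unpersist(blocking=True)            # blocking=True waits until every block is deleted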

Spark automatically monitors persist() and cache() usage and may drop persisted data that has not been used recently, following a least-recently-used (LRU) policy. Also note that caching in Spark is lazy: a DataFrame or Dataset is not actually cached until an action triggers its computation.
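That laziness is easy to observe (a sketch, assuming an active session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).persist()
# Nothing is cached yet: persist() only records the storage level.
df.count()   # the first action computes the data and stores the partitions
df.count()   # the second action reads from the cache instead of recomputing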

In the context of your ETL process, where the same variable name is used throughout, the persist() and unpersist() methods control the storage and release of the DataFrame at different stages of the process.
The DataFrame is kept in the specified storage level (default MEMORY_AND_DISK for DataFrames) during transformations until unpersist() is explicitly called.
That optimizes performance by avoiding recomputation of the persisted DataFrame when it is reused by multiple downstream actions.