I read about checkpoint and it looks great for my needs, but I couldn't find a good example of how to use it.
My questions are:
Should I specify the checkpoint dir? Is it possible to do it like this:
`df.checkpoint()`
Are there any optional params that I should be aware of?
Is there a default checkpoint dir, or must I specify one?
When I checkpoint a dataframe and then reuse it, does it automatically read the data from the dir where the files were written?
It would be great if you could share an example of using checkpoint in PySpark with some explanation. Thanks!
You should assign the checkpointed dataframe to a variable, as `checkpoint` "Returns a checkpointed version of this Dataset" (https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.checkpoint.html), so a bare `df.checkpoint()` does not change `df` itself.

The only parameter is `eager`, which dictates whether you want the checkpoint to trigger an action and be saved immediately. It is `True` by default, and you usually want to keep it that way.
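For example (a minimal sketch; `spark` is an existing `SparkSession`, `df` is any dataframe, and the checkpoint directory is assumed to have been set already, as described below):

```python
# checkpoint() returns a new, checkpointed dataframe; reassign it,
# otherwise the checkpoint is computed but never used.
df = df.checkpoint()  # eager=True by default: materialized right away

# With eager=False nothing is written yet; the checkpoint is saved
# the first time an action (count, write, ...) evaluates df.
df = df.checkpoint(eager=False)
```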
You have to set the checkpoint directory with `SparkContext.setCheckpointDir(dirName)` somewhere in your script before using checkpoints. Alternatively, if you want to save to memory instead, you can use `localCheckpoint()` instead of `checkpoint()`, but that is not reliable: in case of issues or after termination the checkpoints will be lost (though it should be faster, as it uses the caching subsystem instead of writing to disk).

And yes, the checkpointed data is read back automatically when the dataframe is reused; if you look at the history server, there should be "load data" nodes (I don't remember the exact name) at the start of blocks/queries.
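Putting it all together, a minimal runnable sketch (the app name and the `/tmp/spark-checkpoints` path are placeholders; on a real cluster use a reliable location such as HDFS or S3):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# There is no default checkpoint dir; this must be called once
# before the first call to checkpoint().
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(1_000_000)

# Truncates the lineage and writes the data to the checkpoint dir.
df = df.checkpoint()

# Reusing df now reads the checkpointed files automatically
# instead of recomputing the whole plan.
print(df.count())

# In-memory alternative: faster (uses the caching subsystem) but lost
# if executors die or the application terminates.
df2 = spark.range(100).localCheckpoint()
```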