I am currently using PySpark to perform some data cleaning for a machine learning application. The last session crashed, but I had set a checkpoint directory and checkpointed my DataFrame.
Now I have a checkpointed data directory of the form:
id-of-checkpoint-dir/
├── rdd-123/
└── rdd-456/
The files in the rdd-* subfolders appear to be binary files.
How can I read this checkpoint so I can continue my data preparation instead of running the whole process again?
I don't know how to read a checkpointed DataFrame directly, but I do know how to read a checkpointed RDD, and you can convert a DataFrame to an RDD with `df.rdd` before checkpointing it.
Reading a checkpointed RDD back relies on the internal `SparkContext._checkpointFile` method.
The definition of the `_checkpointFile` function is here: https://github.com/apache/spark/blob/dd4db21cb69a9a9c3715360673a76e6f150303d4/python/pyspark/context.py#LL1674C8-L1674C8. Its `input_deserializer` parameter may need to match the default deserializer of the `RDD` class for your Spark version. For example, in Spark 2.4.8 the deserializer is `AutoBatchedSerializer(PickleSerializer())` (https://spark.apache.org/docs/2.4.8/api/python/pyspark.html#pyspark.RDD), while in Spark 3.4.0 it is `AutoBatchedSerializer(CloudPickleSerializer())` (https://spark.apache.org/docs/3.4.0/api/python/reference/api/pyspark.RDD.html#pyspark.RDD).