Is there any good way to read the content of a Spark RDD into a Dask structure


Currently the integration between Spark structures and Dask seems cumbersome when dealing with complicated nested structures. Specifically, dumping a Spark DataFrame with nested structure so that it can be read by Dask does not seem very reliable yet, although Parquet loading is part of a large ongoing effort (fastparquet, pyarrow).

So my follow-up question: let's assume that I can live with doing a few transformations in Spark and transforming the DataFrame into an RDD that contains custom class objects. Is there a way to reliably dump the data of a Spark RDD with custom class objects and read it into a Dask collection? Obviously you can collect the RDD into a Python list, pickle it, and then read it as a normal data structure, but that removes the ability to load larger-than-memory datasets. Could something like Spark's pickling be used by Dask to load a distributed pickle?


There is 1 answer below.

BEST ANSWER

I solved this by doing the following:

Starting from a Spark RDD whose Row values held lists of custom objects, I created a version of the RDD in which each object was serialised to a string with cPickle.dumps. I then converted that RDD to a simple DataFrame with only string columns and wrote it to Parquet; Dask can read Parquet files reliably when the structure is this flat. After loading the Parquet files in Dask, I deserialised the strings with cPickle.loads to recover the original objects.
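A minimal sketch of this round trip, assuming Python 3 (where cPickle is just pickle and the pickled payload is bytes, so it is base64-encoded into a plain string column). The CustomObject class, the /tmp path, and the payload column name are illustrative, not part of the original answer:

```python
import base64
import pickle

import dask.dataframe as dd
from pyspark.sql import Row, SparkSession


class CustomObject:
    """Stand-in for the nested custom class mentioned in the question."""
    def __init__(self, name, values):
        self.name = name
        self.values = values


spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(
    [CustomObject("a", [1, 2]), CustomObject("b", [3, 4])]
)

# --- Spark side: pickle each object into a plain string column --------------
def to_row(obj):
    # base64 keeps the pickled bytes valid as a StringType value
    return Row(payload=base64.b64encode(pickle.dumps(obj)).decode("ascii"))

spark.createDataFrame(rdd.map(to_row)).write.mode("overwrite").parquet(
    "/tmp/pickled_objects.parquet"
)

# --- Dask side: read the flat Parquet file and unpickle ---------------------
ddf = dd.read_parquet("/tmp/pickled_objects.parquet")

def from_payload(payload):
    return pickle.loads(base64.b64decode(payload))

# Maps back to the original objects lazily, partition by partition,
# so the dataset never has to fit in memory at once.
objects = ddf["payload"].map(from_payload, meta=("payload", object))

roundtripped = objects.compute()   # pandas Series of CustomObject instances
print(roundtripped.iloc[0].name)   # -> "a"
```

One caveat: whatever deserialises the payloads (the Dask workers here) needs access to the same class definition, since pickle stores only a reference to the class, not its code. A binary (BinaryType) column would also work in place of the base64-encoded string column.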