I have a very large DataFrame that I partition on the values of one column, "A", using the dask.DataFrame.set_index() method. These N partitions are still too large to fit into memory when I map a function f() over the dask DataFrame dd. I would like to further split each of these N partitions into, say, m smaller DataFrames (of equal size or not), so that dd.map_partitions(f) can run in an optimal way given the resources on my cluster.
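Roughly, the setup looks like this (typed from memory on my phone, so the data and f() are just toy placeholders for the real thing; I write ddf for the dask DataFrame I called dd above, to keep the usual dask.dataframe alias free):

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for the real data: "A" is the grouping column.
pdf = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3], "x": range(6)})
ddf = dd.from_pandas(pdf, npartitions=3)

# Partition on the values of "A": rows sharing a value of "A"
# end up in the same partition (N partitions in total).
ddf = ddf.set_index("A")

def f(df):
    # Placeholder for the real per-partition function; it relies on
    # the partition boundaries following the values of "A".
    return df.assign(y=df["x"] * 2)

result = ddf.map_partitions(f).compute()
```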
I tried using the repartition() method on the partitioned dd, but I either stay stuck with the N partitions or end up with 10 partitions that mix values of A (which isn't compatible with how my function f works). One idea would be to use dd.map_partitions to split each pandas df within dd into smaller pieces and apply f piece by piece (roughly sketched below), but that seems quite convoluted. Any (better) suggestions? Thanks!

p.s.: I am on my phone and can't easily paste a template case; I will do so later if needed.
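For concreteness, this is roughly what I tried and the workaround I have in mind, continuing from the snippet above (again untested and typed from memory; f_in_chunks and m=10 are just placeholders):

```python
import pandas as pd

# What I tried: this either leaves the original N partitions untouched
# or gives ~10 partitions whose index mixes several values of "A".
ddf10 = ddf.repartition(npartitions=10)

# The convoluted idea: keep dask's N partitions, but have the mapped
# function cut each pandas partition into at most m row slices and run
# f() on one slice at a time.
def f_in_chunks(df, m=10):
    if len(df) == 0:
        return f(df)  # keeps the output schema for dask's metadata
    step = -(-len(df) // m)  # ceil(len(df) / m) rows per slice
    pieces = [f(df.iloc[i:i + step]) for i in range(0, len(df), step)]
    return pd.concat(pieces)

result = ddf.map_partitions(f_in_chunks).compute()
```

As far as I can tell, the f_in_chunks route only limits how much data f() sees at once; dask still schedules just N tasks, so it doesn't really spread the work more evenly over the cluster, which is part of why it feels convoluted to me.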