Is it possible to get the partition_id in Dask after splitting a pandas DataFrame?
For example:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])
df_parts = dd.from_pandas(df, npartitions=2)
part1 = df_parts.get_partition(0)
Of the 2 parts, part1 is the first partition. So is it possible to do something like the following:
part1.get_partition_id() => which will return 0 or 1
Or is it possible to get the partition ID by iterating through df_parts?
I'm not sure whether there is a built-in function for this, but you can achieve what you want with
enumerate(df_parts.to_delayed())
df_parts.to_delayed() produces a list of delayed objects, one per partition, so you can iterate over them and keep track of each partition's sequential number with enumerate.