I'm processing a bunch of text-based records in CSV format using Dask, which I am learning in order to work around larger-than-memory problems, and I'm trying to filter records within groups that best match a complicated set of criteria.
The best approach I've identified so far is to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:
import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')
    shortlisted_records = []
    # Use Python to locate promising looking records.
    # Some of the criteria can be cythonized; one criterion
    # revolves around whether a record is a parent or child
    # of records in shortlisted_records.
    for record in records:
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link url, link text, and css selectors across the entire group. Think: given A and B already in the shortlist and C a new record, keep all three if each is very, very promising; else prefer C over A and/or B if it is more promising than either or both; else drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my using Dask.)
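To make that rule concrete, here is a rough sketch of the shortlisting step, where promise() is a hypothetical scoring function over link url, link text and css selectors, and VERY_PROMISING a hypothetical threshold:

VERY_PROMISING = 0.9  # hypothetical cut-off

def consider(shortlist, candidate, promise):
    # Keep everything when the newcomer and the incumbents are all
    # very, very promising.
    if promise(candidate) >= VERY_PROMISING \
            and all(promise(r) >= VERY_PROMISING for r in shortlist):
        shortlist.append(candidate)
        return
    # Otherwise C displaces whichever shortlisted records it beats...
    beaten = [r for r in shortlist if promise(candidate) > promise(r)]
    if beaten:
        for r in beaten:
            shortlist.remove(r)
        shortlist.append(candidate)
    # ...and is dropped if it beats none of them.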
Seeing how Pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.
Is the above the correct way to proceed, or are there more Dask/Pandas idiomatic ways - or simply better ways - to approach this type of problem? Ideally one that allows the computations to be parallelized across a cluster? For instance by using Dask.bag or Dask.delayed and/or cytoolz or something else I might have missed while learning Python?
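For illustration, the Dask.bag variant I have in mind would look roughly like this, where parse_line and shortlist are hypothetical placeholders for the CSV parsing and the shortlisting logic above:

import dask.bag as db

# parse_line and shortlist are placeholders, not real functions.
bag = db.read_text('records-*.csv').map(parse_line)
result = (bag
          .groupby(lambda record: record['key'])  # -> (key, [records]) pairs; triggers a shuffle
          .map(lambda kv: shortlist(kv[1]))       # run the pure-Python logic per group
          .flatten())                             # back to a flat bag of kept records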
I know nothing about Dask, but I can tell you a little about passing / blocking some rows using Pandas.
It is possible to use groupby(...).apply(...) to "filter" the source DataFrame.
Example:
df.groupby('key').apply(lambda grp: grp.head(2)) returns the first 2 rows from each group. In your case, write a function to be applied to each group, which decides, based on your criteria, which rows to keep and returns only those rows. The returned rows are then concatenated, forming the result of apply.
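For instance, a minimal runnable sketch of this pattern, with a made-up score column standing in for your real criteria:

import pandas as pd

df = pd.DataFrame({
    'key':   ['a', 'a', 'a', 'b', 'b'],
    'score': [0.9, 0.2, 0.7, 0.4, 0.8],
})

def keep_promising(grp, threshold=0.5):
    # Return only the rows of this group that clear the threshold;
    # the body is where your shortlisting logic would go.
    return grp[grp['score'] > threshold]

result = df.groupby('key', group_keys=False).apply(keep_promising)

Passing group_keys=False keeps the original index instead of prepending the group key to it.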
Another possibility is to use groupby(...).filter(...), but in this case the underlying function returns a decision to pass or block each whole group of rows.
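A sketch, reusing the toy df above; note that the decision applies to whole groups at once:

# Keep only the groups whose mean score clears the threshold;
# every row of a blocked group is dropped.
surviving = df.groupby('key').filter(lambda grp: grp['score'].mean() > 0.5)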
Yet another possibility is to define a "filtering function", say
filtFun, which returns True (pass the row) or False (block the row). Then:
msk = df.apply(filtFun, axis=1)
generates a mask of which rows passed the filter, and df[msk] selects only those rows which passed. But in this case the underlying function has access only to the current row, not to the whole group of rows.
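A sketch of that approach, again with a hypothetical per-row criterion:

def filtFun(row):
    # `row` is a single row passed as a Series; no group context here.
    return row['score'] > 0.5

msk = df.apply(filtFun, axis=1)
passed = df[msk]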