Efficiently pairing matched cohorts for comparison in Spark


How can I efficiently compare matched cohorts in spark?

In Python, for each observation of the minority class in a highly imbalanced dataset, sampling k matching observations from the majority class (e.g. matching a healthy person to each sick person by age and gender) can be implemented in a fairly straightforward way, as in these earlier questions:

- Improve performance calculating a random sample matching specific conditions in pandas or python
- 1:1 stratified sampling per each group
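To make the single-machine baseline concrete, here is a minimal pure-Python sketch of 1:k matching per (age, gender) group. All names (`match_cohort`, the `id`/`age`/`gender` fields) are illustrative, not from any library:

```python
import random
from collections import defaultdict

def match_cohort(minority, majority, keys=("age", "gender"), k=1, seed=0):
    """For each minority record, sample up to k majority records sharing the
    same key values, without replacement across the whole majority pool."""
    rng = random.Random(seed)
    # Index the majority class by matching key in one pass (no cross join).
    pools = defaultdict(list)
    for row in majority:
        pools[tuple(row[key] for key in keys)].append(row)
    for pool in pools.values():
        rng.shuffle(pool)
    matches = []
    for row in minority:
        pool = pools[tuple(row[key] for key in keys)]
        take = pool[:k]   # draw k controls for this case
        del pool[:k]      # without replacement
        matches.append((row, take))
    return matches

# Toy data: two "sick" cases, five "healthy" candidates with the same key.
sick = [{"id": 1, "age": 40, "gender": "f"}, {"id": 2, "age": 40, "gender": "f"}]
healthy = [{"id": i, "age": 40, "gender": "f"} for i in range(10, 15)]
pairs = match_cohort(sick, healthy, k=1)
```

This is fine in memory, but the per-row loop over a shared pool is exactly what does not translate to a distributed setting.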

But how can this be scaled out in Spark? Naively, a self-join followed by a filter should work, but it fails in practice because far too many candidate tuples are generated before the filter is applied.

Are there smarter strategies, maybe some clever hashing such as locality-sensitive hashing (LSH)?
