Efficiently pairing matched cohorts for comparison in Spark


How can I efficiently compare matched cohorts in spark?

In Python, for each observation of the minority class in a highly imbalanced dataset, sampling k matching observations from the majority class (e.g. matching a healthy person to each sick person by age and gender) can be implemented in a fairly straightforward way, as in these earlier questions:

- Improve performance calculating a random sample matching specific conditions in pandas or python
- 1:1 stratified sampling per each group
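To make the single-machine baseline concrete, here is a minimal pure-Python sketch of 1:k matching per (age, gender) group. All names (`match_cohort`, the `id`/`age`/`gender` fields) are illustrative, not from any library:

```python
import random
from collections import defaultdict

def match_cohort(minority, majority, keys=("age", "gender"), k=1, seed=0):
    """For each minority record, sample up to k majority records sharing the
    same key values, without replacement across the whole majority pool."""
    rng = random.Random(seed)
    # Index the majority class by matching key in one pass (no cross join).
    pools = defaultdict(list)
    for row in majority:
        pools[tuple(row[key] for key in keys)].append(row)
    for pool in pools.values():
        rng.shuffle(pool)
    matches = []
    for row in minority:
        pool = pools[tuple(row[key] for key in keys)]
        take = pool[:k]   # draw k controls for this case
        del pool[:k]      # without replacement
        matches.append((row, take))
    return matches

# Toy data: two "sick" cases, five "healthy" candidates with the same key.
sick = [{"id": 1, "age": 40, "gender": "f"}, {"id": 2, "age": 40, "gender": "f"}]
healthy = [{"id": i, "age": 40, "gender": "f"} for i in range(10, 15)]
pairs = match_cohort(sick, healthy, k=1)
```

This is fine in memory, but the per-row loop over a shared pool is exactly what does not translate to a distributed setting.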

But how can this be scaled out in Spark? Naively, a self-join followed by a filter should work, but it fails in practice because far too many candidate tuples are generated before the filter is applied.

Are there smarter strategies, maybe some clever hashing such as locality-sensitive hashing (LSH)?
