In a related question I had some good help generating possible combinations of a set of variables.
Assume the output of that process is
combo_tbl <- sdf_copy_to(sc = sc,
                         x = data.frame(
                           combo_id = c("combo1", "combo2", "combo3"),
                           selection_1 = c("Alice", "Alice", "Bob"),
                           selection_2 = c("Bob", "Cat", "Cat")
                         ),
                         name = "combo_table")
This is a tbl reference to a Spark data frame with a combo_id column plus two selection columns, each row representing a selection of 2 values from a list of 3 (Alice, Bob, Cat), which could be imagined as 3 members of a household.
There is also a Spark data frame with a binary encoding: 1 if that member of the household was in the house that day, and 0 if they were not.
obs_tbl <- sdf_copy_to(sc = sc,
                       x = data.frame(
                         obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                         Alice = c(1, 1, 0, 1, 0, 1, 0),
                         Bob = c(1, 1, 1, 0, 0, 0, 0),
                         Cat = c(0, 1, 1, 1, 1, 0, 0)
                       ),
                       name = "obs_table")
I can relatively simply check if a specific pair were present in the house with this code:
obs_tbl %>%
  group_by(Alice, Bob) %>%
  summarise(n())
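For a single, hard-coded pair, the "both present" restriction can be expressed with a filter rather than a group_by. A minimal sketch (still hard-coding Alice and Bob, so it has the same manual-entry flaw):

```r
library(dplyr)

obs_tbl %>%
  filter(Alice == 1, Bob == 1) %>%
  summarise(n = n())
# In the sample data Alice and Bob are both present on Mon and Tue,
# so this should count 2 days.
```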
However, there are two flaws with this approach:

- Each pair has to be entered manually, even though every combination I need to check is already in combo_tbl.
- The output gives the intersection of every combination of values, i.e. I get the count of days where both Alice == 1 and Bob == 1, but also where Alice == 1 and Bob == 0, Alice == 0 and Bob == 1, etc.
The ideal end result would be an output like so:
Alice | Bob | 2
Alice | Cat | 2
Bob | Cat | 2
i.e. The count of co-habitation days per pair.
A perfect solution would also scale to larger combinations: each combo_id may have 3 or more selections, drawn from a larger list than the one given.
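One shape-agnostic sketch that I think would scale to any number of selections: reshape both tables to long format, join on the member name, and keep only the days on which every member of a combination was present. This assumes your sparklyr version supports the tidyr verbs on tbl_spark (pivot_longer needs a reasonably recent sparklyr), and that the windowed sum inside filter() translates to Spark SQL:

```r
library(dplyr)
library(tidyr)

# One row per (combo, member)
combo_long <- combo_tbl %>%
  pivot_longer(cols = starts_with("selection_"),
               names_to = "slot", values_to = "member")

# One row per (day, member, present flag)
obs_long <- obs_tbl %>%
  pivot_longer(cols = -obs_day,
               names_to = "member", values_to = "present")

combo_long %>%
  inner_join(obs_long, by = "member") %>%
  group_by(combo_id, obs_day) %>%
  # keep a day only if every member of the combo was present that day
  filter(sum(present, na.rm = TRUE) == n()) %>%
  group_by(combo_id) %>%
  summarise(days_together = n_distinct(obs_day))
```

Because nothing here hard-codes the number of selection columns, a combo with 3 or more members should work unchanged.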
So, is it possible in sparklyr to pass a vector of pairs to iterate through? And how do I count only the rows where both of my selections are present: instead of a vectorised group_by, should I use a vectorised filter?
I've read about quosures and standard evaluation in the tidyverse. Is that the solution to this if running locally? And if so is this supported by spark?
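For what it's worth, a hedged sketch of the tidy-eval route, building the filter conditions programmatically from a character vector of column names (count_cohab is a hypothetical helper name, not an existing function). rlang::syms() turns the names into symbols, which are spliced into filter() with !!!; my understanding is that dbplyr translates the resulting quosures for Spark, but I have not verified this on a cluster:

```r
library(dplyr)
library(rlang)
library(purrr)

count_cohab <- function(obs, members) {
  # one `name == 1` condition per member, e.g. Alice == 1, Bob == 1
  conds <- map(syms(members), ~ quo(!!.x == 1))
  obs %>%
    filter(!!!conds) %>%
    summarise(n = n())
}

count_cohab(obs_tbl, c("Alice", "Bob"))
```

This at least removes the hard-coding: the combo_tbl rows could be collected and iterated over, calling count_cohab once per combo_id.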
For reference, I have a relatively similar solution using data.table that can be run on a single-machine, non-spark context. Some pseudo code:
combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
                    get(tolower(selection_2)) == "1",
                  .N],
         by = combo_id]
This nested process effectively splits each combination into its own sub-table (by = combo_id), filters that sub-table to the rows where selection_1 and selection_2 are both 1, applies .N to count the remaining rows, and then aggregates the output.