Count by vector of multiple columns in sparklyr


In a related question I got some good help generating possible combinations of a set of variables.

Assume the output of that process is

combo_tbl <- sdf_copy_to(sc = sc,
                     x = data.frame(
                       combo_id = c("combo1", "combo2", "combo3"),
                       selection_1 = c("Alice", "Alice", "Bob"),
                       selection_2 = c("Bob", "Cat", "Cat")
                     ),
                     name = "combo_table")

This is a tbl reference to a Spark data frame with a combo_id column and two selection columns, each row representing a selection of 2 values from a list of 3 (Alice, Bob, Cat), which could be imagined as 3 members of a household.

There is also a Spark data frame with a binary encoding: 1 on days the member was in the house, 0 on days they were not.

obs_tbl <- sdf_copy_to(sc = sc,
                   x = data.frame(
                     obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                     Alice = c(1, 1, 0, 1, 0, 1, 0),
                     Bob = c(1, 1, 1, 0, 0, 0, 0),
                     Cat = c(0, 1, 1, 1, 1, 0, 0)
                   ),
                   name = "obs_table")

I can relatively simply check if a specific pair were present in the house with this code:

obs_tbl %>%
  group_by(Alice, Bob) %>%
  summarise(n())

However, there are 2 flaws with this approach.

  1. Each pair is being put in manually, when every combination I need to check is already in combo_tbl.
  2. The output is a full cross-tabulation of every combination of values, i.e. I get the count of rows where both Alice == 1 and Bob == 1, but also where Alice == 1 and Bob == 0, Alice == 0 and Bob == 1, etc.
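For a single pair, the second flaw can be avoided by filtering instead of grouping. A minimal sketch, assuming the obs_tbl defined above and an open Spark connection sc (dplyr translates the filter to Spark SQL):

```r
library(dplyr)

# Keep only the days where both members were present,
# rather than cross-tabulating every 0/1 combination
obs_tbl %>%
  filter(Alice == 1, Bob == 1) %>%
  summarise(n = n())
```

This still hard-codes one pair, so it doesn't address the first flaw.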

The ideal end result would be an output like so:

Alice | Bob | 2
Alice | Cat | 2
Bob   | Cat | 2

i.e. the count of co-habitation days per pair.

A perfect solution would be simple to modify when the number of selections per combination increases, i.e. each combo_id may have 3 or more selections, drawn from a larger list than the one given.

So, is it possible in sparklyr to pass a vector of pairs to iterate through? How do I check only the cases where both of my selections are present? Instead of a vectorised group_by, should I use a vectorised filter?

I've read about quosures and standard evaluation in the tidyverse. Is that the solution to this if running locally? And if so, is it supported by Spark?
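One direction that tidy evaluation suggests (a sketch only, not tested against a live cluster): collect the small combo_tbl to the driver, then build each filter condition from the selection names with rlang::syms() and splice them in with !!!, which dplyr should translate for Spark. The helper count_combo is my own invented name, and this assumes the combo_tbl and obs_tbl defined above:

```r
library(dplyr)
library(rlang)
library(purrr)

# The combination table is small, so bring it to the driver
combos <- collect(combo_tbl)

# Count the days on which every named member was present;
# works for 2, 3, or more selections per combination
count_combo <- function(members) {
  conds <- map(syms(members), ~ quo(!!.x == 1))
  obs_tbl %>%
    filter(!!!conds) %>%   # splice one condition per member
    tally() %>%
    pull(n) %>%
    as.numeric()
}

combos$n <- map2_dbl(combos$selection_1, combos$selection_2,
                     ~ count_combo(c(.x, .y)))
```

This runs one Spark job per combination, which may be slow for many combos; a join-based reshape of obs_tbl into long form might scale better.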

For reference, I have a relatively similar solution using data.table that can be run in a single-machine, non-Spark context. Some pseudo code:

combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
                  get(tolower(selection_2)) == "1", .N],
         by = combo_id]

This nested call effectively splits each combination into its own sub-table (by = combo_id), filters that sub-table to rows where selection_1 and selection_2 are both 1, counts the remaining rows with .N, and aggregates the output.
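Made concrete with the example data from above (column names lower-cased to match the tolower() calls), the same idea runs locally as:

```r
library(data.table)

obs_dt <- data.table(
  obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  alice   = c(1, 1, 0, 1, 0, 1, 0),
  bob     = c(1, 1, 1, 0, 0, 0, 0),
  cat     = c(0, 1, 1, 1, 1, 0, 0)
)

combo_dt <- data.table(
  combo_id    = c("combo1", "combo2", "combo3"),
  selection_1 = c("Alice", "Alice", "Bob"),
  selection_2 = c("Bob", "Cat", "Cat")
)

# For each combo, filter obs_dt to the days where both
# selections were present, then count those rows with .N
combo_dt[, obs_dt[get(tolower(selection_1)) == 1 &
                  get(tolower(selection_2)) == 1, .N],
         by = combo_id]
```

With this data each pair co-habited on 2 days, matching the ideal output above.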
