This question is a follow-up to this thread.
I'd like to perform three actions on a disk.frame:
- Count the distinct values of the field `id`, grouped by two columns (`key_a` and `key_b`)
- Count the distinct values of the field `id`, grouped by the first of the two columns (`key_a`)
- Add a column with the ratio of the distinct values for the first column to the distinct values across both columns
This is my code:

```r
library(disk.frame)
library(dplyr)

# toy stand-in for the real data, converted to a disk.frame
# so that the chunk_* verbs below work
my_df <-
  as.disk.frame(data.frame(
    key_a = rep(letters, 384),
    key_b = rep(rev(letters), 384),
    id = sample(1:10^6, 9984)
  ))

my_df %>%
  select(key_a, key_b, id) %>%
  chunk_group_by(key_a, key_b) %>%
  # stage one: distinct counts within each chunk
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two: combine the per-chunk results
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
```
My data is in the format of a disk.frame, not a data.frame, and it has 100M rows and 8 columns.

I'm following the two-step instructions described in this documentation.

I'm concerned that `collect` will crash my machine, since it brings everything into RAM.

Do I have to use `collect` in order to use dplyr group-bys in disk.frame?
You should always use `srckeep` to load only the columns you need into memory. `collect` will only bring the results of computing `chunk_group_by` and `chunk_summarize` into RAM, so it shouldn't crash your machine.
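For instance, a minimal sketch (`my_diskframe` is a hypothetical stand-in for your 100M-row disk.frame):

```r
library(disk.frame)
library(dplyr)

# srckeep restricts which columns are read from disk, so each chunk
# carries only the three columns the query actually needs
my_diskframe %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  chunk_group_by(key_a, key_b) %>%
  chunk_summarize(count = n_distinct(id)) %>%
  collect
```

Note that `srckeep` differs from `select`: `select` drops columns after a chunk has already been loaded, while `srckeep` prevents the unused columns from being read off disk in the first place.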
You must use `collect`, just like in other systems such as Spark. But if you are computing `n_distinct`, that can be done in one stage anyway.
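Something like this sketch, assuming a disk.frame version with one-stage `group_by` support (`n_distinct` is among the aggregations it can compute this way):

```r
library(disk.frame)
library(dplyr)

# one-stage group-by: disk.frame runs the chunk-wise pass and the final
# combine internally, so no chunk_* verbs or manual second stage are needed
my_diskframe %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  group_by(key_a, key_b) %>%
  summarize(count = n_distinct(id)) %>%
  collect
```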
If you are really concerned about RAM usage, you can reduce the number of workers to 1.
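For example:

```r
library(disk.frame)

# with a single worker only one chunk is processed at a time, so peak
# memory use is roughly one chunk's worth of data rather than one per worker
setup_disk.frame(workers = 1)
```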