Apologies if this has been answered somewhere else, but I haven't been able to find a solution that works.
I have some data, spread across 4232 parquet files, that aggregates into roughly 6745697 rows and 30486 columns once the schema is fully parsed via `open_dataset()`.
The rows need to be collapsed into participant-level summaries, leaving roughly 3k unique participants after grouping.
Some columns are demographic data (mixtures of strings, integers, and booleans) that is mostly repeated within a participant. As such, I decided to just collapse each of these variables to its most frequent value.
The workflow looks something like this:
```r
mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
```
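For reference, the helper works as expected on plain vectors (ties go to whichever value appears first):

```r
# Helper repeated here so the snippet is self-contained
mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode(c("a", "b", "b", "c"))  # "b"
mode(c(1, 1, 2, 2))          # 1 (tie; first value encountered wins)
```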
```r
DOI <- open_dataset(sources = "<path_to_directory_of_parquet_files>", unify_schemas = TRUE)

Demographics <- DOI |>
  filter(treatment_flag == 0, comparator_flag == 0) |>
  group_by(Participant) |>
  summarize(across(DOI$schema$names[!grepl("VOI_1|VOI_2|Participant", DOI$schema$names)], mode)) |>
  collect() |>
  as.data.frame()
```
But this throws the following error:
```
Error: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) > :
  Expression mode(Dataset) is not an aggregate expression or is not supported in Arrow
Call collect() first to pull data into R.
```
Based on other threads, I assume this is due to the limited set of functions that can be translated by arrow:
- How to use custom function with Apache Arrow in R?
- How to write anonymous functions in R arrow across
Is there a way this can be accomplished via `map`, `apply`, or `summarize` without first pulling the data into R via `collect()` that I am missing? Or is it locked to the functions arrow can utilize?
Thank you in advance.
I don't know when window functions will be supported or when `row_number()` will be usable. Until then, here's a hack that will get you the numbers you want. How efficient and acceptable it is depends on the size of your data, the number of discrete values within each variable, and how much you're willing to pull in at a time. Up front: instead of pulling in 1 row per group, we're pulling in up to `max(n_distinct(your_variables))` rows per group.

I'll use `mtcars` and, grouping by `cyl`, I'll count the other discrete variables `vs`, `am`, `gear`, and `carb`. Since you're determining the names by `grep`ing on `DOI$schema$names`, I'll do the same. I'll first confirm what we expect by `collect`ing early and computing the modes in plain R.

My proposal iterates over each of the variables you need, counts them (per group), and pulls the counts themselves, after which we can filter on the max-count (most-frequent).
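Here's a sketch of both steps on `mtcars`. Plain dplyr is shown; against an arrow dataset the counting pipeline is the same verbs except that you `collect()` right after `summarize()` (and the exact grouping syntax arrow accepts, e.g. `across()` vs. `!!sym(v)`, may vary by version — that detail is an assumption on my part):

```r
library(dplyr)

df <- mtcars  # stand-in; with arrow this would be your open_dataset() handle

# Pick the variables by grep, as in the question
vars <- grep("^(vs|am|gear|carb)$", names(df), value = TRUE)

# Step 1 (sanity check): collect early and compute modes in plain R
mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
expected <- df |>
  group_by(cyl) |>
  summarize(across(all_of(vars), mode))

# Step 2 (arrow-friendly): per variable, count the levels per group (n() is
# an aggregation arrow can push down, unlike mode()), pull only the counts,
# then keep the most frequent level per group
modes <- lapply(vars, function(v) {
  df |>
    group_by(cyl, across(all_of(v))) |>
    summarize(n = n(), .groups = "drop") |>  # with arrow, collect() here
    group_by(cyl) |>
    slice_max(n, n = 1) |>
    ungroup() |>
    select(-n)
})

# One column of modes per variable, joined back on the grouping key
result <- Reduce(function(x, y) left_join(x, y, by = "cyl"), modes)

all.equal(as.data.frame(expected), as.data.frame(result))  # the two should agree
```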
You may want/need `slice_max(..., with_ties = FALSE)`, depending on your tolerance for either (a) more than 1 row per group, or (b) not knowing there are multiple equal-count levels.
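To make that trade-off concrete (toy data, names made up):

```r
library(dplyr)

# Two levels tied at the top count within the single group
counts <- tibble::tibble(
  g     = c(1, 1, 1),
  level = c("a", "b", "c"),
  n     = c(2, 2, 1)
)

counts |> group_by(g) |> slice_max(n, n = 1)
# returns both tied rows ("a" and "b")

counts |> group_by(g) |> slice_max(n, n = 1, with_ties = FALSE)
# returns exactly one row, silently dropping the other tied level
```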