I use the memoise package to cache queries to an arrow dataset, but I sometimes get hash "collisions", so the wrong values are returned.
I have isolated the problem and replicated it in the MWE below.
The issue is that rlang::hash() (which memoise uses), applied to an arrow query that first filters and then summarises, does not depend on the filter.
My question is: is this something that I can fix (because I used it wrongly), or is this a bug in one of the packages (I am happy to create an issue)? If it is a bug, should it be reported to arrow, rlang, or even R6?
MWE
For example, all three queries below have the same hash, although the hashes should differ (and, looking at the results, the results obviously do).
library(arrow)
library(dplyr)
ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars, ds_file)
ds <- open_dataset(ds_file)
# 1) Create three different queries =======
# Query 1 with mpg > 25 ----
query1 <- ds |>
  filter(mpg > 25) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg))
# Query 2 with mpg > 0 ----
query2 <- ds |>
  filter(mpg > 0) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg))
# Query 3 with filter on cyl ----
query3 <- ds |>
  filter(cyl == 4) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg))
# 2) Let's compare the hashes: the main issue ======
rlang::hash(query1)
#> [1] "f505339fd65df6ef53728fcc4b0e55f7"
rlang::hash(query2)
#> [1] "f505339fd65df6ef53728fcc4b0e55f7"
rlang::hash(query3)
#> [1] "f505339fd65df6ef53728fcc4b0e55f7"
# ERROR HERE: they should be different as the queries are different!
# 3) Let's also compare the results: clearly different =====
query1 |> collect()
#> # A tibble: 2 × 3
#> vs n mean_mpg
#> <dbl> <int> <dbl>
#> 1 1 5 30.9
#> 2 0 1 26
query2 |> collect()
#> # A tibble: 2 × 3
#> vs n mean_mpg
#> <dbl> <int> <dbl>
#> 1 0 18 16.6
#> 2 1 14 24.6
query3 |> collect()
#> # A tibble: 2 × 3
#> vs n mean_mpg
#> <dbl> <int> <dbl>
#> 1 1 10 26.7
#> 2 0 1 26
Note that the same error happens when I use digest.
When I print the queries, they are printed as if they were identical... (I have reported this printing issue to arrow as a bug.)
query1
#> FileSystemDataset (query)
#> vs: double
#> n: int32
#> mean_mpg: double
#>
#> See $.data for the source Arrow object
query2
#> FileSystemDataset (query)
#> vs: double
#> n: int32
#> mean_mpg: double
#>
#> See $.data for the source Arrow object
query3
#> FileSystemDataset (query)
#> vs: double
#> n: int32
#> mean_mpg: double
#>
#> See $.data for the source Arrow object
But when I inspect the $.data element of each query, I see that they are in fact different:
query1$.data
#> FileSystemDataset (query)
#> mpg: double
#> vs: double
#>
#> * Aggregations:
#> n: sum(1)
#> mean_mpg: mean(mpg)
#> * Filter: (mpg > 25) #<=========
#> * Grouped by vs
#> See $.data for the source Arrow object
query2$.data
#> FileSystemDataset (query)
#> mpg: double
#> vs: double
#>
#> * Aggregations:
#> n: sum(1)
#> mean_mpg: mean(mpg)
#> * Filter: (mpg > 0) #<=========
#> * Grouped by vs
#> See $.data for the source Arrow object
query3$.data
#> FileSystemDataset (query)
#> mpg: double
#> vs: double
#>
#> * Aggregations:
#> n: sum(1)
#> mean_mpg: mean(mpg)
#> * Filter: (cyl == 4) #<=========
#> * Grouped by vs
#> See $.data for the source Arrow object
But again, rlang::hash() cannot find a difference:
rlang::hash(query1$.data)
#> [1] "b7f743cd635f7dc06356b827a6974df8"
rlang::hash(query2$.data)
#> [1] "b7f743cd635f7dc06356b827a6974df8"
rlang::hash(query3$.data)
#> [1] "b7f743cd635f7dc06356b827a6974df8"
If it helps, the query objects are R6 objects with class arrow_dplyr_query (see also its source code in apache/arrow).
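As a rough cross-check of the observation above, the printed form of $.data can be hashed instead of the object itself. This is only a sketch: hash_printed() and make_query() are hypothetical helpers, not part of arrow or rlang, and it assumes that print() on $.data keeps showing the Filter line as in the output above.

```r
library(arrow)
library(dplyr)

ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars, ds_file)
ds <- open_dataset(ds_file)

# Hypothetical helper: hash the text that print() shows for the inner query
# instead of hashing the query object itself.
hash_printed <- function(query) {
  rlang::hash(capture.output(print(query$.data)))
}

# Hypothetical helper: the group_by/summarise steps shared by all three queries.
make_query <- function(filtered) {
  filtered |>
    group_by(vs) |>
    summarise(n = n(), mean_mpg = mean(mpg))
}

h1 <- hash_printed(make_query(filter(ds, mpg > 25)))
h2 <- hash_printed(make_query(filter(ds, mpg > 0)))
h3 <- hash_printed(make_query(filter(ds, cyl == 4)))

# The three hashes should now differ, because the captured text contains the filter.
c(h1, h2, h3)
```

Since the captured text includes the filter expression, the three hashes are expected to differ, which suggests the filter is simply not visible to rlang::hash() when it is given the object.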
Memoise use case
For completeness' sake and to put the problem into perspective, I use the following to cache the results; the two calls should return different values (see above) but don't!
library(arrow)
library(memoise)
library(dplyr)
ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars, ds_file)
ds <- open_dataset(ds_file)
collect_cached <- memoise::memoise(dplyr::collect,
                                  cache = cachem::cache_mem(logfile = stdout()))
# Query 1 with mpg > 25 ----
ds |>
  filter(mpg > 25) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect_cached()
#> [2022-11-25 09:16:28.586] cache_mem get: key "2edd901226498414056dcc54eaa49415"
#> [2022-11-25 09:16:28.586] cache_mem get: key "2edd901226498414056dcc54eaa49415" is missing
#> [2022-11-25 09:16:28.705] cache_mem set: key "2edd901226498414056dcc54eaa49415"
#> [2022-11-25 09:16:28.706] cache_mem prune
#> # A tibble: 2 × 3
#> vs n mean_mpg
#> <dbl> <int> <dbl>
#> 1 1 5 30.9
#> 2 0 1 26
# Query 2 with mpg > 0 ----
# this is wrongly matched to the first query and returns wrong results...
ds |>
  filter(mpg > 0) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect_cached()
#> [2022-11-25 09:16:28.820] cache_mem get: key "2edd901226498414056dcc54eaa49415"
#> [2022-11-25 09:16:28.820] cache_mem get: key "2edd901226498414056dcc54eaa49415" found #< ERROR HERE! as the hash is identical
#> # A tibble: 2 × 3
#> vs n mean_mpg
#> <dbl> <int> <dbl>
#> 1 1 5 30.9
#> 2 0 1 26
Note that we get the same result although the queries are different (yet their hashes are identical, hence this question).
This is very much a hack ... but perhaps it'll be enough? I was able to find something unique enough about the intermediate "query", including its filter components, by capturing the output from show_query and using that as the hash= argument to memoise.
The object passed to hashfun is a list, where the first element appears to be a checksum or salt of a sort (we'll ignore it), and all remaining elements (named or otherwise) are determined by the formals of the cached function. In our case, since we're caching collect, it accepts x= (which we see) and ...= (which we don't).
Just replacing x$x with the return value of show_query(x$x) didn't seem to work, since there appear to be things only in the printed form that are not readily available to rlang::hash, so I chose capture.output.
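The approach described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the exact code originally posted: hash_query_text() is a hypothetical name, and it assumes the installed arrow version provides a show_query() method for these queries and that memoise passes the first argument of collect() as x$x.

```r
library(arrow)
library(dplyr)
library(memoise)

ds_file <- file.path(tempdir(), "mtcars")
write_dataset(mtcars, ds_file)
ds <- open_dataset(ds_file)

# Hypothetical hash function: `x` is the list memoise builds for each call.
# Replace the query object in x$x with the captured text of its query plan,
# then hash the modified list as usual.
hash_query_text <- function(x) {
  x$x <- capture.output(show_query(x$x))
  rlang::hash(x)
}

collect_cached <- memoise::memoise(
  dplyr::collect,
  cache = cachem::cache_mem(),
  hash = hash_query_text
)

# Two queries that differ only in their filter.
res1 <- ds |>
  filter(mpg > 25) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect_cached()

res2 <- ds |>
  filter(mpg > 0) |>
  group_by(vs) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  collect_cached()
```

With this hash function the second call should miss the cache and be computed on its own, so res1 and res2 should match the uncached results shown in the question. The cost is that the key depends on the printed plan, which may change between arrow versions.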