I have several large data.table objects saved to disk as .rds files, and I'm looking for ways to reduce the time required to import the data, so I have been looking into the feather package. Part of my pipeline checks for any changes in the input data set based on digest::sha1(). As the following example shows, a data.table saved as an .rds file can be read back in and the digest::sha1() values are equal. However, the same data saved as a .feather file, read in, and converted back to the same data.table results in a different sha1 hash. I'm confused because all.equal() and identical() both return TRUE, yet the hashes differ.
Why is this happening? Is it possible to get the same hash with this type of workflow? How can I easily check whether the data has changed if I can't rely on the hashes? (The real data is several million rows by several hundred columns.)
library(data.table)
library(feather)
# build an example data set
set.seed(42)
original_data_table <-
  data.table(
    x = rnorm(100),
    y = factor(sample(1:3, size = 100, replace = TRUE), levels = 1:3, labels = c("lvl1", "lvl2", "lvl3")),
    id = paste0("subject_", 1:100)
  )
data.table::setkey(original_data_table, id)
# write data as rds
original_data_table_rds <- tempfile()
original_data_table_feather <- tempfile()
saveRDS(object = original_data_table, file = original_data_table_rds)
feather::write_feather(x = original_data_table, path = original_data_table_feather)
# read in the data objects
from_rds <- readRDS(original_data_table_rds)
from_rds_setted <- data.table::setDT(readRDS(original_data_table_rds))
from_feather <- feather::read_feather(original_data_table_feather)
# translate from_feather from tibble to data.table
data.table::setDT(from_feather)
data.table::setkey(from_feather, id)
# check that the objects are equal, and even identical, to the original
all.equal(from_rds, original_data_table) # TRUE
#> [1] TRUE
all.equal(from_rds_setted, original_data_table) # TRUE
#> [1] TRUE
all.equal(from_feather, original_data_table) # TRUE
#> [1] TRUE
identical(from_rds, original_data_table) # FALSE
#> [1] FALSE
identical(from_rds_setted, original_data_table) # TRUE
#> [1] TRUE
identical(from_feather, original_data_table) # TRUE
#> [1] TRUE
digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Denver
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] feather_0.3.5 data.table_1.14.8
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.33 utf8_1.2.3 fastmap_1.1.1 xfun_0.39
#> [5] magrittr_2.0.3 glue_1.6.2 tibble_3.2.1 knitr_1.43
#> [9] pkgconfig_2.0.3 htmltools_0.5.5 rmarkdown_2.23 lifecycle_1.0.3
#> [13] cli_3.6.1 fansi_1.0.4 vctrs_0.6.3 reprex_2.0.2
#> [17] withr_2.5.0 compiler_4.3.1 tools_4.3.1 hms_1.1.3
#> [21] pillar_1.9.0 evaluate_0.21 Rcpp_1.0.11 yaml_2.3.7
#> [25] rlang_1.1.1 fs_1.6.2
Created on 2023-07-16 with reprex v2.0.2
Thanks to @Hobo's comment on the question, I was able to find the solution. @Hobo was correct: it had to do with the attributes. The attributes of original_data_table and from_feather are, as a set, identical, but the order in which the list elements are stored differs.
By setting attrib.as.set = FALSE in the identical() call we can see that there is a difference. Looking at the names of the attributes of original_data_table and from_feather, we see that the first and third elements have traded positions between the two objects. After setting the order of the attribute elements of from_feather to match the order in original_data_table, the digest::sha1() values are as expected.
Created on 2023-07-17 with reprex v2.0.2
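The mechanism described in this answer can be illustrated with a small base R sketch. The objects reference and candidate are hypothetical stand-ins for original_data_table and from_feather, and serialize() is used only as a dependency-free proxy for hashing; this is not the original reprex, just a minimal illustration of how attribute order can differ while identical() still reports TRUE.

```r
# Hypothetical stand-ins for original_data_table and from_feather:
# the same values and the same attributes, attached in a different order.
reference <- c(1.5, 2.5, 3.5)
attr(reference, "units") <- "cm"
attr(reference, "label") <- "height"

candidate <- c(1.5, 2.5, 3.5)
attr(candidate, "label") <- "height"
attr(candidate, "units") <- "cm"

# By default identical() compares attributes as an unordered set.
same_as_set <- identical(reference, candidate)

# With attrib.as.set = FALSE the order of the attributes matters too.
same_in_order <- identical(reference, candidate, attrib.as.set = FALSE)

# The serialized bytes differ as well, so anything computed from a
# serialization (for example a hash) can differ between the two objects.
same_bytes <- identical(serialize(reference, NULL), serialize(candidate, NULL))

# Reorder the attributes of candidate to follow the order used by reference.
attributes(candidate) <- attributes(candidate)[names(attributes(reference))]
same_in_order_after <- identical(reference, candidate, attrib.as.set = FALSE)

print(c(same_as_set, same_in_order, same_bytes, same_in_order_after))
```

For the objects in the question, the same reordering line should apply with from_feather in place of candidate and original_data_table in place of reference. Comparing names(attributes(...)) for both objects first shows which elements are out of order.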