identical(X1, X2) is TRUE, but digest::sha1(X1) != digest::sha1(X2)


I have several large data.table objects saved to disk as .rds files, and I'm looking for ways to reduce the time required to import them, so I was looking into the feather package. Part of my pipeline checks for changes in the input data set based on digest::sha1(). As the following example shows, a data.table saved as .rds can be read back in and its digest::sha1() hash matches the original. However, the same data saved as a .feather file, read back in, and converted to the same data.table produces a different sha1 hash. I'm confused because all.equal() and identical() both return TRUE, yet the hashes differ.

Why is this happening? Is it possible to get the same hash with this type of workflow? How can I easily check whether the data has changed if I can't rely on the hashes? (The real data is several million rows by several hundred columns.)

library(data.table)
library(feather)

# build an example data set
set.seed(42)

original_data_table <-
  data.table(
             x = rnorm(100),
             y = factor(sample(1:3, size = 100, replace = TRUE), levels = 1:3, labels = c("lvl1", "lvl2", "lvl3")),
             id = paste0("subject_", 1:100)
  )

data.table::setkey(original_data_table, id)

# write data as rds
original_data_table_rds <- tempfile()
original_data_table_feather <- tempfile()
saveRDS(object = original_data_table, file = original_data_table_rds)
feather::write_feather(x = original_data_table, path = original_data_table_feather)

# read in the data objects
from_rds        <- readRDS(original_data_table_rds)
from_rds_setted <- data.table::setDT(readRDS(original_data_table_rds))
from_feather    <- feather::read_feather(original_data_table_feather)

# translate from_feather from tibble to data.table
data.table::setDT(from_feather)
data.table::setkey(from_feather, id)

# check that objects are equal, and even identical, to the original
all.equal(from_rds, original_data_table)         # TRUE
#> [1] TRUE
all.equal(from_rds_setted, original_data_table)  # TRUE
#> [1] TRUE
all.equal(from_feather, original_data_table)     # TRUE
#> [1] TRUE

identical(from_rds, original_data_table)         # FALSE
#> [1] FALSE
identical(from_rds_setted, original_data_table)  # TRUE
#> [1] TRUE
identical(from_feather, original_data_table)     # TRUE
#> [1] TRUE

digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Denver
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] feather_0.3.5     data.table_1.14.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.33   utf8_1.2.3      fastmap_1.1.1   xfun_0.39      
#>  [5] magrittr_2.0.3  glue_1.6.2      tibble_3.2.1    knitr_1.43     
#>  [9] pkgconfig_2.0.3 htmltools_0.5.5 rmarkdown_2.23  lifecycle_1.0.3
#> [13] cli_3.6.1       fansi_1.0.4     vctrs_0.6.3     reprex_2.0.2   
#> [17] withr_2.5.0     compiler_4.3.1  tools_4.3.1     hms_1.1.3      
#> [21] pillar_1.9.0    evaluate_0.21   Rcpp_1.0.11     yaml_2.3.7     
#> [25] rlang_1.1.1     fs_1.6.2

Created on 2023-07-16 with reprex v2.0.2

Peter On

Thanks to @Hobo's comment on the question, I was able to find the solution. @Hobo was correct: it had to do with the attributes. The attributes of original_data_table and from_feather are identical as a set, but the order in which the attribute list elements are stored differs.

By setting attrib.as.set = FALSE in the identical() call we can see that there is a difference. Comparing the attribute names of original_data_table and from_feather shows that the first and third elements have swapped positions between the two objects. After reordering the attributes of from_feather to match the order in original_data_table, the digest::sha1() values are as expected.
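The same effect can be reproduced without data.table or feather. The following is a minimal base R sketch, separate from the reprex; the objects `a` and `b` and the attribute names `foo` and `bar` are made up for illustration. It shows that identical() treats attributes as an unordered set by default, while the serialized bytes of an object (which is what digest::digest() hashes with its default serialize = TRUE) depend on the order in which the attributes are stored.

```r
# Two numeric vectors carrying the same attributes, attached in a different order.
# `a`, `b`, `foo`, and `bar` are illustrative names, not from the question.
a <- structure(c(1, 2, 3), foo = "x", bar = "y")
b <- structure(c(1, 2, 3), bar = "y", foo = "x")

# The stored order differs between the two objects.
names(attributes(a))
#> [1] "foo" "bar"
names(attributes(b))
#> [1] "bar" "foo"

# By default identical() compares attributes as an unordered set.
identical(a, b)
#> [1] TRUE

# With attrib.as.set = FALSE the order of the attributes is compared as well.
identical(a, b, attrib.as.set = FALSE)
#> [1] FALSE

# serialize() writes the attributes in their stored order, so the raw bytes
# (and therefore any hash computed from them) differ.
same_bytes <- identical(serialize(a, NULL), serialize(b, NULL))
same_bytes
#> [1] FALSE
```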

digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"

identical(from_feather, original_data_table, attrib.as.set = FALSE)
#> [1] FALSE
attributes(original_data_table) |> names()
#> [1] "names"             "row.names"         "class"            
#> [4] ".internal.selfref" "sorted"
attributes(from_feather) |> names()
#> [1] "class"             "row.names"         "names"            
#> [4] ".internal.selfref" "sorted"

attributes(from_feather) <- attributes(from_feather)[c("names", "row.names", "class", ".internal.selfref", "sorted")]
identical(from_feather, original_data_table, attrib.as.set = FALSE)
#> [1] TRUE
digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"

Created on 2023-07-17 with reprex v2.0.2
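Rather than hard-coding the attribute names, the reordering step could be wrapped in a small helper that puts attributes into one fixed order before hashing. This is only a sketch: `normalize_attr_order()` is a hypothetical function, not part of data.table, feather, or digest, and the demonstration uses a plain classed list (`x1`, `x2`, with a made-up class `record` and attribute `source`) rather than the data.table objects above.

```r
# Hypothetical helper: rebuild the attribute list in a fixed order so that
# objects which differ only in attribute order end up with the same order.
normalize_attr_order <- function(x, first = c("names", "row.names", "class")) {
  a <- attributes(x)
  if (is.null(a)) return(x)
  nms <- names(a)
  # Common attributes first, then the rest sorted in a locale-independent way.
  ordered <- c(intersect(first, nms), sort(setdiff(nms, first), method = "radix"))
  attributes(x) <- a[ordered]
  x
}

# Two lists with the same attributes stored in a different order.
x1 <- structure(list(a = 1, b = "z"), class = "record", source = "rds")
x2 <- structure(list(a = 1, b = "z"), source = "rds", class = "record")

identical(x1, x2)                          # attributes compared as a set
#> [1] TRUE
identical(x1, x2, attrib.as.set = FALSE)   # attribute order compared too
#> [1] FALSE

# After normalizing both objects, the attribute order matches as well.
identical(normalize_attr_order(x1), normalize_attr_order(x2), attrib.as.set = FALSE)
#> [1] TRUE
```

In the workflow from the question, the idea would be to hash the normalized copy, for example digest::sha1(normalize_attr_order(from_feather)), and to pass the reference object through the same helper so both sides follow one convention. I have not checked this against the data.table objects above; the helper returns a modified copy, so it is probably best used only as input to the hash rather than as a replacement for the table itself.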