Reproducibility: Failing to rerun code over time


I fear that code that runs today could fail in the future. I've seen this with tidyverse functions that ran fine but later returned an error because they had become defunct. For a reproducible example, try this piece of code from How to make a great R reproducible example, which ironically is no longer reproducible (compare the values of age and x to the original post):

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
  id       date group age   type           x
1  1 2020-12-26     A  29 type 1  0.63286260
2  2 2020-12-27     B  30 type 2  0.40426832
3  3 2020-12-28     A  21 type 3 -0.10612452
4  4 2020-12-29     B  28 type 4  1.51152200
5  5 2020-12-30     A  26 type 5 -0.09465904
6  6 2020-12-31     B  24 type 6  2.01842371

Question

Does the very same code return a different output only after updates? In other words: packages and R itself usually do not update automatically, so does that mean I can rerun a function for an "eternity" as long as I do not update anything manually? Are there any exceptions?

Why I ask

I encrypt sensitive data for my company using the bcrypt package in R. We need to encrypt data and delete the original data. Once this is done there is no way back, i.e. I really have to trust the code. I use no packages other than bcrypt, shiny and shinydashboard.

Edit

My question assumes that the code is run on the same system, without changes to global settings (edit after comment from @qdread) or to the R version.
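The "nothing has changed" assumption can be checked rather than trusted. The following is a minimal sketch, not a definitive implementation: check_pinned_versions() is a hypothetical helper, and the pinned values shown are placeholders taken from the running session so that the example executes anywhere. In a real deployment they would be literal strings recorded once (for example from sessionInfo()) and would include bcrypt.

```r
# Hypothetical helper: compares the running R and package versions against
# pinned values and returns a character vector describing every mismatch.
check_pinned_versions <- function(expected_r, expected_pkgs) {
  problems <- character(0)
  current_r <- as.character(getRversion())
  if (current_r != expected_r) {
    problems <- c(problems,
                  sprintf("R is %s, expected %s", current_r, expected_r))
  }
  for (pkg in names(expected_pkgs)) {
    # packageVersion() errors if the package is missing; treat that as NA
    installed <- tryCatch(as.character(utils::packageVersion(pkg)),
                          error = function(e) NA_character_)
    # Strict string comparison: "1.1" and "1.1.0" count as different pins
    if (is.na(installed) || installed != expected_pkgs[[pkg]]) {
      problems <- c(problems,
                    sprintf("%s is %s, expected %s",
                            pkg, installed, expected_pkgs[[pkg]]))
    }
  }
  problems
}

# Placeholder pins (assumption): replace with literals such as "4.3.1" and
# add an entry for bcrypt in the real app.
problems <- check_pinned_versions(
  expected_r    = as.character(getRversion()),
  expected_pkgs = c(stats = as.character(utils::packageVersion("stats")))
)
if (length(problems) > 0) stop(paste(problems, collapse = "; "))
```

Run at app start-up, a check like this turns a silent version drift into an immediate, visible error.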

What I do in detail: I work with patient data. First, I choose a random ID consisting of letters and numbers for every patient, e.g. A72CV for Max Cooper 1987-05-03. Next, I use bcrypt to create a salt for every patient and then create hashed/encrypted versions of the IDs using the salts (salt + ID = encrypted ID). So every patient has name + birthdate, a random letters/numbers ID, a salt (generated using salt <- bcrypt::gensalt(log_rounds = 12)) and the encrypted ID (generated using id_encrypted <- bcrypt::hashpw(id, salt = salt)).

I save the data in three separate files: (i) patient data, i.e. name and birthdate, and encrypted ID, (ii) IDs and salts and (iii) the actual database with IDs and a number of variables of interest, e.g. smoker, weight, ... This approach is recommended by some institutions in the context where I work and is called pseudonymisation (a reversible encryption). It ensures that even if data leak, there is no obvious connection between the identifying variables name + birthdate and the variables of interest (smoker, ...).

I made a shinyApp that allows my co-workers to (1) provide an ID and look up name + birthdate, (2) provide name + birthdate and look up the ID and (3) generate an ID for a new patient. This all works because the same ID with the same salt results in the same encrypted (hashed) ID, at least for now. But if in the future, for some reason, the same input (e.g. ID) does not return the same output (e.g. name + birthdate), I am totally screwed. On the other hand, it is not a big problem if the generation of the random IDs changes over time, because each ID is created and saved just once, i.e. this process does not have to be reproducible.

The described encryption method will be applied to a few databases that took my institution many years to collect. If we cannot recreate the data, all is lost. That is why code stability is so important to me.

I will install the shinyApp on the Windows computers of my colleagues. They will just hit Run App inside R and then use one of the options described above (1 to 3).
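The property the whole scheme depends on (same ID + same salt gives the same hashed ID) can be tested explicitly against a few stored reference rows before any lookup is trusted. A minimal sketch under stated assumptions: verify_known_answers() and the reference table are hypothetical, and toy_hash() is a non-cryptographic stand-in used only so the example runs without bcrypt installed.

```r
# Hypothetical known-answer check: recompute each reference hash and stop
# with an error if any of them no longer matches the stored value.
verify_known_answers <- function(reference, hash_fun) {
  recomputed <- mapply(hash_fun, reference$id, reference$salt,
                       USE.NAMES = FALSE)
  mismatched <- reference$id[recomputed != reference$id_encrypted]
  if (length(mismatched) > 0) {
    stop("Hash mismatch for reference ID(s): ",
         paste(mismatched, collapse = ", "))
  }
  invisible(TRUE)
}

# Stand-in for illustration only; NOT a real hash function.
toy_hash <- function(id, salt) paste0(salt, ":", id)

# Made-up reference rows; real ones would be generated once and stored.
reference <- data.frame(
  id           = c("TEST1", "TEST2"),
  salt         = c("saltA", "saltB"),
  id_encrypted = c("saltA:TEST1", "saltB:TEST2"),
  stringsAsFactors = FALSE
)

verify_known_answers(reference, toy_hash)
```

In the app itself, hash_fun would presumably be bcrypt::hashpw, called with the ID and the stored salt as in the question, and the reference rows would be dummy IDs hashed once at setup. If this check fails after any change to the system, the lookups should not be trusted until the cause is found.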

r2evans On

(Partial answer.)

The default behavior of sample changed in R-3.6.0. Notably, the R-3.6.0 entry in NEWS-3 states under SIGNIFICANT USER-VISIBLE CHANGES:

The default method for generating from a discrete uniform distribution (used in sample(), for instance) has been changed. This addresses the fact, pointed out by Ottoboni and Stark, that the previous method made sample() noticeably non-uniform on large populations. See PR#17494 for a discussion. The previous method can be requested using RNGkind() or RNGversion() if necessary for reproduction of old results. Thanks to Duncan Murdoch for contributing the patch and Gabe Becker for further assistance.

We can recover the age values by setting sample.kind = "Rounding":

RNGkind(sample.kind = "Rounding")
# Warning in RNGkind(sample.kind = "Rounding") :
#   non-uniform 'Rounding' sampler used

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
#   id       date group age   type           x
# 1  1 2020-12-26     A  29 type 1  0.63286260
# 2  2 2020-12-27     B  30 type 2  0.40426832
# 3  3 2020-12-28     A  21 type 3 -0.10612452
# 4  4 2020-12-29     B  28 type 4  1.51152200
# 5  5 2020-12-30     A  26 type 5 -0.09465904
# 6  6 2020-12-31     B  24 type 6  2.01842371
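RNGkind() changes the sampler for the rest of the session. If only one block of legacy code needs the old behavior, the settings can be saved and restored around it. A minimal sketch, assuming R >= 3.6.0 (where RNGkind() returns three values) and using RNGversion(), which the NEWS entry mentions as the alternative:

```r
old_kind <- RNGkind()                  # remember the current generator settings
suppressWarnings(RNGversion("3.5.0"))  # pre-3.6.0 defaults, incl. "Rounding" sampler

set.seed(42)
age <- sample(18:30, 6, replace = TRUE)

# Restore whatever was active before the legacy block
RNGkind(kind = old_kind[1], normal.kind = old_kind[2], sample.kind = old_kind[3])

age
# [1] 29 30 21 28 26 24   (the age column shown above)
```

The suppressWarnings() call only hides the "non-uniform 'Rounding' sampler used" warning; drop it if the warning should stay visible.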

As for the changed rnorm output, the same link notes:

Note: The output of set.seed() differs between R >3.6.0 and previous versions. Specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can use the RNGversion()-function before set.seed() (e.g.: RNGversion("3.5.2")).

Unfortunately, I cannot reproduce the link's version of the x-column.


How to deal with this in production? It is always sketchy to rely on random numbers in unit tests, for two main reasons: you cannot always assume that unseeded random values will hit the corner cases you want; and seeded random values are subject to "bug-fixes" or improvements to the PRNG process, as you're seeing here.
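One way around both problems is to write the corner cases down as literal inputs, so the test depends neither on luck nor on the PRNG. A minimal sketch; age_group() is a hypothetical function under test, made up for illustration:

```r
# Hypothetical function under test: bins ages into [18, 25) and [25, 30]
age_group <- function(age) {
  cut(age, breaks = c(18, 25, 30), labels = c("18-24", "25-30"),
      right = FALSE, include.lowest = TRUE)
}

# Corner cases listed explicitly instead of drawn with sample()
boundary_ages <- c(18, 24, 25, 30)
out_of_range  <- c(17, 31)

stopifnot(
  identical(as.character(age_group(boundary_ages)),
            c("18-24", "18-24", "25-30", "25-30")),
  all(is.na(age_group(out_of_range)))
)
```

Seeded random inputs can still be useful for exploratory checks, but a test like this keeps passing or failing for the same reason across R versions.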