rio and haven randomly interpret NAs in date column as 1582-10-14 when importing from SPSS

86 Views Asked by At

I am importing data from SPSS (sav file). In this data set there are several date columns, several of which contain cells with missing values. All date columns are imported as date variables into R, so far so good. All date entries from SPSS are imported correctly into R, still so far so good. However, some of the missing values in SPSS are interpreted as 1582-10-14 when imported to R. I understand this is the start of the Gregorian calendar, so I thought maybe some of the cells in SPSS are 0 instead of NA, but when checking said cells, they are indeed empty, and look exactly the same as other empty cells which are correctly imported as NA in R.

I am sorry I can't think of a reproducible example and I can't put my SPSS file here, integrityreasons.

I have tried importing using rio (import_list) and haven(read_sav), and the same cells are interpreted as 1582-10-14 when using both methods. Maybe it is not random and there is some hidden data in the SPSS file not visible in the SPSS interface?

Any idea why I get this error? Could it be a bug? My experience though is that it seldom is. Has anyone else experienced this?

A possible fix would be just to remove all 1582-10-14 entries in R but I would like know the reason why it occurred. What if more data imported from SPSS is incorrect?

Edit: This post has been suggested as a solution, which it is not. That post concerns R importing dates from SPSS as absolute numbers in seconds from 1582-10-14 and how to convert these values to actual dates, if I understand correctly. This post concerns R incorrectly interpreting some of the missing values of an SPSS date column as 1582-10-14.

Here is a screenshot of how the data looks in SPSS and a dput() of how it looks after importing to R.

Screenshot of how the data in SPSS

Link to SPSS dataset

datetest <- structure(list(datevariable = structure(c(NA, 18394, 
 NA, 17914, -141428, -141428, NA, -141428, NA, NA, NA, 
 -141428, NA, NA, -141428, -141428, NA, NA, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17765, NA, NA, 
 NA, NA, NA, NA, NA, -141428, NA, NA, NA, NA, NA, NA, NA,
 19278, NA, NA, NA, NA, NA, NA, NA, 19121, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
 NA, NA, NA, NA), label = "datevariable", class = "Date", format.spss = "SDATE10")),
     class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -77L))

As you can see row numbers 2 and 4 are imported correctly. The missing values of row 5,6,8,12,15... are imported as 1582-10-14.

1

There are 1 best solutions below

1
On

I have an incomplete and possibly unsatisfying answer.

I downloaded your file and ran the following code:

library(foreign)
library(rio)
x1 <- read.spss("spss_test.sav")
x2 <- rio::import("spss_test.sav")
x3 <- haven::read_spss("spss_test.sav")

In fact, rio is just a wrapper for the code in the haven or (alternatively, if requested) foreign packages, so x2 is redundant ... I inspected the results, using the print_hex() function defined below, and viewing the file in "hexl-mode" in Emacs (I imagine you can find other hex-editors online etc.). The advantage of comparing foreign::read.spss() and haven::read_spss() is that read.spss actually does less processing of the data (producing weird long numbers that can then be translated into dates), which makes it a little easier to see what's going on.

Looking at the results, comparing the first few elements of x1[[i]] (numeric value); print_hex(x1[[i]]) (underlying hex representation); and x3[[i]] (date value), we get

  • elements 1 and 3: NA, 7ff00000000007a2, NA
  • element 2: 13808620800, 4209b876a8000000, 2020-05-12
  • element 4: 13767148800 , 4209a4b028000000, 2019-01-18
  • element 5: 0, 0, 1582-10-14

So it's definitely the case that some of the missing values are being encoded by SPSS as zero and some as an NA value. Unfortunately, why the missing values are being encoded differently is a mystery to me.

I tried looking at the raw hex code dump, but it's painful and doesn't help very much. At the end of line 2 below, you can see the hex value corresponding to element 2 coded as 00 0000 a876 b809 42, which is in backwards reversed order, i.e. start with the last pair (42) and read each pair going right-to-left (09, b8, 76 ...). The value for element 4 seems to be right after it (end of line 2, beginning of line 3). This is about where I get lost, because I don't know what kind of compression/other magic is being done in the file encoding ...

01           0000 5554 462d 38e7 0300 0000 0000 00ff  ..UTF-8.........
02 000001f0: fdff fd64 64ff 6400 0000 a876 b809 4200  ...dd.d....v..B.
03 00000200: 0000 28b0 a409 42ff ffff 64ff ff64 64ff  ..(...B...d..dd.
04 00000210: ffff ffff ffff ffff ffff ffff ffff fd00  ................
05 00000220: 0000 ac8c 9e09 42ff ffff ffff ffff 64ff  ......B.......d.

From Double precision (64-bit) representation of numeric value in R (sign, exponent, significand) :

Rcpp::cppFunction('void print_hex(double x) {
    uint64_t y;
    static_assert(sizeof x == sizeof y, "Size does not match!");
    std::memcpy(&y, &x, sizeof y);
    Rcpp::Rcout << std::hex << y << std::endl;
}', plugins = "cpp11", includes = "#include <cstdint>")