I am importing data from SPSS (sav file). In this data set there are several date columns, several of which contain cells with missing values. All date columns are imported as date variables into R, so far so good. All date entries from SPSS are imported correctly into R, still so far so good. However, some of the missing values in SPSS are interpreted as 1582-10-14 when imported to R. I understand this is the start of the Gregorian calendar, so I thought maybe some of the cells in SPSS are 0 instead of NA, but when checking said cells, they are indeed empty, and look exactly the same as other empty cells which are correctly imported as NA in R.
I am sorry I can't think of a reproducible example and I can't put my SPSS file here, integrityreasons.
I have tried importing using rio (import_list
) and haven(read_sav
), and the same cells are interpreted as 1582-10-14 when using both methods. Maybe it is not random and there is some hidden data in the SPSS file not visible in the SPSS interface?
Any idea why I get this error? Could it be a bug? My experience though is that it seldom is. Has anyone else experienced this?
A possible fix would be just to remove all 1582-10-14 entries in R but I would like know the reason why it occurred. What if more data imported from SPSS is incorrect?
Edit: This post has been suggested as a solution, which it is not. That post concerns R importing dates from SPSS as absolute numbers in seconds from 1582-10-14 and how to convert these values to actual dates, if I understand correctly. This post concerns R incorrectly interpreting some of the missing values of an SPSS date column as 1582-10-14.
Here is a screenshot of how the data looks in SPSS and a dput()
of how it looks after importing to R.
datetest <- structure(list(datevariable = structure(c(NA, 18394,
NA, 17914, -141428, -141428, NA, -141428, NA, NA, NA,
-141428, NA, NA, -141428, -141428, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17765, NA, NA,
NA, NA, NA, NA, NA, -141428, NA, NA, NA, NA, NA, NA, NA,
19278, NA, NA, NA, NA, NA, NA, NA, 19121, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), label = "datevariable", class = "Date", format.spss = "SDATE10")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -77L))
As you can see row numbers 2 and 4 are imported correctly. The missing values of row 5,6,8,12,15... are imported as 1582-10-14.
I have an incomplete and possibly unsatisfying answer.
I downloaded your file and ran the following code:
In fact,
rio
is just a wrapper for the code in thehaven
or (alternatively, if requested)foreign
packages, sox2
is redundant ... I inspected the results, using theprint_hex()
function defined below, and viewing the file in "hexl-mode" in Emacs (I imagine you can find other hex-editors online etc.). The advantage of comparingforeign::read.spss()
andhaven::read_spss()
is thatread.spss
actually does less processing of the data (producing weird long numbers that can then be translated into dates), which makes it a little easier to see what's going on.Looking at the results, comparing the first few elements of
x1[[i]]
(numeric value);print_hex(x1[[i]])
(underlying hex representation); andx3[[i]]
(date value), we getSo it's definitely the case that some of the missing values are being encoded by SPSS as zero and some as an
NA
value. Unfortunately, why the missing values are being encoded differently is a mystery to me.I tried looking at the raw hex code dump, but it's painful and doesn't help very much. At the end of line 2 below, you can see the hex value corresponding to element 2 coded as
00 0000 a876 b809 42
, which is in backwards reversed order, i.e. start with the last pair (42) and read each pair going right-to-left (09, b8, 76 ...). The value for element 4 seems to be right after it (end of line 2, beginning of line 3). This is about where I get lost, because I don't know what kind of compression/other magic is being done in the file encoding ...From Double precision (64-bit) representation of numeric value in R (sign, exponent, significand) :