trouble reading data from .csv and .xlsx files in R

97 Views Asked by At

I'm trying to import some data in R but the imported data differs from the original one in the smaller digits in a way I'm not really able to understand why.

First I tried to import with read.csv the file "test 1.csv", which contains a single number:

1761,74956350536

But the imported number is not identical to the original one:

> x <- read.csv2("test 1.csv", header = FALSE)
> print(x, digits = 20)
                     V1
1 1761.7495635053599017

I first considered that R might be attempting to read the data into a kind of variable which would not accommodate this many digits. But in fact, the number read by R is "more precise" (has more digits) than the original one. And it was correctly read as a numeric type (thus double precision floating point).

I then tried to write x back to another .csv file:

> write.csv2(x, "test 1 out.csv", row.names = FALSE, col.names = FALSE)

With the result that the original number (1761,74956350536) was "restored".

I then thought it might be something on the way the print function works and the number read by read.csv was correct. I then created a spreadsheet in Excel (test 2.xlsx) with five numbers with the sum of these in the sixth column. Then exported it to a .csv file. The resulting file (test 2.csv) reads:

1761,749563505360000;2819,170384367300000;39,526631074886200;1600,629524205790000;0,556813868271186;6221,632917021610000

Notice the first number of the series is the same as the one contained in the original "test 1.csv" file. I then imported it into R and checked the sum.

> x <- read.csv2("test 2.csv", header = FALSE)
> conf <- (sum(x[1:5])-x[6])
> conf
             V6
1 -9.094947e-13
> print(conf, digits = 20)
                          V6
1 -9.0949470177292823792e-13

And the first five numbers no longer add up to the sixth.

I tried other import methods, directly from the .xlsx file, using both xlsx and openxlsx packages:

> library(xlsx)
> x <- read.xlsx("test 2.xlsx", 1, header = FALSE)
> print(x[1], digits = 20)
                     X1
1 1761.7495635053587648
> conf <- (sum(x[1:5])-x[6])
> conf
  X6
1  0
> print(conf, digits = 20)
  X6
1  0
> write.xlsx(x, "test 2 out.xlsx", col.names=FALSE, row.names=FALSE)

And:

> library(openxlsx)
> x <- read.xlsx("test 2.xlsx", 1, colNames = FALSE)
> print(x[1], digits = 20)
                     X1
1 1761.7495635053587648
> conf <- (sum(x[1:5])-x[6])
> conf
  X6
1  0
> print(conf, digits = 20)
  X6
1  0
> write.xlsx(x, "test 2 out.xlsx", col.names=FALSE, row.names=FALSE)

Now we got still a third number, different from the original one and also different from the one read by read.csv. Once again write.xlsx "restores" the original numbers to "test 2 out.xlsx". In this case the checked sum matches, but unfortunately this is not the case for these packages and functions for a larger set of numbers with which I'm currently working on.

I feel I'm asking something which either obvious or well known (I'm a newbie...), for the operation is quite simple. But I'm indeed at a loss here. Does anyone know why this happens? Is there a way to avoid/correct/circumvent this behavior? Any help would be very appreciated.

(I'm running R version 3.2.0 (through RStudio) on a Linux Mint 17.1 Rebecca machine.)

0

There are 0 best solutions below