read_csv2 is multiplying some integers by 1000, affecting only the sum() function, how do I correct it?

42 Views Asked by At

I'm reading data from an excel spreadsheet with read_csv2 (the data is using ; as a separator). There's only two columns, let's call them product and count. count is always an integer or NA.

My problem is that the function seems to be adding '.000' to integer values <1000, for example 15 becomes 15.000, and is messing with the sum function.

Here's how I'm reading my data:

data = read_csv2('data.csv', col_types = c('c','n'), col_names = c('product','value'), skip = 1)%>% arrange(product)

Sample of the data:

 product ; count
 product1 ; 1,085 
 product2 ; 205 
 product3 ; 770 
 product4 ; 25 
 product5 ; 50 
 product6 ;
 product7 ;
 product8 ; 3,382 
 product9 ; 1,152 
 product10 ;
 product11 ;
 product12 ; 140 

in this section, this problem occurs with products 2-5 and 12.

This seems to be harmless when assessing individual values, lets say for example that the 15th row contains one of these values.

it shows up on the data frame as 15.000 however data$value[[15]] returns 15, and adding this value with another row which doesn't have this problem works just fine, for example if the 16th row shows up correctly as 5674, data$value[[15]]+data$value[[16]] returns 5689.

However, when I use the sum() function the extra '.000's seem to matter: sum(data$value) returns around 7 million when it should be around 140k.

I've tried to change col_type from 'n' to 'i', doesn't seem to matter, and data$value = round(data$value) also does nothing.

1

There are 1 best solutions below

0
r2evans On

If you read ?read_csv2, it says that

                            ‘read_csv2()’ uses ; for the field separator
     and , for the decimal point.

Similar to the base equivalent functions (read.csv and read.csv2), the readr functions read_csv and read_csv require you to use the ,/. and ;/, options for separators and decimal points.

If you want to change that behavior, you need to use read.table or readr::read_delim.

Demonstration:

readr::read_csv2("a;b\n1;2.000\n3;4")
# ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Rows: 2 Columns: 2
# ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ";"
# dbl (1): a
# num (1): b
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 2 × 2
#       a     b
#   <dbl> <dbl>
# 1     1  2000
# 2     3     4
readr::read_delim("a;b\n1;2.000\n3;4", delim = ";")
# Rows: 2 Columns: 2
# ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ";"
# dbl (2): a, b
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 2 × 2
#       a     b
#   <dbl> <dbl>
# 1     1     2
# 2     3     4

And with your updated data,

library(dplyr) # not strictly required but useful here
readr::read_delim("data.csv", delim=";") |>
  mutate(across(where(is.character),
                ~ type.convert(gsub("([0-9]),([0-9]{3})", "\\1\\2", trimws(.x)), as.is = TRUE))) |>
  rename_with(.fn = trimws)
# Rows: 12 Columns: 2
# ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ";"
# chr (2):  product ,  count
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 12 × 2
#    product   count
#    <chr>     <int>
#  1 product1   1085
#  2 product2    205
#  3 product3    770
#  4 product4     25
#  5 product5     50
#  6 product6     NA
#  7 product7     NA
#  8 product8   3382
#  9 product9   1152
# 10 product10    NA
# 11 product11    NA
# 12 product12   140