R+arrow: reading double with comma decimal separator

Please have a look at the code at the end of the post. You can download the input tsv file (nothing malicious!) from

https://e.pcloud.link/publink/show?code=XZ5eCWZdrwFuo5POSVzi7ywCmteHfE4rdmV

I am trying to convert a TSV file to a Parquet file without loading it into memory. This fails because the file uses a comma "," as the decimal separator, so arrow cannot parse the columns I declared as double. Is there any way to fix my code without changing the input file?

Thanks!

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#> 
#>     duration
#> The following object is masked from 'package:utils':
#> 
#>     timestamp



data <- open_dataset("test.tsv",
  format = "tsv",
  skip_rows = 1, 
  schema = schema(
    AID_MEASURE_ID = string(), 
    DATE_CREATED = string(), 
    DATE_GRANTED = string(), 
    AA_PUBLISHED_DATE = string(), 
    SERVER_REF = string(), 
    AM_TITLE = string(), 
    AM_TITLE_EN = string(), 
    STATUS = string(), 
    AM_PROC_TYPE_CD = string(), 
    COFINANCE = string(), 
    OBJECTIVE = string(), 
    OTHER_OBJECTIVE_EN = string(), 
    AID_INSTRUMENT = string(), 
    OTHER_AID_INSTRUMENT_EN = string(), 
    BENEFICIARY_NAME = string(), 
    BENEFICIARY_NAME_ENGLISH = string(), 
    BENEFICIARY_NATIONAL_ID = string(), 
    BENEFICIARY_NAT_ID_TYPE_SD = string(), 
    BENEFICIARY_TYPE_SD = string(), 
    COUNTRY_SD = string(), 
    REGION_SD = string(), 
    SECTOR_SD = string(), 
    GRANTED_AMOUNT_FROM_EUR = double(), 
    NOMINAL_AMOUNT_EUR_FROM = double(), 
    GRANT_RANGE = string(),
    GRANTED_AMOUNT_RANGE_DESC = string(),
    GRANTING_AUTHORITY_NAME = string(), 
    GRANTING_AUTHORITY_NAME_EN = string(), 
    NUTS_CD = string(), 
    GRANTING_AUTHORITY_COUNTRY = string()
  )
)


write_dataset(
  data,
  format = "parquet",
  path = ".",
  max_rows_per_file = 1e7
)
#> Error: Invalid: Could not open CSV input source '/home/lorenzo/mega_pcloud/work/COMP/stat_support/tam_arrow/new_test/test.tsv': Invalid: In CSV column #22: Row #5: CSV conversion error to double: invalid value '631135,74'

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] arrow_13.0.0.1  lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
#>  [5] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
#>  [9] tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5         gtable_0.3.3      compiler_4.3.1    reprex_2.0.2     
#>  [5] tidyselect_1.2.0  assertthat_0.2.1  scales_1.2.1      yaml_2.3.7       
#>  [9] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    knitr_1.43       
#> [13] munsell_0.5.0     R.cache_0.16.0    tzdb_0.4.0        pillar_1.9.0     
#> [17] R.utils_2.12.2    rlang_1.1.1       utf8_1.2.3        stringi_1.7.12   
#> [21] xfun_0.39         fs_1.6.2          bit64_4.0.5       timechange_0.2.0 
#> [25] cli_3.6.1         withr_2.5.0       magrittr_2.0.3    digest_0.6.31    
#> [29] grid_4.3.1        hms_1.1.3         lifecycle_1.0.3   R.methodsS3_1.8.2
#> [33] R.oo_1.25.0       vctrs_0.6.2       evaluate_0.21     glue_1.6.2       
#> [37] styler_1.10.1     fansi_1.0.4       colorspace_2.1-0  rmarkdown_2.22   
#> [41] tools_4.3.1       pkgconfig_2.0.3   htmltools_0.5.5

Created on 2023-10-03 with reprex v2.0.2

Answer by thisisnic:

I'm afraid the decimal_point argument of the C++ CSV ConvertOptions class hasn't yet been exposed in the R bindings, so there is currently no way to tell open_dataset() that "," is the decimal separator, and the conversion to double fails. I've opened a ticket and started a PR to implement this; it has a decent chance of being merged before the next release.
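
In the meantime, one possible workaround is to declare the amount columns as string() in the schema and convert them inside the arrow query, so the data is still streamed rather than loaded into memory. The sketch below is only an illustration under assumptions: it uses a small made-up two-column file (the names ID and AMOUNT and the temporary paths are hypothetical, not taken from the original data), it has not been run against the original test.tsv, and it assumes the values contain no thousands separators.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the real input: a header row plus two data rows
# whose numbers use "," as the decimal separator.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("ID\tAMOUNT", "a\t631135,74", "b\t12,5"), tsv_path)

# Read the numeric column as text so the CSV reader never tries to parse it.
ds <- open_dataset(
  tsv_path,
  format = "tsv",
  skip_rows = 1,
  schema = schema(ID = string(), AMOUNT = string())
)

# Swap the comma for a period, then cast to double; both steps are
# evaluated lazily by arrow when the dataset is written.
converted <- ds |>
  mutate(AMOUNT = as.numeric(gsub(",", ".", AMOUNT, fixed = TRUE)))

out_dir <- tempfile()
write_dataset(converted, path = out_dir, format = "parquet")
```

For the schema in the question, the same mutate() step would apply to GRANTED_AMOUNT_FROM_EUR and NOMINAL_AMOUNT_EUR_FROM after changing both to string() in the schema.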