Read the column specification col_types of readr::read_delim from file


How can I read the column specification col_types of the readr::read_delim function from a file?

Instead of

> read_csv(file = I('varInt,varChar,varFac\n
+                    1,a,A1\n
+                    2,b,A2\n
+                    3,c,A3'),
+          col_types = cols(varInt = 'i',
+                           varChar = 'c',
+                           varFac = col_factor(levels = c('A1', 'A2', 'A3'))))
# A tibble: 3 × 3
  varInt varChar varFac
   <int> <chr>   <fct>
1      1 a       A1
2      2 b       A2
3      3 c       A3

I want to do something like

mySpecFile <- read_csv(file = I("Variable,Spec\n
                                 varInt,i\n
                                 varChar,c\n
                                 varFac,col_factor(levels = c('A1'; 'A2'; 'A3'))"))

mySpec <- mySpecFile |> pull(Spec, Variable) |> as.list()

read_csv(file = I('varInt,varChar,varFac\n
                   1,a,A1\n
                   2,b,A2\n
                   3,c,A3'),
         col_types = mySpec)

But this throws: Error: Unknown shortcut: col_factor(levels = c('A1'; 'A2'; 'A3'))

So, specifying levels of factors does not work for me.

This question seems to be related: "R readr col_types specified in a metadata file, specifically using custom date formats"

However, the readr::read_delim documentation for col_types says:

One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be inferred from guess_max rows of the input, interspersed throughout the file. This is convenient (and fast), but not robust. If the guessed types are wrong, you'll need to increase guess_max or supply the correct types yourself.

Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column:

Best answer (by r2evans)

A few things:

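  • The varFac spec is read in as a string that merely contains the text col_factor(...); it is not a call, an expression, or the result of evaluating one. readr therefore tries to interpret the whole string as a shortcut and fails. We can evaluate the string to get the actual column specification.

  • varFac,col_factor(levels = c('A1'; 'A2'; 'A3')) does not contain a valid R expression, so the ; inside c() must be replaced with ,. Because those commas would then collide with the CSV delimiter, the spec file likely needs to be ;-delimited (or use some delimiter other than ,).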
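Both points can be checked directly. The following is a minimal sketch, assuming readr is attached so that col_factor() can be found; the names spec_string, spec_object and bad_string are only illustrative. Each of the three checks should come back TRUE.

```r
library(readr)

# The value as it arrives from the spec file: plain text, not a column spec
spec_string <- "col_factor(levels = c('A1', 'A2', 'A3'))"
is.character(spec_string)

# Parsing and evaluating the text calls col_factor() and yields a collector object
spec_object <- eval(parse(text = spec_string))
inherits(spec_object, "collector")

# With ';' inside c() the text is not valid R, so parse() raises an error
bad_string <- "col_factor(levels = c('A1'; 'A2'; 'A3'))"
inherits(try(parse(text = bad_string), silent = TRUE), "try-error")
```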

library(readr)
library(dplyr)  # for pull()
mySpecFile <- read_csv2(file = I("Variable;Spec\n
                                 varInt;i\n
                                 varChar;c\n
                                 varFac;col_factor(levels = c('A1', 'A2', 'A3'))"))
# ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# Rows: 3 Columns: 2
# ── Column specification ────────────────────────────────────────────────
# Delimiter: ";"
# chr (2): Variable, Spec
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
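mySpec <- mySpecFile |>
  pull(Spec, Variable) |>  # character vector of Spec values, named by Variable
  as.list() |>
  # evaluate multi-character entries as R code; keep one-letter shortcuts as strings
  lapply(function(z) {
    if (nchar(z) > 1) tryCatch(eval(parse(text = z)), error = function(e) z) else z
  })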
read_csv(file = I('varInt,varChar,varFac\n
                   1,a,A1\n
                   2,b,A2\n
                   3,c,A3'),
         col_types = mySpec)
# # A tibble: 3 × 3
#   varInt varChar varFac
#    <int> <chr>   <fct> 
# 1      1 a       A1    
# 2      2 b       A2    
# 3      3 c       A3    

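The if (nchar(z) > 1) check guards against one-letter shortcuts being evaluated as R code: "c" (for character), for example, would evaluate to the R function c, and other shortcuts such as "t" or "D" also name functions in a default R session. If you want more specificity, change that conditional to something else.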

The tryCatch(..., error = function(e) z) ensures that a string which cannot be parsed or evaluated as R code is returned unchanged.
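The two safeguards can be tried in isolation with base R alone. This is a rough sketch: looks_like_col_call() is a hypothetical helper, not part of readr, showing one way the conditional could be made more specific. Keep in mind that eval(parse(...)) runs whatever code the spec file contains, so this approach is only suitable for spec files you trust.

```r
# Why single letters are left alone: the text "c" evaluates to R's c() function
is.function(eval(parse(text = "c")))

# Fallback: text that cannot be parsed or evaluated comes back unchanged
z <- "not an R expression ("
identical(tryCatch(eval(parse(text = z)), error = function(e) z), z)

# Hypothetical stricter guard: only evaluate strings that look like col_*() calls
looks_like_col_call <- function(z) grepl("^col_[a-z_]+\\(.*\\)$", z)
looks_like_col_call(c("i", "c", "col_factor(levels = c('A1', 'A2'))"))
```

To use the stricter guard, replace nchar(z) > 1 in the lapply() call with looks_like_col_call(z).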

As an alternative to using ;-delimited text, we can keep the comma delimiter and wrap the Spec values in double quotes (or just the one string that needs it) to protect the embedded commas.

mySpecFile <- read_csv(file = I("Variable,Spec\n
                                 varInt,i\n
                                 varChar,c\n
                                 varFac,\"col_factor(levels = c('A1', 'A2', 'A3'))\""))
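For completeness, here is a compact sketch of the whole round trip with the quoted spec file. It assumes only readr is attached, uses base R (setNames()) in place of dplyr::pull(), and writes the inline data without the extra indentation; the object name dat is illustrative.

```r
library(readr)

# Spec file with the factor entry quoted so its commas are not treated as delimiters
mySpecFile <- read_csv(
  file = I("Variable,Spec\nvarInt,i\nvarChar,c\nvarFac,\"col_factor(levels = c('A1', 'A2', 'A3'))\""),
  col_types = "cc"  # read both spec columns as character
)

# Same conversion as before: evaluate multi-character entries, keep shortcuts as-is
mySpec <- lapply(
  as.list(setNames(mySpecFile$Spec, mySpecFile$Variable)),
  function(z) if (nchar(z) > 1) tryCatch(eval(parse(text = z)), error = function(e) z) else z
)

dat <- read_csv(file = I("varInt,varChar,varFac\n1,a,A1\n2,b,A2\n3,c,A3"),
                col_types = mySpec)
levels(dat$varFac)
```

The last line should list the levels taken from the spec file rather than levels guessed from the data.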