I'd like to read a remote archive file with vroom and get an additional column containing the names of the files inside the archive instead of the archive name. Is this possible with vroom without the local archive_extract step shown in the example below?
Thank you
library(tidyverse)
library(archive)
library(vroom)
file <- "ftp://opendata.dwd.de/climate_environment/CDC/grids_germany/daily/regnie/ra2021m.tar"
test1 <- vroom_fwf(file, col_positions = fwf_widths(rep(4, 611)),
                   col_types = cols(.default = col_integer()),
                   na = "-999", id = "filename")
test1$filename %>% unique()
#> [1] "ftp://opendata.dwd.de/climate_environment/CDC/grids_germany/daily/regnie/ra2021m.tar"
my_dir <- fs::file_temp() %>% fs::dir_create()
archive_extract(file, dir = my_dir)
test2 <- fs::dir_ls(my_dir) %>%
  vroom_fwf(col_positions = fwf_widths(rep(4, 611)),
            col_types = cols(.default = col_integer()),
            na = "-999", id = "filename")
test2$filename %>% unique()
#> [1] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210101.gz"
#> [2] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210102.gz"
#> [3] ".../AppData/Local/Temp/Rtmp2TTpuI/filebfd82b6b1f6/ra210103.gz"
...
Created on 2022-07-25 by the reprex package (v2.0.1)
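The vroom vignette suggests listing the members of an archive first and then reading each one through its own connection.
Adapted to your use case, this gives something like getting the file names with untar(file, list = TRUE) and then reading each member by name.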
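A minimal sketch of that list-then-read pattern, run against a tiny local tar archive so it works offline. The archive, the member names ra_a.txt and ra_b.txt, the two-column layout, and the helper read_member are all made up for illustration. The members here are plain text, unlike the .gz members in the question.

```r
library(vroom)
library(archive)

# Build a small local tar archive (hypothetical stand-in for ra2021m.tar).
src_dir <- tempfile("members")
dir.create(src_dir)
writeLines(c("   1   2", "-999   4"), file.path(src_dir, "ra_a.txt"))
writeLines(c("   5   6", "   7-999"), file.path(src_dir, "ra_b.txt"))
tar_path <- tempfile(fileext = ".tar")
old_wd <- setwd(src_dir)
utils::tar(tar_path, files = list.files())
setwd(old_wd)

# Step 1: list the member names. This scans the whole archive.
filenames <- utils::untar(tar_path, list = TRUE)

# Step 2: read each member by name through its own connection and
# store the member name in a "filename" column.
read_member <- function(name) {
  df <- vroom_fwf(
    archive_read(tar_path, file = name),
    col_positions = fwf_widths(rep(4, 2)),
    col_types = cols(.default = col_integer()),
    na = "-999"
  )
  df$filename <- name
  df
}

result <- do.call(rbind, lapply(filenames, read_member))
```

With the remote file from the question, tar_path would be replaced by file and the widths by rep(4, 611); that variant is not tested here.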
However, this is slow (and crashes more often than not with my poor internet connection) because, as mentioned here, untar needs to read the whole archive in order to get the file names. Hence, I suppose, your question.
One way to avoid this is to use archive_read with an index position. This does not give you the exact file names, but at least it gives you an index that lets you tell the files apart. This is the only improvement over your current implementation.
Since you might not know the number of files prior to reading, you may want to include error management in your function.
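A rough sketch of reading by index with error handling, again on a small local archive so it runs offline. The helper read_by_index, its max_files cap, and the test archive are hypothetical. It assumes that asking for an index past the last member either raises an error or yields no rows; I have not checked how archive_read behaves in that case on every version, hence the cap as a safety net.

```r
library(vroom)
library(archive)

# Build a small local tar archive (hypothetical stand-in for ra2021m.tar).
src_dir <- tempfile("members")
dir.create(src_dir)
writeLines(c("   1   2", "-999   4"), file.path(src_dir, "ra_a.txt"))
writeLines(c("   5   6", "   7-999"), file.path(src_dir, "ra_b.txt"))
tar_path <- tempfile(fileext = ".tar")
old_wd <- setwd(src_dir)
utils::tar(tar_path, files = list.files())
setwd(old_wd)

# Read members by position until one cannot be read.
read_by_index <- function(path, widths, max_files = 1000L) {
  parts <- list()
  for (i in seq_len(max_files)) {
    df <- tryCatch(
      vroom_fwf(
        archive_read(path, file = i),
        col_positions = fwf_widths(widths),
        col_types = cols(.default = col_integer()),
        na = "-999"
      ),
      error = function(e) NULL
    )
    # Stop at the first index that fails or returns no rows
    # (assumed to mean we are past the last member).
    if (is.null(df) || nrow(df) == 0L) break
    df$index <- i
    parts[[length(parts) + 1L]] <- df
  }
  do.call(rbind, parts)
}

result <- read_by_index(tar_path, widths = rep(4, 2))
```

Swallowing every error keeps the loop simple, but it also hides genuine failures such as a dropped connection, so in real use you may want to inspect the error message before deciding to stop.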
Is this faster or better than your current approach? I'll let you decide, but I wouldn't say so.
I would keep extracting the individual files, since being unable to read the names without reading the whole archive unfortunately seems to be a limitation of the tar format.