Pre-filtering with pipe connections and vroom


I want to read a large .txt file into R using the vroom package, because it is fast and supports pipe connections for pre-filtering.

For reproducibility, let's read this UK cats CSV file from the Tidy Tuesday project and pre-filter for tag_id == "Ares". The first column of the file is the tag_id.

The following code returns an empty data frame. How can I fix the filter, and what changes are needed to filter by a regular expression instead of == "Ares"?

cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

vroom(
  file = pipe(paste("awk -F ',' '{ if ($1 == 'Ares') { print } }'", cats_file)),
  delim = ","
)

2 Answers

markp-fuso

Inside an awk script, literal string values need to be wrapped in double quotes, e.g.:

($1 == "Ares")

Single quotes are used to delimit the awk script/code from the shell; in this case the shell sees 3 chunks of script/code:

  • { if ($1 == +
  • Ares +
  • ) { print }}

which the shell concatenates into a single script:

{ if ($1 == Ares) { print } }

This means awk compares $1 to whatever is in the variable named Ares, which in this case is undefined (i.e. the empty string), so $1 == <empty_string> fails for every line and nothing is printed.
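A quick shell sketch (outside R, assuming awk and a POSIX shell are available; the two-line file is a hypothetical stand-in for the real CSV) shows the difference between the undefined-variable comparison and the string literal:

```shell
# Hypothetical sample data standing in for the CSV
printf 'Ares,1\nBella,2\n' > /tmp/cats_demo.csv

# Here Ares is an uninitialized awk variable, i.e. the empty string,
# so no line matches and nothing is printed
awk -F ',' '{ if ($1 == Ares) { print } }' /tmp/cats_demo.csv

# With double quotes, "Ares" is a string literal, so the first line prints
awk -F ',' '{ if ($1 == "Ares") { print } }' /tmp/cats_demo.csv   # Ares,1
```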

I'm assuming you would need to escape the embedded double quotes, eg:

file = pipe(paste("awk -F ',' '{ if ($1 == \"Ares\") { print } }'", cats_file)),
                                           ^^    ^^

NOTE: I don't work with R/vroom, so I'm assuming the rest of the OP's code should work once the awk script is modified.


As Ed Morton has mentioned in the comments, the following should also work:

awk -F ',' '{ if ($1 ~ /Ares/) { print } }'

file = pipe(paste("awk -F ',' '{ if ($1 ~ /Ares/) { print } }'", cats_file)),

####
#### or
####

awk -F ',' '$1 ~ /Ares/'

file = pipe(paste("awk -F ',' '$1 ~ /Ares/'", cats_file)),

####
#### or
####

awk -F ',' '$1 == "Ares"'

file = pipe(paste("awk -F ',' '$1 == \"Ares\"'", cats_file)),
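One caveat worth noting (my addition, not from the answer): `$1 ~ /Ares/` is a substring match, so a hypothetical tag such as `Ares2` would also pass the filter. Anchoring the regex with `^` and `$` restricts it to whole-field matches:

```shell
# /Ares/ matches any first field containing "Ares" as a substring,
# so both lines print here
printf 'Ares,1\nAres2,2\n' | awk -F ',' '$1 ~ /Ares/'

# The anchors ^ and $ force a whole-field match, so only Ares,1 prints
printf 'Ares,1\nAres2,2\n' | awk -F ',' '$1 ~ /^Ares$/'   # Ares,1
```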
G. Grothendieck

1) The file linked to is not so large that you need to do anything special. Try this:

library(dplyr)
library(vroom)

Ares <- cats_file |>
  vroom() |>
  filter(tag_id == "Ares") 

2) This also works

Ares <- cats_file |>
  vroom() |>
  filter(grepl("^Ares$", tag_id))

3) This works too and uses no packages:

Ares <- cats_file |>
  read.csv() |>
  subset(tag_id == "Ares") 

4) Another approach, if all the Ares rows are together (which appears to be the case), is to read in just the first column and then use it to pick out the rows needed.

tag_id <- read.csv(cats_file, colClasses = c(NA, rep("NULL", 10)))
rng <- range(grep("Ares", tag_id$tag_id))
Ares <- read.csv(cats_file, skip = rng[1] - 1, nrows = diff(rng) + 1)

An alternative, if we know that the tag_id field comes first, is to replace the tag_id line above with the following, which treats "," as a comment character so that only the first field of each line is read:

tag_id <- read.csv(cats_file, comment.char = ",")