Pre-filtering with pipe connections and vroom


I want to read a large .txt file into R using the vroom package, because it is fast and supports pipe connections for pre-filtering.

For reproducibility, let's read this UK cats CSV file from the Tidy Tuesday project and pre-filter for tag_id == "Ares". The first column of the file is the tag_id.

The following code returns an empty data frame. How can I fix the filter, and what changes are needed to filter by a regular expression instead of == "Ares"?

cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

vroom(
  file = pipe(paste("awk -F ',' '{ if ($1 == 'Ares') { print } }'", cats_file)),
  delim = ","
)

2 Answers

markp-fuso

Inside an awk script, literal string values need to be wrapped in double quotes, e.g.:

($1 == "Ares")

Single quotes are used to delimit the awk script/code from the shell; in this case the shell sees 3 chunks of script/code:

  • { if ($1 == +
  • Ares +
  • ) { print }}

which the shell concatenates into a single script:

{ if ($1 == Ares) { print } }

This means awk compares $1 to whatever is in the variable named Ares, which in this case is undefined (i.e. the empty string), so $1 == <empty_string> fails for every line and nothing is printed.
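A quick shell sketch (outside R, assuming awk and a POSIX shell are available; the two-line file is a hypothetical stand-in for the real CSV) shows the difference between the undefined-variable comparison and the string literal:

```shell
# Hypothetical sample data standing in for the CSV
printf 'Ares,1\nBella,2\n' > /tmp/cats_demo.csv

# Here Ares is an uninitialized awk variable, i.e. the empty string,
# so no line matches and nothing is printed
awk -F ',' '{ if ($1 == Ares) { print } }' /tmp/cats_demo.csv

# With double quotes, "Ares" is a string literal, so the first line prints
awk -F ',' '{ if ($1 == "Ares") { print } }' /tmp/cats_demo.csv   # Ares,1
```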

I'm assuming you would need to escape the embedded double quotes, eg:

file = pipe(paste("awk -F ',' '{ if ($1 == \"Ares\") { print } }'", cats_file)),
                                           ^^    ^^

NOTE: I don't work with R/vroom, so I'm assuming the rest of the OP's code should work once the awk script is modified.


As Ed Morton has mentioned in the comments, the following should also work:

awk -F ',' '{ if ($1 ~ /Ares/) { print } }'

file = pipe(paste("awk -F ',' '{ if ($1 ~ /Ares/) { print } }'", cats_file)),

####
#### or
####

awk -F ',' '$1 ~ /Ares/'

file = pipe(paste("awk -F ',' '$1 ~ /Ares/'", cats_file)),

####
#### or
####

awk -F ',' '$1 == "Ares"'

file = pipe(paste("awk -F ',' '$1 == \"Ares\"'", cats_file)),
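One caveat worth noting (my addition, not from the answer): `$1 ~ /Ares/` is a substring match, so a hypothetical tag such as `Ares2` would also pass the filter. Anchoring the regex with `^` and `$` restricts it to whole-field matches:

```shell
# /Ares/ matches any first field containing "Ares" as a substring,
# so both lines print here
printf 'Ares,1\nAres2,2\n' | awk -F ',' '$1 ~ /Ares/'

# The anchors ^ and $ force a whole-field match, so only Ares,1 prints
printf 'Ares,1\nAres2,2\n' | awk -F ',' '$1 ~ /^Ares$/'   # Ares,1
```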
G. Grothendieck

1) The file linked to is not so large that you need to do anything special. Try this:

library(dplyr)
library(vroom)

Ares <- cats_file |>
  vroom() |>
  filter(tag_id == "Ares") 

2) This also works

Ares <- cats_file |>
  vroom() |>
  filter(grepl("^Ares$", tag_id))

3) This works too and uses no packages:

Ares <- cats_file |>
  read.csv() |>
  subset(tag_id == "Ares") 

4) Another approach, if all the Ares rows are together (which appears to be the case), is to read in just the first column and then use it to pick out the rows needed.

tag_id <- read.csv(cats_file, colClasses = c(NA, rep("NULL", 10)))
rng <- range(grep("Ares", tag_id$tag_id))
Ares <- read.csv(cats_file, skip = rng[1] - 1, nrows = diff(rng) + 1)

An alternative, if we know that the tag_id field comes first, is to replace the tag_id line above with the following, which treats "," as a comment character so that only the first field of each line is read:

tag_id <- read.csv(cats_file, comment.char = ",")