Dplyr filter with str_detect returns empty tibble

1.2k Views Asked by At

I have a dplyr query in R that filters with str_detect to only get the cases that begins with the letters "KS", but it returns an empty tibble. I´m connected to an Oracle database using ROracle.

table <- tbl(con, "TABLE")

table %>% 
  filter(str_detect(COLUMN, "^KS"))

If I however use collect() to generate the tibble, it works:

table <- collect(tbl(con, "TABLE"))

table %>% 
  filter(str_detect(COLUMN, "^KS"))

Why is that? And how can I get it working without the collect? Some of the tables I need are to large to collect.

Update: If I change it to filter for a specific column value, like this: table %>% filter(str_detect(COLUMN, "^KS")), it works. For some reason the regex doesn´t work without the collect().

2

There are 2 best solutions below

3
On BEST ANSWER

Richard Telford pointed me in the right direction with the link in the comments. It works if I instead use:

table <- tbl(con, "TABLE")

table %>% 
  filter(COLUMN %like% "%KS%")
3
On

It could be a problem with the translation of str_detect into a query to the data base :

b <- tbl(con,"pays")
filter(b,str_detect(nom,"^D")) %>% explain
#<SQL>
#SELECT *
#FROM "pays"
#WHERE (STRPOS("nom", '^D') > 0)


#<PLAN>
#Seq Scan on pays  (cost=0.00..5.31 rows=74 width=13)
#  Filter: (strpos(nom, '^D'::text) > 0)

b %>% filter(str_detect(nom, "^D")) %>% count
## Source:   lazy query [?? x 1]
## Database: postgres 9.6.1 [h2izgk@localhost:5432/postgres]
#      n
#  <dbl>
#1     0

Unfortunately STRPOS (I was using PostgreSQL) doesn't recognize the meaning of '^' and the query fails. So you should use another function, I found that `grepl was fine :

filter(b,grepl("^D",nom)) %>% explain
#<SQL>
#SELECT *
#FROM "pays"
#WHERE (("nom") ~ ('^D'))


#<PLAN>
#Seq Scan on pays  (cost=0.00..4.76 rows=54 width=13)
#  Filter: (nom ~ '^D'::text)

b %>% filter(grepl("^D", nom))
## Source:   lazy query [?? x 3]
## Database: postgres 9.6.1 [h2izgk@localhost:5432/postgres]
#  nom      pays  code 
#  <chr>    <chr> <chr>
#1 DANEMARK 101   208  
#2 DOUALA   ""    120  
#3 DAKAR    ""    686  
#4 DJIBOUTI 399   262

And the result is correct. In your example, collect solves the problem because it first downloads the whole table into the R memory and then appplies the str_detect without translating into SQL. But it is not efficient.