How to make lapply faster in a database with more than 30 million rows?


I have a very big database (more than 30 million records) with the following structure:

test <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien", 
"152,10003,11,Bien", "153,10004,11,Mal", 
"154,10005,11,Regular")))

I downloaded the data as one string per row because I was losing a lot of information when trying to read it with fread directly. For instance, I got errors similar to this: "Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line". Because of that, I couldn't read the whole database as expected.
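
For context, a sketch of one way to end up with one string per row (the file name here is just a placeholder):

library(data.table)

# Sketch: readLines() keeps each raw line intact as a single string,
# regardless of how many commas it contains ("raw_export.csv" is a
# hypothetical file name).
test <- as.data.table(list(v1 = readLines("raw_export.csv")))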

What I want to achieve is to create four variables from those strings. This code works perfectly on my dummy database:

test <- test[, `:=`(id = lapply(strsplit(test$v1, split=","), "[", 1),
                    question_id = lapply(strsplit(test$v1, split=","), "[", 2),
                    answer_id = lapply(strsplit(test$v1, split=","), "[", 3),
                    answer = lapply(strsplit(test$v1, split=","), "[", 4))]

On my actual database, this code takes a very long time; I waited more than two hours without getting a result. I suspect this approach is wrong, since I'm not creating atomic columns but list columns, which makes the process slower than it should be:

Classes ‘data.table’ and 'data.frame':  5 obs. of  5 variables:
 $ v1         : chr  "150,10001,11,Bien" "151,10002,11,Bien" "152,10003,11,Bien" "153,10004,11,Mal" ...
 $ id         :List of 5
  ..$ : chr "150"
  ..$ : chr "151"
  ..$ : chr "152"
  ..$ : chr "153"
  ..$ : chr "154"
 $ question_id:List of 5
  ..$ : chr "10001"
  ..$ : chr "10002"
  ..$ : chr "10003"
  ..$ : chr "10004"
  ..$ : chr "10005"
 $ answer_id  :List of 5
  ..$ : chr "11"
  ..$ : chr "11"
  ..$ : chr "11"
  ..$ : chr "11"
  ..$ : chr "11"
 $ answer     :List of 5
  ..$ : chr "Bien"
  ..$ : chr "Bien"
  ..$ : chr "Bien"
  ..$ : chr "Mal"
  ..$ : chr "Regular"
 - attr(*, ".internal.selfref")=<externalptr>
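
For what it's worth, those list columns could be flattened afterwards with unlist (a sketch), but that wouldn't address the repeated strsplit cost:

# Sketch: flatten the list columns created above into plain character
# vectors; each element has length 1, so unlist() keeps the row order.
cols <- c("id", "question_id", "answer_id", "answer")
test[, (cols) := lapply(.SD, unlist), .SDcols = cols]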

I read that parLapply (from the parallel package) tends to be slower than lapply on Windows, so I discarded that solution. Any advice would be much appreciated.

Answer by r2evans:

First, calling strsplit multiple times on the same strings just to get to specific fields can be avoided; data.table's tstrsplit splits once and returns every piece as its own column:

test <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien","152,10003,11,Bien", "153,10004,11,Mal","154,10005,11,Regular")))
test[, c("id","question_id","answer_id","answer") := tstrsplit(v1, ",")][]
#                      v1     id question_id answer_id  answer
#                  <char> <char>      <char>    <char>  <char>
# 1:    150,10001,11,Bien    150       10001        11    Bien
# 2:    151,10002,11,Bien    151       10002        11    Bien
# 3:    152,10003,11,Bien    152       10003        11    Bien
# 4:     153,10004,11,Mal    153       10004        11     Mal
# 5: 154,10005,11,Regular    154       10005        11 Regular

Just this change alone is about 4x faster with this small dataset, and with 1000x as many rows it was still more efficient in limited testing.
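
If you want to verify the timing on your own machine, a rough sketch (microbenchmark is just one convenient timer here; absolute numbers will vary):

library(data.table)
library(microbenchmark)

# Inflate the toy data so the timing is measurable.
big <- as.data.table(list(v1 = rep(test$v1, 1000)))

microbenchmark(
  strsplit_lapply = copy(big)[, `:=`(
    id          = lapply(strsplit(v1, ","), "[", 1),
    question_id = lapply(strsplit(v1, ","), "[", 2),
    answer_id   = lapply(strsplit(v1, ","), "[", 3),
    answer      = lapply(strsplit(v1, ","), "[", 4))],
  tstrsplit_once = copy(big)[, c("id","question_id","answer_id","answer") := tstrsplit(v1, ",")],
  times = 10
)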

(Another comment: using test$v1 inside a test[...] calculation is inefficient and might be slowing you down a little. Perhaps 1% or so ... not much. Within j you can refer to v1 directly.)

Since you say some rows have too many commas, the code above will still fail on those rows. I'll modify the sample data so that we can see the effect here.

test <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien","152,10003,11,Bien", "153,10004,11,Mal","154,10005,11,Regular")))

test[4, v1 := paste0(v1, ",quux")]
test[, c("id","question_id","answer_id","answer") := tstrsplit(v1, ",")][]
# Error in `[.data.table`(test, , `:=`(c("id", "question_id", "answer_id",  : 
#   Supplied 4 columns to be assigned 5 items. Please see NEWS for v1.12.2.

Your current (too-slow) method suggests that you only want the first four fields and are willing to discard everything after them, so I'll suggest a conditional sub.

test[nchar(gsub("[^,]+", "", v1)) > 3, v1 := sub(",[^,]*$", "", v1)]
test[, c("id","question_id","answer_id","answer") := tstrsplit(v1, ",")][]
#                      v1     id question_id answer_id  answer
#                  <char> <char>      <char>    <char>  <char>
# 1:    150,10001,11,Bien    150       10001        11    Bien
# 2:    151,10002,11,Bien    151       10002        11    Bien
# 3:    152,10003,11,Bien    152       10003        11    Bien
# 4:     153,10004,11,Mal    153       10004        11     Mal
# 5: 154,10005,11,Regular    154       10005        11 Regular
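
A related shortcut (a sketch, assuming a reasonably current data.table): tstrsplit accepts a keep= argument, so if you are certain everything past the fourth field is disposable you can select just those fields directly instead of pre-cleaning with sub:

# Sketch: keep only the first four fields of each split; rows with
# fewer fields would be padded with NA (the default fill=).
test[, c("id","question_id","answer_id","answer") := tstrsplit(v1, ",", keep = 1:4)][]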

However, I think all of that is unnecessary, since you can instruct fread to fill when extra columns are discovered (as the warning/error message suggests).

fread("quux.csv")
# Warning in fread("quux.csv") :
#   Stopped early on line 4. Expected 4 fields but found 5. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<153,10004,11,Mal,quux>>
#       V1    V2    V3     V4
#    <int> <int> <int> <char>
# 1:   150 10001    11   Bien
# 2:   151 10002    11   Bien
# 3:   152 10003    11   Bien
fread("quux.csv", fill = TRUE)
#       V1    V2    V3      V4     V5
#    <int> <int> <int>  <char> <char>
# 1:   150 10001    11    Bien       
# 2:   151 10002    11    Bien       
# 3:   152 10003    11    Bien       
# 4:   153 10004    11     Mal   quux
# 5:   154 10005    11 Regular       

This is certainly a safer path, since it allows you to assess if the extra columns are okay to discard, or if perhaps one of the other columns should have had an embedded comma in a quoted string.
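
To tie that back to the real file, a possible sketch ("quux.csv" stands in for your actual file, and I'm assuming the spill-over never goes past one extra column, as in this example):

library(data.table)

dt <- fread("quux.csv", fill = TRUE)
setnames(dt, 1:4, c("id", "question_id", "answer_id", "answer"))

# Inspect the rows that spilled into a fifth field before deciding
# whether to drop that column or repair those lines upstream.
dt[V5 != ""]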