long time to import data using mongo.find.all (rmongodb)


I tried to import data from MongoDB into R using:

mongo.find.all(mongo, namespace, query = query,
               fields = list('_id' = 0, 'entityEventName' = 1, context = 1, 'startTime' = 1),
               data.frame = TRUE)

The command works fine for small data sets, but I want to import 1,000,000 documents.

Using system.time and adding limit = X to the command, I measured the time as a function of the number of documents imported:

system.time(mongo.find.all(mongo, namespace, query = query,
                           fields = list('_id' = 0, 'entityEventName' = 1, context = 1, 'startTime' = 1),
                           limit = 10000, data.frame = TRUE))
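
Something like the following sketch collects the timings for several limits (the timings data frame name is arbitrary; mongo, namespace and query are as above):

limits <- c(1, 100, 1000, 5000, 10000, 50000, 100000, 200000)
timings <- data.frame(
  size = limits,
  time = sapply(limits, function(n) {
    # elapsed wall-clock time for one mongo.find.all call with limit = n
    system.time(
      mongo.find.all(mongo, namespace, query = query,
                     fields = list('_id' = 0, 'entityEventName' = 1, context = 1, 'startTime' = 1),
                     limit = n, data.frame = TRUE)
    )[['elapsed']]
  })
)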

The results:

Data Size   Time (s)
1           0.02
100         0.29
1000        2.51
5000        16.47
10000       20.41
50000       193.36
100000      743.74
200000      2828.33

After plotting the data, I believe the import time grows quadratically with the data size: Import Time = f(Data Size^2)

Time = -138.3643 + 0.0067807*Data Size + 6.773e-8*(Data Size-45762.6)^2

R^2 = 0.999997
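
A quadratic fit of this shape can be reproduced with lm, for example as a sketch (timings is the data frame of measurements above):

fit <- lm(time ~ size + I(size^2), data = timings)  # quadratic model in data size
summary(fit)$r.squared                               # goodness of fit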

  1. Am I correct?
  2. Is there a faster command?

Thanks!

There is 1 best solution below

lm is cool, but I think if you try adding power-3, 4, 5, ... features, you'll also get a great R^2 =) You're overfitting =)

One of R's known drawbacks is that you can't efficiently append elements to a vector (or list): appending an element triggers a copy of the entire object, and what you see here is a consequence of that. In general, when you fetch data from MongoDB you don't know the size of the result in advance, so you iterate through the cursor and grow the resulting list. In older versions of rmongodb this procedure was incredibly slow because of the R behaviour described above. After this pull request, performance became much better. The trick with environments helps a lot, but it is still not as fast as a preallocated list.
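
A minimal illustration of that effect (the size is arbitrary; on recent R versions the gap may be smaller because the interpreter over-allocates growing objects, but preallocation is still faster):

n <- 1e5
system.time({                        # grow the list one element at a time
  grown <- list()
  for (i in seq_len(n)) grown[[length(grown) + 1]] <- i
})
system.time({                        # preallocate once, then fill in place
  prealloc <- vector('list', n)
  for (i in seq_len(n)) prealloc[[i]] <- i
})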

But can we potentially do better? Yes.

1) Simply allow the user to specify the size of the result and preallocate the list, and do this automatically when limit= is passed to mongo.find.all. I filed an issue for this enhancement.
2) Construct the result in C code.

If you know the size of your data in advance, you can do something like this:

cursor <- mongo.find(mongo, namespace, query = query,
                     fields = list('_id' = 0, 'entityEventName' = 1, context = 1, 'startTime' = 1))
# preallocate the result list so it is never copied while being filled
result_lst <- vector('list', NUMBER_OF_RECORDS)
i <- 1
while (mongo.cursor.next(cursor)) {
  # convert the current BSON document to an R list and store it in place
  result_lst[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
  i <- i + 1
}
# bind all documents into a single data.table
result_dt <- data.table::rbindlist(result_lst)
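
If the size is not known up front, one option is to run a count query first to get NUMBER_OF_RECORDS (a sketch, assuming rmongodb's mongo.count()):

# count the documents matching the same query before preallocating
NUMBER_OF_RECORDS <- mongo.count(mongo, namespace, query = query)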