I tried to import data from MongoDB into R using:
    mongo.find.all(mongo, namespace, query = query,
                   fields = list('_id' = 0, 'entityEventName' = 1,
                                 'context' = 1, 'startTime' = 1),
                   data.frame = TRUE)
The command works fine for small data sets, but I want to import 1,000,000 documents.
Using system.time and adding limit = X to the command, I measured the import time as a function of the number of documents:
    system.time(mongo.find.all(mongo, namespace, query = query,
                               fields = list('_id' = 0, 'entityEventName' = 1,
                                             'context' = 1, 'startTime' = 1),
                               limit = 10000, data.frame = TRUE))
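A series of such measurements can be collected in one pass; a sketch of that, where sizes and timings are illustrative names rather than part of the original command:

    # Time the import for each batch size and collect the results
    sizes <- c(1, 100, 1000, 5000, 10000, 50000, 100000, 200000)
    timings <- data.frame(
      size = sizes,
      seconds = sapply(sizes, function(n) {
        system.time(
          mongo.find.all(mongo, namespace, query = query,
                         fields = list('_id' = 0, 'entityEventName' = 1,
                                       'context' = 1, 'startTime' = 1),
                         limit = n, data.frame = TRUE)
        )[["elapsed"]]
      })
    )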
The results:
    Data Size    Time (s)
            1        0.02
          100        0.29
         1000        2.51
         5000       16.47
        10000       20.41
        50000      193.36
       100000      743.74
       200000     2828.33
After plotting the data, I believe the import time grows quadratically with the data size: Time = f(DataSize^2). The fitted model is:

    Time = -138.3643 + 0.0067807 * DataSize + 6.773e-8 * (DataSize - 45762.6)^2
    R^2 = 0.999997
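A fit of this kind can be checked with lm; a minimal sketch, assuming the measurements live in a timings data frame like the one collected above:

    # Fit a raw quadratic polynomial to the measured import times
    fit <- lm(seconds ~ poly(size, 2, raw = TRUE), data = timings)
    summary(fit)$r.squared  # close to 1 if the quadratic relationship holds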
- Am I correct?
- Is there a faster command?
Thanks!
lm is cool, but I think if you try adding power 3, 4, 5, ... features, you'll also get a great R^2 =) You are overfitting =)

One of R's known drawbacks is that you can't efficiently append elements to a vector (or list): appending an element triggers a copy of the entire object. What you see here is a consequence of that effect. In general, when you fetch data from MongoDB, you don't know the size of the result in advance, so you iterate through a cursor and grow the resulting list. In older versions of rmongodb this procedure was incredibly slow because of the copying behaviour described above. After this pull request, performance became much better. The trick with environments helps a lot, but it is still not as fast as a preallocated list.
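A minimal sketch of that effect in isolation, comparing element-by-element growth against a preallocated list (note: R versions from 3.4 onward over-allocate growing vectors, so the gap has narrowed there, but preallocation remains the predictable choice):

    # Grow a list one element at a time: historically each append copied
    # the whole list, giving quadratic total cost
    grow <- function(n) {
      res <- list()
      for (i in seq_len(n)) res[[length(res) + 1L]] <- i
      res
    }

    # Allocate the full list once, then fill it: linear total cost
    prealloc <- function(n) {
      res <- vector("list", n)
      for (i in seq_len(n)) res[[i]] <- i
      res
    }

    n <- 1e5
    system.time(grow(n))      # noticeably slower on older R
    system.time(prealloc(n))  # roughly linear in n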
But can we potentially do better? Yes.

1) Simply allow the user to specify the size of the result and preallocate the list, and do this automatically whenever limit = is passed into mongo.find.all. I filed an issue for this enhancement.

2) Construct the result in C code.

If you know the size of your data in advance, you can:
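A minimal sketch of that approach using rmongodb's cursor API (mongo.find, mongo.cursor.next, mongo.cursor.value, mongo.bson.to.list); n_docs is a placeholder for the result size you know in advance:

    library(rmongodb)

    n_docs <- 1000000L  # assumed: the known result size

    cursor <- mongo.find(mongo, namespace, query = query,
                         fields = list('_id' = 0, 'entityEventName' = 1,
                                       'context' = 1, 'startTime' = 1))

    result <- vector("list", n_docs)     # preallocate the list once
    i <- 1L
    while (mongo.cursor.next(cursor)) {  # iterate the cursor manually
      result[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
      i <- i + 1L
    }
    mongo.cursor.destroy(cursor)

    # optionally flatten the list of documents, e.g. with data.table:
    # result_df <- data.table::rbindlist(result, fill = TRUE)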