Read full collection through spark mongo connector with sequential disk access?


I want to read a full MongoDB collection into Spark using the Mongo Spark connector (Scala API) as efficiently as possible in terms of disk I/O.
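
For context, my read is essentially the stock full-collection load, roughly like the snippet below (connector 2.x-style API; the URI, database, and collection names are just placeholders):

```scala
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

// Placeholder connection details; the real ones don't matter for the question.
val spark = SparkSession.builder()
  .appName("full-collection-read")
  .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
  .getOrCreate()

// Stock full-collection load: the connector picks a partitioner
// (the default one unless overridden) and opens one cursor per partition.
val df = MongoSpark.load(spark)
println(df.count())
```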

After reading the connector docs and code, I understand that the partitioners are all designed to compute the minimum and maximum boundaries of an indexed field. My understanding is (and my tests using explain show) that each cursor will scan the index for document keys within the computed boundaries and then fetch the corresponding documents.
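
Concretely, this is what I mean: whichever partitioner I pick, it is configured around a partition key, and each Spark partition becomes a bounded index scan on that field. Something like the following (option names as I read them in the 2.x docs, reusing the `spark` session from above; treat the values as illustrative):

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Illustrative partitioner settings: the partitioner samples/splits on a
// field and each resulting partition is read as a range scan on that field.
val readConfig = ReadConfig(Map(
  "uri"                                -> "mongodb://host:27017/mydb.mycollection",
  "partitioner"                        -> "MongoSamplePartitioner",
  "partitionerOptions.partitionKey"    -> "_id",
  "partitionerOptions.partitionSizeMB" -> "128"
))

val df = MongoSpark.load(spark, readConfig)
// df.rdd.getNumPartitions reflects the computed _id boundaries
```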

My concern is that this index-scan approach will result in random disk reads and ultimately more IOPS than necessary. In my case the problem is accentuated because the collection is larger than the available RAM (I know that's not recommended). Wouldn't it be orders of magnitude faster to use a natural-order cursor that reads the documents in the order they are stored on disk? How can I accomplish this?
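
To make the comparison concrete, what I have in mind is something closer to the natural-order scan below, written with the plain MongoDB Java driver from Scala (purely illustrative; it obviously gives up the connector's partitioning and Spark parallelism, which is exactly why I'm asking how to get this behaviour through the connector):

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

// A single natural-order cursor: documents are streamed in storage order
// via a $natural sort, rather than fetched one by one through index lookups.
val client = MongoClients.create("mongodb://host:27017")
val coll = client.getDatabase("mydb").getCollection("mycollection")

val cursor = coll.find()
  .sort(new Document("$natural", 1)) // forward natural order
  .iterator()

while (cursor.hasNext) {
  val doc = cursor.next()
  // process doc ...
}
client.close()
```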
