Disk-backed STL container classes?

6.4k Views Asked by At

I enjoy developing algorithms using the STL, however, I have this recurring problem where my data sets are too large for the heap.

I have been searching for drop-in replacements for STL containers and algorithms which are disk-backed, i.e. the data structures on stored on disk rather than the heap.

A friend recently pointed me towards stxxl. Before I get too involved with it... Are any other disk-backed STL replacements available that I should be considering?

NOTE: I'm not interested in persistence or embedded databases. Please don't mention boost::serialization, POST++, Relational Template Library, Berkeley DB, sqlite, etc. I am aware of these projects and use them when they are appropriate for my purposes.

UPDATE: Several people have mentioned memory-mapping files and using a custom allocator, good suggestions BTW, but I would point them to the discussion here where David Abraham suggests that custom iterators would be needed for disk-backed containers. Meaning the custom allocator approach isn't likely to work.

4

There are 4 best solutions below

3
On BEST ANSWER

I have implemented some thing very similar. Implementing the iterators is the most challenging. I used boost::iterator_facade to implement the iterators. Using boost::iterator_facade you can easy adapt any cached on disk data structures to have a STL container interface.

3
On

I don't know much about the subject, but it might be possible to write an STL-like interface to a memory mapped file?

edit: This approach might be suitable if you're trying to get at a specific part of a huge file. If you're attempting to do something with the entire file, you'll likely generate a huge number of page faults as you read in uncached parts of the file.

3
On

If (as you write) you're not interested in persistence the simplest solution would be to increase your heap size and use your operating system's virtual memory facilities. The part of the heap that will not fit in your computer's physical memory will end up being paged on disk, giving you exactly what you want: normal STL access to data often stored on disk. The operating system will take care of caching the most used pages in the physical memory and evicting to disk those you don't use a lot. Your code will remain the same, and you can increase its performance simply by adding more physical memory.

To increase your heap size check your operating system's parameters, like ulimit(1) on Unix systems and System properties - Advanced - Performance - Advanced - Virtual Memory on Windows XP. If you've hit the 32-bit 4GB limit consider moving to a 64 bit architecture or compiling your program for 64 bits.

1
On

I've never had to do anything quite like this, but It might be possible to do what you want to do by writing a custom allocator that makes use of a memory mapped files to back your data.

See boost::interprocesses for docs on their easy to use implementation of memory mapped files, this Dr. Dobbs article for a detailed discussion on writing allocators, and this IEEE Software column for a description of the problem and example code.