Disk-persisted-lazy-cacheable-List ™ in Scala

1k Views Asked by At

I need to have a very, very long list of pairs (X, Y) in Scala. So big it will not fit in memory (but fits nicely on a disk).

  • All update operations are cons (head appends).
  • All read accesses start in the head, and orderly traverses the list until it finds a pre-determined pair.
  • A cache would be great, since most read accesses will keep the same data over and over.

So, this is basically a "disk-persisted-lazy-cacheable-List" ™

Any ideas on how to get one before I start to roll out my own?


Addendum: yes.. mongodb, or any other non-embeddable resource, is an overkill. If you are interested in a specific use-case for this, see the class Timeline here. Basically, I which to have a very, very big timeline (millions of pairs throughout months), although my matches only need to touch the last hours.

4

There are 4 best solutions below

2
On

The easiest way to do something like this is to extend Traversable. You only have to define foreach, and you have full control over the traversal, so you can do things like open and close the file.

You can also extend Iterable, which requires defining iterator and, of course, returning some sort of Iterator. In this case, you'd probably create an Iterator for the disk data, but it's going to be much harder to control things like open files.

Here's one example of a Traversable such as I described, written by Josh Suereth:

class FileLinesTraversable(file: java.io.File) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit = {
     val in = new java.io.BufferedReader(new java.io.FileReader(file))
     try {
       def loop(): Unit = in.readLine match {
          case null => ()
          case line => f(line); loop()
       }
       loop()
     } finally {
       in.close()
     }
  }
}
0
On

You write:

mongodb, or any other non-embeddable resource, is an overkill

Do you know that there are embeddable database engines, including some really small ones? If you know, I'm not sure about your exact requirement and why would you not use them.

You sure that Hibernate + an embeddable DB (say SQLite) would not be enough? Alternatively, BerkeleyDB Java Edition, HSQLDB, or other embedded databases could be an option.

If you do not perform queries on the object themselves (and it really sounds like you do not), maybe serialization would be simpler than object-relational mapping for complex objects, but I've never tried, and I don't know which would be faster. But serialization is probably the only way to be completely generic in the type, assuming that your framework of choice offers a suitable interface to write [T <: Serializable]. If not, you could write [T: MySerializable] after creating your own "type-class" MySerializable[T] (like for instance Ordering[T] in the Scala standard library).

However, you don't want to use standard Java serialization for this task. "Anything serializable" sounds a bad requirement because it suggests the use of serialization for this, but I guess you can relax that to "anything serializable with my framework of choice". Serialization is extremely inefficient in time and space and is not designed to serialize a single object, instead it gives you back a file complete with special headers. I would suggest to use some different serialization framework - have a look here for a comparison.

Additional reasons not to go on the road of a custom implementation

In addition, it sounds like you would be reading the file essentially backward, and that's a quite bad access pattern, performance-wise, on non-SSD disks: after reading a sector, it takes an almost complete disk rotation to access the previous one.

Moreover, as Chris Shain pointed out in the comment above, you'd need to use a page-based solution, and you'd need to cope with variable-sized objects.

3
On

These Java libraries may contain what you need. They aim to store entries more efficiently than standard Java collections.

  • github.com/OpenHFT/Chronicle-Queue
  • github.com/OpenHFT/Chronicle-Map
1
On

If you don't want to step up to one of the embeddable DBs, how about a stack in memory mapped files?

  • A stack seems to meet your desired access characteristics. (Push a bunch of data, and iterate over the most recently pushed data frequently)
  • You can use Java's MappedByteBuffer directly from Scala. You get to address the file like its memory, without trying to actually load the file into memory.
  • You'd get some caching for free from the OS this way, since the mapped file would function like virtual memory. Recently written/accessed pages would stay in the OSs file cache until the OS saw fit to flush them (or you flushed them manually) back to disk
  • You could build your stack from either end of the file if you're worried about sequential read performance, but if you're usually reading data you just wrote I wouldn't expect that would be a problem since it will still be in memory. (Though if you're reading data that youve written over hours/days across pages then it might be a problem)
  • A file addressed in this way is limited in size to 2GB even on a 64 bit JVM, but you can use multiple files to overcome this limitation.