C++ file-backed, tree-like data structure


I'm currently using Google's Protocol Buffers (protobuf) library to store and load data from disk. It's convenient because it's fast, provides a nice way of defining my data structures, and lets me compress/decompress the data when writing to or reading from the file.

So far that has served me well. The problem is that I now have to deal with a data structure that is several hundred gigabytes in size, and with protobuf I can only write and load the whole file at once.

The data structure looks something like this:

struct Layer {
  std::vector<float> weights_;
  std::vector<size_t> indices_;
};

struct Cell {
  std::vector<Layer> layers_;
};

struct Data {
  int some_header_fields;
  ...
  std::vector<Cell> cells_;
};

There are two parts to the algorithm.

In the first part, data is added. The access pattern is random rather than sequential: weights and indices might be added to any layer of any cell. No data is removed.

In the second part, the algorithm accesses one cell at a time and processes the data in it, but the access order of cells is random.

What I'd like would be something similar to protobuf, but backed by some file storage that doesn't need to be serialized/deserialized in one go.

Something that would allow me to do things like

Data.cells_[i].layers_[j].FlushToDisk();

at which point the weights_ and indices_ arrays/lists would write their current data to disk (and free the associated memory) but retain their indices, so that I can add more data to them as I go.

And later during the second part of the algorithm, I could do something like

Data.cells_[i].populate(); //now all data for cell i has been loaded into ram from the file
... process cell i...
Data.cells_[i].dispose();  //now all data for cell i is removed from memory but remains in the file
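
To make the requested behaviour concrete, here is a minimal sketch of one possible approach, assuming an append-only file in which every flush writes one chunk and an in-memory index remembers where each chunk of each (cell, layer) lives. The names `ChunkStore`, `Flush` and `PopulateCell` are hypothetical and do not come from protobuf or any other library. The sketch leaves out compression and thread safety, keeps the index only in memory, and writes raw `float` and `size_t` bytes, so the file is only readable on the same platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// In-memory layer, mirroring the struct in the question.
struct Layer {
  std::vector<float> weights_;
  std::vector<std::size_t> indices_;
};

// Hypothetical append-only store: each flush appends one chunk to the file,
// so the same layer can be flushed many times during the first phase.
class ChunkStore {
 public:
  explicit ChunkStore(const std::string& path)
      : file_(path, std::ios::in | std::ios::out | std::ios::binary |
                        std::ios::trunc) {}

  // Appends the layer's buffered data as one chunk and frees its memory.
  void Flush(std::size_t cell, std::size_t layer, Layer& data) {
    if (data.weights_.empty() && data.indices_.empty()) return;
    file_.seekp(0, std::ios::end);
    const std::streamoff offset = file_.tellp();
    // Chunk layout: [weight count][index count][weights...][indices...]
    const std::uint64_t counts[2] = {data.weights_.size(),
                                     data.indices_.size()};
    file_.write(reinterpret_cast<const char*>(counts), sizeof counts);
    file_.write(reinterpret_cast<const char*>(data.weights_.data()),
                static_cast<std::streamsize>(counts[0] * sizeof(float)));
    file_.write(reinterpret_cast<const char*>(data.indices_.data()),
                static_cast<std::streamsize>(counts[1] * sizeof(std::size_t)));
    assert(file_.good());
    chunks_[std::make_pair(cell, layer)].push_back(offset);
    // Swapping with empty temporaries releases the allocations.
    std::vector<float>().swap(data.weights_);
    std::vector<std::size_t>().swap(data.indices_);
  }

  // Reads every chunk of one cell and concatenates the chunks of each layer
  // into contiguous vectors. "Disposing" is just destroying the result.
  std::vector<Layer> PopulateCell(std::size_t cell) {
    std::vector<Layer> layers;
    for (auto it = chunks_.lower_bound(std::make_pair(cell, std::size_t{0}));
         it != chunks_.end() && it->first.first == cell; ++it) {
      const std::size_t layer = it->first.second;
      if (layers.size() <= layer) layers.resize(layer + 1);
      Layer& out = layers[layer];
      for (std::streamoff offset : it->second) {
        std::uint64_t counts[2];
        file_.seekg(offset);
        file_.read(reinterpret_cast<char*>(counts), sizeof counts);
        const std::size_t w0 = out.weights_.size();
        const std::size_t i0 = out.indices_.size();
        out.weights_.resize(w0 + counts[0]);
        out.indices_.resize(i0 + counts[1]);
        file_.read(reinterpret_cast<char*>(out.weights_.data() + w0),
                   static_cast<std::streamsize>(counts[0] * sizeof(float)));
        file_.read(reinterpret_cast<char*>(out.indices_.data() + i0),
                   static_cast<std::streamsize>(counts[1] * sizeof(std::size_t)));
        assert(file_.good());
      }
    }
    return layers;
  }

 private:
  std::fstream file_;
  // (cell, layer) -> file offsets of that layer's chunks, in flush order.
  std::map<std::pair<std::size_t, std::size_t>, std::vector<std::streamoff>>
      chunks_;
};
```

With this layout, the first phase would call `Flush(i, j, layer)` whenever a layer's buffer grows too large, and the second phase would call `PopulateCell(i)`, process the returned vectors, and let them go out of scope. Because a cell's chunks end up scattered across the file, populating it costs one seek per chunk; a compaction pass between the two phases that rewrites each cell contiguously would probably be needed for good read performance.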

In addition to storing data on disk, I'd like it to support compression of the data. It should also allow multithreaded access.

What library would enable me to do this? Or can I still use protobuf for this in some way? (I guess not, because I would not write the data to disk in a serialized fashion)

//edit: performance is very important. So when I populate a cell, I need the data to be in main memory and in contiguous arrays.
