MongoDB: how to create versioned "snapshots" for multiple collections with cross-references

I have an application built on mongodb using mongoengine. Users will mostly interact with the "live" version of the database, but I need to have the ability to save snapshots of the current live state (that will be treated as read-only by the application after that point). Think of it like tags on a version control repo.

The application has multiple collections with cross-references. For simplicity, let's assume two simple collections:

from mongoengine import Document, ReferenceField, StringField, ListField

class Measurement(Document):
    name = StringField()

class Sample(Document):
    name = StringField()
    measurements = ListField(ReferenceField(Measurement))

What is the best way to create a tagged copy of all documents in all collections?

This is different from most questions about versioning in mongo, which are concerned with versioning individual documents. I want to keep versions of the state of the entire database. Approaches I've thought about:

1. Copy the entire database to a new one

This is probably the easiest, but it assumes I have write permission on an arbitrary set of databases on the MongoDB server, which I can't guarantee.

2. Copy each collection to a new one

So to make a tag "v1", copy all documents in the "measurement" collection to a new collection "measurement.v1". Either use getCollectionNames to discover the list of saved tags or use a separate collection for bookkeeping.

The biggest advantage here is that the ReferenceFields just store ObjectIds, so they wouldn't have to be updated. A drawback (shared with #1) is that every tag is a complete copy, even if most of the contents haven't changed. My application logic also has to determine which collection to query based on the tag the user is looking at.
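A minimal sketch of approach 2, assuming `db` is a pymongo `Database` and that the live collections are named `measurement` and `sample` (mongoengine's default names for the two classes above). The helper names `snapshot_name`, `create_snapshot`, and `list_tags` are hypothetical. The copy uses the `$out` aggregation stage, so the documents are copied on the server:

```python
def snapshot_name(base, tag=None):
    """Return the collection to query: the live one, or its tagged copy."""
    return base if tag is None else f"{base}.{tag}"


def create_snapshot(db, tag, bases=("measurement", "sample")):
    """Copy each live collection into '<base>.<tag>' using $out.

    `db` is assumed to be a pymongo Database. $out replaces the target
    collection if it already exists, so an existing tag is never reused.
    """
    existing = set(db.list_collection_names())
    targets = [snapshot_name(base, tag) for base in bases]
    clashes = [name for name in targets if name in existing]
    if clashes:
        raise ValueError(f"tag {tag!r} already exists: {clashes}")
    for base, target in zip(bases, targets):
        # A pipeline consisting only of $out copies every document.
        db[base].aggregate([{"$out": target}])


def list_tags(db, base="measurement"):
    """Discover saved tags from the collection names (no bookkeeping collection)."""
    prefix = base + "."
    names = db.list_collection_names()
    return sorted(name[len(prefix):] for name in names if name.startswith(prefix))
```

Because the ObjectIds are preserved, references inside `sample.v1` resolve against `measurement.v1` without rewriting. On the mongoengine side, the `switch_collection` context manager from `mongoengine.context_managers` is one way to point `Sample` at `snapshot_name("sample", "v1")` for the duration of a query.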

3. Add a 'tags' field to every document

tags = ListField(StringField())

To create a new tag, just push that string into the tags list for every 'current' document. Documents that haven't changed since the last tag will carry multiple tags and won't take up any extra space. But now how do I handle changes to the live state?

If I want to save an update to sampleA, I save it as a new document with a new ID, stripping out any tags. But if I want to save an update to measurementA, I also have to update every sample with a reference to it.
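The cascade described above can be seen in a small in-memory sketch. It uses plain dicts instead of MongoDB, integer ids instead of ObjectIds, and a hypothetical `current` flag to mark the live version of each document; the function names are made up for illustration and are not mongoengine API:

```python
import itertools

_ids = itertools.count(1)  # stand-in for ObjectId generation
_META = ("_id", "kind", "tags", "current")


def insert(store, kind, **fields):
    """Add a new live, untagged document and return its id."""
    doc = {"_id": next(_ids), "kind": kind, "tags": [], "current": True, **fields}
    store[doc["_id"]] = doc
    return doc["_id"]


def tag_current(store, tag):
    """Create a tag by pushing it onto every live document."""
    for doc in store.values():
        if doc["current"]:
            doc["tags"].append(tag)


def update(store, doc_id, **changes):
    """Copy-on-write update: a tagged document is never modified in place."""
    old = store[doc_id]
    if not old["tags"]:
        old.update(changes)  # untagged, so editing in place is safe
        return doc_id
    old["current"] = False   # the tagged version stays frozen
    fields = {k: v for k, v in old.items() if k not in _META}
    new_id = insert(store, old["kind"], **{**fields, **changes})
    # Cascade: every live document that referenced the old id must now
    # reference the new one, which may fork that document as well.
    referrers = [d for d in list(store.values())
                 if d["current"] and doc_id in d.get("measurements", [])]
    for ref in referrers:
        new_refs = [new_id if m == doc_id else m for m in ref["measurements"]]
        update(store, ref["_id"], measurements=new_refs)
    return new_id
```

With one measurement and one sample tagged as "v1", a single call to `update` on the measurement leaves four documents in the store: the frozen measurement and sample that carry the tag, plus a new live measurement and a new live sample that points at it.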

3b. Create a registry for tags

I.e. create a new collection with a document for each tag that somehow records the database state for that tag.

class TaggedVersion(Document):
    tag = StringField()
    measurements = ListField(ObjectIdField())
    samples = ListField(ObjectIdField())

This seems to have all the problems of #3 and also makes queries harder.
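To illustrate why queries get harder with a registry, here is a rough sketch of the extra step every read would need. It assumes the registry document has been loaded as a plain dict with the fields shown above; `tagged_query` is a hypothetical helper, not part of mongoengine or pymongo:

```python
def tagged_query(entry, kind, query=None):
    """Build a filter matching only the documents recorded for one tag.

    `entry` is a registry document as a plain dict, `kind` is
    "measurements" or "samples", and `query` is the caller's own filter.
    """
    in_tag = {"_id": {"$in": list(entry[kind])}}
    return {"$and": [in_tag, query]} if query else in_tag


# Hypothetical usage with pymongo, after fetching the registry entry:
#   entry = db.tagged_version.find_one({"tag": "v1"})
#   db.sample.find(tagged_query(entry, "samples", {"name": "sampleA"}))
```

Every query against a tag then costs one extra lookup and carries the full id list. Since each registry document holds every id in the snapshot, it also grows with the size of the database and could eventually hit MongoDB's 16 MB document size limit.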

4. Use a complex ID field that includes the tag

class IdTag(EmbeddedDocument):
    id = ObjectIdField()
    tag = StringField()

class Measurement(Document):
    id = EmbeddedDocumentField(IdTag, primary_key=True)
    name = StringField()
...

Creating a new tagged version requires copying all documents with no tag and assigning the new tag string to BOTH the id.tag attribute AND the tag attribute of all ReferenceFields. After that point, the ReferenceFields should "just work".
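The copy step for approach 4 might look roughly like this on raw documents. The sketch assumes live documents carry `tag: None` in their `_id`, and that each reference is stored as the same `{"id": ..., "tag": ...}` sub-document as the target's `_id` (I have not verified that this is how mongoengine serializes a reference to an embedded-document primary key). `tag_copy` and `make_tag` are hypothetical names:

```python
import copy


def tag_copy(doc, tag, ref_fields=()):
    """Return a tagged copy of a live document; the original is untouched."""
    new = copy.deepcopy(doc)
    # Rebuild the keys in a fixed order: MongoDB compares embedded
    # documents field by field, in order, so {"id", "tag"} and
    # {"tag", "id"} would count as different _id values.
    new["_id"] = {"id": doc["_id"]["id"], "tag": tag}
    for field in ref_fields:
        new[field] = [{"id": ref["id"], "tag": tag} for ref in doc[field]]
    return new


def make_tag(live_docs, tag, ref_fields=()):
    """Copy every untagged (live) document under the new tag."""
    return [tag_copy(d, tag, ref_fields)
            for d in live_docs if d["_id"]["tag"] is None]
```

The resulting list could then be written back to the same collection (for example with pymongo's `insert_many`), so live and tagged documents share a collection and differ only in `_id.tag`.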

5. Something better?

What is the best approach, and what pitfalls will I run into?

I think none of the solutions I've come up with will naturally be atomic without employing transactions. That's OK at this stage; the application is single-user and so race conditions are a negligible concern.
