The basic data model consists of 3 objects:
A: ObjectId id, String name, List of B objects
B: ObjectId id, String name
C: ObjectId id, reference to B object
Objects of type A and B will get created very infrequently (maybe one such object per month), and every A object will contain only a few (1-5) B objects. If I use embedding, it would look like this:
{
    id : ObjectId(...),
    name : "some A name",
    bList : [
        {
            id : ObjectId(...),
            name : "some B name"
        },
        {
            id : ObjectId(...),
            name : "some other B name"
        }
    ]
}
Objects of type C will always look like that:
{
    id : ObjectId(...),
    bReference : ObjectId(...)
}
I can't embed C into B, because the number of C objects is practically unbounded (millions of entries connected to a single B object).
A and B objects will get modified maybe once per month. C objects will never get modified. But C objects will get inserted very often (peak load should be about 20 inserts per second).
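Just to make the write pattern concrete, each of those inserts would be nothing more than this (sketch; someBId stands for the id of the referenced B object):
// sketch of a single C insert; someBId is a placeholder for the referenced B object's id
db.c.insert({ bReference : someBId })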
The problem/question:
Every time I query for a document of type C, I need some information from the referenced B document (and sometimes additionally from the A document). For example, I need to check whether I am allowed to access the C object(s):
c = db.c.findOne({ ... })
a = db.a.findOne({ "bList.id" : c.bReference })
// do something with the access permission information from within a
If I need information from both A and B, this seems pretty good. Right?
But what if I only need information from the B document (which is pretty often the case)?
c = db.c.findOne({ ... })
b = db.a.findOne({ "bList.id" : c.bReference }, { "bList.$" : 1 })
If I query for multiple C objects, they will always(!) reference the same B object. With a non-embedded database design, the query would look like this:
c = db.c.findOne({ ... })
b = db.b.findOne({ _id : c.bReference })
The most important question is: if I do this very frequently, will there be a huge performance difference between those two variants?
Bonus question:
Can I have a unique constraint across all A objects? I want all those a.bList.id ObjectIds to be unique (I know they should(!) be globally unique, but having a constraint at the DB level makes me feel safer).
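What I have in mind is roughly this (just a sketch; if I understand the docs correctly, for a multikey index the unique constraint only applies across separate documents, not within a single document's bList array):
// sketch: unique index on the embedded id field (ensureIndex is called createIndex in newer shells)
db.a.ensureIndex({ "bList.id" : 1 }, { unique : true })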
The answer to your first question is, as usual: it depends. If the objects are large (say hundreds of kB each), there could be a slight performance hit. If the objects are relatively small, there shouldn't be a big difference.
Of course, 'slight', 'large', 'small', etc. are all very vague, so let me elaborate: since there are very few A and B objects and they are queried very often, they will probably remain in RAM, which is good, so the only potential bottlenecks are the serializer/deserializer and the network.
Let's say you're doing 1000 queries per second and the objects are 1k in size: that's 1MB/s - still not a particularly high load. If the objects are 100k in size, we'd be talking about 100MB/s, or roughly 1Gbit/s, which is a completely different game - certainly not a high load for a powerful server, but it will likely choke a $20/m virtual private server. Then again, even at that rate, depending on your consistency constraints, you might want to cache those objects for a second or two, dramatically reducing the network load for your reads.
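If you go the caching route, a minimal sketch could look like the following (application-side JavaScript; bCache, getB and the 2-second TTL are illustrative names and values, not anything MongoDB prescribes):
// minimal sketch of a short-lived, per-id cache for B documents;
// the 2000 ms TTL is arbitrary - tune it to your consistency requirements
var bCache = {};  // maps bId (coerced to string) to { doc : ..., fetchedAt : ... }

function getB(bId) {
    var now = Date.now();
    var entry = bCache[bId];
    if (!entry || now - entry.fetchedAt > 2000) {
        entry = { doc : db.b.findOne({ _id : bId }), fetchedAt : now };
        bCache[bId] = entry;
    }
    return entry.doc;
}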
In a nutshell, at 20 operations per second, there's not much to worry about.
As you can see from my other answers here, I very often advise against embedding, but since that data is modified very rarely, embedding should be fine.