I've found some advice for setting up tagging systems in relational and document databases, but nothing for graph/multi-model databases.
I am trying to set up a tagging system for documents (let's call them "articles") in ArangoDB. I can think of two obvious ways to store tags in a multi-model (graph+document) database like Arango:
- as an array within each article document (document database-style)
- as a separate document class with each tag as a unique document and edges connecting tag documents to the article documents (something closer to relational database-style)
Are these in fact the two main ways to do this? Neither seems ideal. For example:
- If I'm storing tags within each article document, I can index the tags and presumably ArangoDB is optimizing the space they use. However, I can't use graph features to link or traverse tags (or I have to do it separately).
- If I'm storing tags as separate tag documents, it seems like extra overhead (an extra query) when I just want to get a list of tags on a document.
Which leads me to an explicit question: with regard to the latter option, is there any simple way to automatically make connected 'tag' documents show up within the article documents? E.g. have an array property that somehow 'mirrored' the tag.name properties of the connected tag documents?
General advice is also welcome.
@Joachim Bøggild linked to Mike Williamson: https://mikewilliamson.wordpress.com/2015/07/16/data-modeling-with-arangodb/
I would agree with Williamson that "Compact by default" is generally the way to go. You can then extract vertices (a.k.a. nodes) from properties if/when the actual need emerges. It also avoids creating an overly interconnected graph structure, which would be slow for all kinds of traversal queries.
However, in this case, I think Tag vertices (i.e. "documents", in your terminology) are worth having, because you can then store metadata on the tag (such as a count) and connect it to other tags and sub-tags. That seems very useful and immediately foreseeable in the particular case of tags. A vertex that you can add more relationships to if/when you need them is also very extensible, so you keep your future options open (or at least easier to reach).
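To make the metadata and sub-tag point concrete, here is a minimal sketch in plain Python. It does not talk to ArangoDB at all: the `Tag` class, the `tags` dictionary and the `subtag_of` edge list are hypothetical stand-ins for a tag collection and an edge collection, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for a "tags" document collection and a
# "subtag_of" edge collection; this is not a real ArangoDB driver API.

@dataclass
class Tag:
    key: str
    name: str
    count: int = 0  # metadata stored on the tag itself


tags = {
    "db": Tag("db", "databases"),
    "graph": Tag("graph", "graph-databases"),
}

# Edges as (from_key, to_key) pairs: "graph" is a sub-tag of "db".
subtag_of = [("graph", "db")]


def tag_article(tag_key: str) -> None:
    """Record one more use of a tag by bumping its counter."""
    tags[tag_key].count += 1


def parents(tag_key: str) -> list[str]:
    """Names of the tags that tag_key is a sub-tag of."""
    return [tags[to].name for frm, to in subtag_of if frm == tag_key]


tag_article("graph")
print(tags["graph"].count, parents("graph"))
```

With an embedded array of strings there is no natural place for `count` or for the sub-tag relationship; with a tag vertex both are just another property or another edge.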
It seems Williamson agrees that Tags warrant special consideration:
The original question by @ropeladder raises the main objection: that separate tag documents would require extra overhead (an extra query). I think it might be premature optimization to think too much about performance at this stage. After all, the extra query might be fast, or it might actually be joined with and included in the original query. In any case, I would quote this:
See also this article on some anti-patterns (dense vs sparse graphs), to supplement Williamson's points: https://neo4j.com/blog/dark-side-neo4j-worst-practices/
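On the extra-query concern, and the question of "mirroring" tag names into the article, here is a rough sketch of what folding the lookup into the original query could look like. The Python part uses plain dictionaries as stand-ins for the collections. The AQL string is included only for comparison and is not executed; the collection names `articles`, `tags` and `has_tag` are my assumptions, not something from the question.

```python
# Hypothetical data: plain dicts stand in for the "articles" and "tags"
# document collections, and a list of pairs for the "has_tag" edge collection.
articles = {"a1": {"title": "Modeling tags"}, "a2": {"title": "Untagged"}}
tags = {"t1": {"name": "arangodb"}, "t2": {"name": "graphs"}}
has_tag = [("a1", "t1"), ("a1", "t2")]  # (article_key, tag_key)


def articles_with_tags() -> list[dict]:
    """Return every article with a derived 'tags' list of tag names."""
    result = []
    for key, doc in articles.items():
        names = [tags[t]["name"] for a, t in has_tag if a == key]
        result.append({**doc, "tags": names})
    return result


# Roughly the same shape as a single AQL query with a traversal subquery.
# Not executed here; the collection names are assumptions.
AQL_SKETCH = """
FOR a IN articles
  RETURN MERGE(a, {
    tags: (FOR t IN 1..1 OUTBOUND a has_tag RETURN t.name)
  })
"""

print(articles_with_tags())
```

The `tags` list is computed at read time rather than stored, so there is nothing to keep in sync when a tag is renamed. Whether that is fast enough for your data is something to measure rather than assume.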
Extra section included for completeness, to those who want to dive a little bit deeper into this question:
Answering Williamson's own criteria for deciding whether something should be a vertex/node on its own, instead of leaving it as a property on the document vertex:
Yes. Browsing tags available in the system could be useful.
Unsure. Likely not.
Yes, probably. A user could edit it separately. Maybe an admin/moderator wants to clean up the tag names (correct spelling errors), or clean up their structure (if you have sub-tags).
Yes. They could. Sub-tags, or kinds of content other than documents. Actually, it's very useful to be able to click a tag and immediately see all documents with that tag. That would presumably be sub-optimal with tags stored as a property array on each document, whereas a graph database is fundamentally optimized for querying the vertices (a.k.a. nodes) adjacent to a given vertex.
Yes. A tag could/should exist even if the last tagged document was deleted. Someone might want to use that tag later on, and it represents domain information you might want to preserve.
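The "click a tag and see all documents" case mentioned above can be sketched as follows. This is plain Python over made-up data, meant only to contrast the two access patterns; it says nothing about how ArangoDB executes either one internally.

```python
from collections import defaultdict

# Option 1: tags embedded as an array on each (hypothetical) article.
embedded = {
    "a1": {"tags": ["arangodb", "graphs"]},
    "a2": {"tags": ["graphs"]},
    "a3": {"tags": []},
}


def by_tag_embedded(name: str) -> list[str]:
    """Scan every article and keep those whose array contains the tag."""
    return [key for key, doc in embedded.items() if name in doc["tags"]]


# Option 2: tag vertices with edges, stored here as an adjacency map
# from tag name to the keys of the articles it is attached to.
edges = defaultdict(list)
for key, doc in embedded.items():
    for name in doc["tags"]:
        edges[name].append(key)


def by_tag_edges(name: str) -> list[str]:
    """Follow the tag's edges; only adjacent articles are visited."""
    return list(edges.get(name, []))


print(by_tag_embedded("graphs"), by_tag_edges("graphs"))
```

Both functions return the same articles. The difference is that the first looks at every article while the second starts from the tag. As the question notes, the embedded array can be indexed, which would narrow that gap, so treat this as an illustration of the access pattern and not as a benchmark.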