Highly scalable tags on Google App Engine (Python)

1.2k Views Asked by At

I have a lot of (e.g.) posts, that marked with one or more tags. Post can be created or deleted, and also user can make search request for one or more tags (combined with logical AND). First idea that came to my mind was a simple model

class Post(db.Model):
  #blahblah
  tags = db.StringListProperty()

Implementation of create and delete operations is obvious. Search is more complex. To search for N tags it will do N GQL queries like "SELECT * FROM Post WHERE tags = :1" and merge the results using the cursors, and it has terrible performance.

Second idea is to separate tags in different entities

class Post(db.Model):
    #blahblah
    tags = db.ListProperty(db.Key) # For fast access

class Tag(db.Model):
    name = db.StringProperty(name="key")
    posts = db.ListProperty(db.Key) # List of posts that marked with tag

It takes Tags from db by key (much faster than take it by GQL) and merge it in memory, I think this implementation has a better performance than the first one, but very frequently usable tags can exceed maximal size that allowed for single datastore object. And there is another problem: datastore can modify one single object only ~1/sec, so for frequently usable tags we also have a bottleneck with modify latency.

Any suggestions?

2

There are 2 best solutions below

1
On

Probably a possible solution is to take your second example, and modify it in a way that would permit efficient queries on larger sets. One way that springs to mind is to use multiple database entities for a single tag, and group them in such a way as you would seldom need to get more than a few groups. If the default sort order (well lets just call it the only permitted) is by post-date, then fill the tag group entities in that order.

class Tag(db.Model):
    name = db.StringProperty(name="key")
    posts = db.ListProperty(db.Key) # List of posts that marked with tag
    firstpost = db.DateTimeProperty()

When adding or removing tags to a group, check to see how many posts are in that group, if the post you are adding would make the post have more than, say 100 posts, split it into two tag groups. If you are removing a post so that the group would have fewer than 50 posts, steal some posts from a previous or next group. If one of the adjacent groups has 50 posts also, just merge them together. When listing posts by tag (in post-date order), you need only get a handful of groups.

That doesn't really resolve the high-demand tag problem.

Thinking about it, it might be okay for inserts to be a bit more speculative. Get the latest tag group entries, merge them and place a new tag group. The lag in the transactions might actually not be a real problem.

0
On

To further Nick's questioning. If it is a logical AND using multiple tags in they query. Use tags = tag1 AND tags = tag2 ... set membership in a single query is one of datastore's shining features. You can achieve your result in one query.

http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html#Properties_With_Multiple_Values