Sharding multiple tables with no common column

87 Views Asked by At

I wanted to understand sharding in case of multiple tables which might be used for QnA websites like Quora/SO. Let's assume that users can ask questions, give answers and comment on both questions and answers.

Suppose I have 3 tables (questions, answers, comments) and they have columns like this:

  1. questions -> question_id, user_id, text, timestamp
  2. answers -> answer_id, user_id, question_id, text, timestamp (can be extended differently in future so keeping separate tables for questions and answers)
  3. comments -> comment_id, user_id, parent_id, parent_type (enum with 2 values: 'question', 'answer'), timestamp

Assume that we are sharding a Postgres database. Now I want to optimise for viewing a page with a single question with all the answers and their comments.

One approach is to split the "questions" and "answers" tables based on the "question_id" column. This way, all the answers related to a particular question are stored in the same shard.

However, when dealing with the "comments" table I am unable to ensure that all comments for every answer to a question are stored in a single shard (avoiding the need for multi-shard queries when fetching all the comments for a question's page). Sharding by parent_id does not work since multiple comments on different answers to a question will have different shard in that case.

Is there a better database design where I would not face this problem? Any insight on how Quora or SO actually handle "comments" to avoid/solve this problem is greatly appreciated.

Resources I've looked at: SO schema Sharding at quora

1

There are 1 best solutions below

0
On

Don't think parent_id; instead think question_id versus question_id + answer_id. That way, the title question is eliminated -- you will have question_id in all 3 tables.

Comments need to be sharded, too. Split all 3 tables by question_id (or a hash of such). This assumes that Comments know the question_id that they are associated with.

Sharding will allow you to handle many more queries than without sharding. However, there will be maintenance issues.

If your system gets so busy that you need more data splitting, it may require some downtime to migrate some of the questions+answeres+comments to a new shard. On the other hand, since questions become stale over time, simply put new questions+answeres+comments on a new server and let the old servers fade away.

With that in mind, I suggest you start with a single shard and gradually grow to more shards.

The front-end web servers can easily be repeated on multiple servers. The will all be "identical". But they need to know the algorithm for getting from the question_id to the shard it is on. This should be rather straightforward, however when you add a new shard, the code (or table) that controls the redirection needs updating.