Does TiDB's range-based sharding have actual read-workload benefits?


I understand that TiDB, unlike DynamoDB, chose range-based sharding rather than hash-based sharding. One explanation I have seen is that range-based sharding gives good data locality for range predicates such as id > 4 AND id < 10 and supports table-scan read workloads. What I don't understand is this: to support high-concurrency writes, we still need to scatter the data, for example with AUTO_RANDOM IDs. But then rows written around the same time, which are typically the rows we want to query together, cannot end up in the same region. In other words, it seems we must either optimize for writes and accept scatter-gather reads, or optimize for reads so that a query scans only a few regions, at the cost of a write hot spot.
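To make the trade-off concrete, here is a minimal sketch of the two table shapes being contrasted (table and column names are made up for illustration; AUTO_RANDOM requires a BIGINT clustered primary key in TiDB):

```sql
-- Read-optimized: clustered on a monotonically increasing key.
-- A range predicate like id > 4 AND id < 10 touches few regions,
-- but sequential inserts all land in the last region (write hot spot).
CREATE TABLE events_read (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    payload VARCHAR(255)
);

-- Write-optimized: AUTO_RANDOM scatters the high bits of the id,
-- spreading inserts across regions, but rows written around the
-- same time no longer sit next to each other.
CREATE TABLE events_write (
    id BIGINT PRIMARY KEY AUTO_RANDOM,
    payload VARCHAR(255)
);
```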


There are 2 best solutions below


Because TiDB fetches data in parallel, it doesn't matter too much whether the data sits in a single region or is spread across multiple regions. What matters more is a good index.
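For example (hypothetical table and column names), a secondary index on the columns you actually query by turns the read into an index range scan, regardless of how the base rows are distributed:

```sql
-- With a secondary index on (user_id, created_at), fetching one
-- user's recent rows is an index range scan even if the table's
-- row IDs are randomly scattered across regions.
CREATE INDEX idx_user_time ON events (user_id, created_at);

EXPLAIN SELECT * FROM events
WHERE user_id = 42
  AND created_at > NOW() - INTERVAL 1 DAY;
```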

Another tool that you can use is partitioning.
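A hedged sketch of what partitioning can look like here (schema invented for illustration; TiDB supports MySQL-style range partitioning, and time-bounded scans then prune the irrelevant partitions):

```sql
-- Range partitioning by month: a query constrained to one month
-- only reads that month's partition.
CREATE TABLE logs (
    id BIGINT,
    created_at DATETIME,
    message VARCHAR(255)
)
PARTITION BY RANGE (YEAR(created_at) * 100 + MONTH(created_at)) (
    PARTITION p202401 VALUES LESS THAN (202402),
    PARTITION p202402 VALUES LESS THAN (202403),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
```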

Disclaimer: I'm working for PingCAP, the company behind TiDB.


First things first: TiDB does not use a sharding scheme. Its region scheme is quite different from sharding, which is information-lossy. For example, under sharding you cannot build a globally unique secondary index unless the sharding key is a prefix of the composite index.
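To illustrate the contrast (table names invented): in TiDB a plain UNIQUE index is global, enforced across the whole table, with no sharding-key prefix required.

```sql
-- Uniqueness on email is enforced table-wide, even though the
-- primary key (and therefore the row placement) is unrelated to it.
CREATE TABLE users (
    id BIGINT PRIMARY KEY AUTO_RANDOM,
    email VARCHAR(255),
    UNIQUE KEY uk_email (email)
);
```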

IMO, TiDB chose range-based regions because they can handle both write-intensive and read-intensive scenarios.

When you want to optimize for reads, you can rely on range-based regions to keep related rows together. If you instead face a heavy write workload, you can use a random clustered key to get maximum write throughput.
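A sketch of the two write-scattering options TiDB offers (table names hypothetical): AUTO_RANDOM for tables with a BIGINT clustered primary key, and SHARD_ROW_ID_BITS for tables where TiDB generates a hidden row ID.

```sql
-- 1) Random clustered key via AUTO_RANDOM:
CREATE TABLE metrics (
    id BIGINT PRIMARY KEY AUTO_RANDOM,
    val DOUBLE
);

-- 2) No integer primary key: shard the hidden row ID instead,
--    and pre-split regions so writes scatter from the start.
CREATE TABLE metrics2 (
    name VARCHAR(64),
    val DOUBLE
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 4;
```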

However, if regions were hash-based, a read-optimized table would not be possible at all, since every range scan would have to touch every shard.