Consider a hypothetical HBase table.
- The key must encode a 3-tuple `(k, m, n)` of integers between 0 and 1000.
- The typical read is a range query over `m` and `n`, fixing a value of `k`.
- The read load is exponentially distributed with respect to `k`. In other words, a few values of `k` are responsible for most of the read load.
Alice argues that the key should look like `k-m-n` in order to exploit locality of reference. Ideally, a single machine should be able to serve an entire query.
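For concreteness, here is a minimal sketch of how I read Alice's layout, assuming zero-padded decimal fields so lexicographic order matches numeric order (the padding width and the helper name `AliceKey.rowKey` are my own, not anything settled):

```java
import java.nio.charset.StandardCharsets;

class AliceKey {
    // "k-m-n" with zero padding: all rows for one k sort contiguously,
    // so a fixed-k scan over m and n is a single contiguous range.
    static byte[] rowKey(int k, int m, int n) {
        return String.format("%04d-%04d-%04d", k, m, n)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```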
Bob argues that the key should look like `sha1(k-m)-n` in order to avoid hotspotting: if `k=1` is accessed extremely often, then it would be wise for the `k=1` records not to all sit on the same few machines.
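And a sketch of how I read Bob's layout (again just an illustration; the hex encoding of the digest and the helper name `BobKey.rowKey` are my assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BobKey {
    // "sha1(k-m)-n": the hash prefix scatters a hot k across the keyspace,
    // but a fixed-k query now becomes one range scan per value of m.
    static byte[] rowKey(int k, int m, int n) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                String.format("%04d-%04d", k, m).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return String.format("%s-%04d", hex, n).getBytes(StandardCharsets.UTF_8);
    }
}
```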
Both arguments make sense to me. How do I figure out which option is more scalable/future-proof? Is there a quick, practical way to test this empirically?