Do I need to wait for background compaction to finish after creating test data to do a good read benchmark?


I am running some benchmarks with RocksDB Java on my own application data, and I would like to be sure the created data is stored as optimally as possible before I start measuring read performance (i.e. if any background compaction is going on during or after the inserts, I would like to wait for it to complete). Is this something I need to be concerned about, and if so, how can I know programmatically when it is OK to start my read benchmark?


There are 2 best solutions below

2
On

This is a tricky subject. The overly general answer is to test what you care about. If you care about overall system performance under a mixed read-write workload, that's probably what you should test. If you care about read performance under those conditions, then you should probably test under those conditions. (Note that the RocksDB LOG file reports operation counts and latency statistics, though those don't include penalties associated with the Java layer.) However, it can take hours of testing under such chaotic conditions to get reliable data about a single aspect of performance such as read latency or max throughput.

If you are willing to sacrifice some statistical validity (how closely the test matches real conditions) for more statistical reliability (repeatable measurements in less time), then you can run just the read path. As you note, you want to avoid background compactions in order to consistently isolate the read path. For this I recommend re-opening the database as read-only (in RocksJava, `RocksDB.openReadOnly`) and then performing your reads. Or you can wait for pending compactions to finish by periodically polling the DB property kNumRunningCompactions (the string "rocksdb.num-running-compactions") until it is zero, perhaps several times in a row. Either approach generally leaves the LSM in some random, average-ish state that reflects how reads will perform in an active read-write system. The particular LSM state can vary considerably, though, so you might want to average over several such states.
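The polling idea can be sketched roughly as follows. This is only a minimal illustration, not RocksDB API: `CompactionWait`, `waitUntilQuiet`, and its parameters are hypothetical names, and the gauge is passed in as a `LongSupplier` so the loop itself has no RocksDB dependency.

```java
import java.util.function.LongSupplier;

/** Polls a "running compactions" gauge until it reads zero several times in a row. */
final class CompactionWait {

    /**
     * @param runningCompactions supplies the current gauge value (hypothetical hook)
     * @param requiredZeros      consecutive zero readings needed to call the DB quiet
     * @param pollMillis         sleep between polls
     * @param timeoutMillis      give up after roughly this long
     * @return true once the gauge read zero requiredZeros times in a row,
     *         false on timeout or if the thread was interrupted
     */
    static boolean waitUntilQuiet(LongSupplier runningCompactions, int requiredZeros,
                                  long pollMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        int zerosInARow = 0;
        while (true) {
            // Any non-zero reading resets the streak.
            zerosInARow = (runningCompactions.getAsLong() == 0L) ? zerosInARow + 1 : 0;
            if (zerosInARow >= requiredZeros) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in gauge: one running compaction for the first two polls, then zero.
        long[] polls = {0};
        boolean quiet = waitUntilQuiet(() -> ++polls[0] <= 2 ? 1L : 0L, 3, 10L, 5_000L);
        System.out.println("quiet=" + quiet + " after " + polls[0] + " polls");
    }
}
```

With RocksJava, the supplier would presumably wrap `db.getLongProperty("rocksdb.num-running-compactions")`, catching the checked `RocksDBException` it can throw. It may also be worth checking "rocksdb.num-running-flushes" and "rocksdb.compaction-pending" the same way, since a compaction can be scheduled but not yet running at the moment you poll.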

The problem with running a full compaction before testing read performance is that your LSM will always be in an "optimized" state, so reads will be as fast as they can be. If your actual workload is always read-only after compaction, then by all means test this way, but it's considered to have low validity for real-world read performance for most workloads.

If you are doing A-B testing on a change that doesn't affect how the DB is written, then the best approach is to build a single DB and test read performance under both A and B configurations on that DB, opened read-only. You can even run the A and B tests simultaneously so that each is similarly affected by any noise from other processes on the system.
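A rough sketch of such a side-by-side run, under some assumptions: `AbReadBench` and its methods are hypothetical names, and `readA` / `readB` stand in for whatever performs one read against the same read-only DB under configuration A or B (for example, a `get` on a handle opened with the respective options). Only the timing harness is shown.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;

final class AbReadBench {

    /** Runs readOp(i) for i in [0, ops) and returns per-call latencies in ns, sorted ascending. */
    static long[] measure(IntConsumer readOp, int ops) {
        long[] nanos = new long[ops];
        for (int i = 0; i < ops; i++) {
            long start = System.nanoTime();
            readOp.accept(i);
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    /** Nearest-rank percentile of an ascending-sorted, non-empty array; p in (0, 100]. */
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Runs A and B at the same time, on two threads, so both see the same system noise. */
    static long[][] runConcurrently(IntConsumer readA, IntConsumer readB, int ops) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<long[]> a =
                CompletableFuture.supplyAsync(() -> measure(readA, ops), pool);
            CompletableFuture<long[]> b =
                CompletableFuture.supplyAsync(() -> measure(readB, ops), pool);
            return new long[][] { a.join(), b.join() };
        } finally {
            pool.shutdown();
        }
    }
}
```

You would then compare, say, `percentile(result[0], 50)` and `percentile(result[0], 99)` against the same percentiles of `result[1]`. Keep in mind that the two runs also compete with each other for CPU and cache, so this trades some absolute accuracy for a fairer relative comparison.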

And of course one of the big challenges is that performance characteristics can change dramatically for small DBs vs. large DBs, and large DBs take a very long time to construct.

0
On

Do I need to wait for background compaction to finish after creating test data to do a good read benchmark?

You should run and report benchmarks for both cases: while the DB is compacting and while it is idle.

In production, users won't wait for the DB to finish compacting before they use the app.