I'm hoping to trade accuracy for speed by using SAMPLE. While some queries do run faster, the query_log shows that all rows are being read and more bytes are read than when not using SAMPLE. I thought SAMPLE would result in fewer row reads.
My questions:
Should using SAMPLE result in more rows and bytes being read?
Is this increased reading due to a sub-optimal table definition that can be corrected?
I'm using ClickHouse version 20.5.3 revision 54435.
Table definition:
CREATE TABLE default.table_one
(
`datestamp` Date,
`timestamp` UInt64,
`sample_hash` UInt64,
`value` UInt32,
...
)
ENGINE = MergeTree()
PARTITION BY date_stamp
ORDER BY (datestamp, timestamp, sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192
Query without SAMPLE
SELECT
avg(value)
FROM default.table_one;
query_duration_ms: 166
rows_read: 100,000,000
read_bytes: 800,000,000
Query with SAMPLE
SELECT
avg(value)
FROM default.table_one
SAMPLE 0.1;
query_duration_ms: 358
rows_read: 100,000,000
read_bytes: 1,600,000,000
Short answer: yes. Because CH needs to read one more column sample_hash.
Long answer: Sampling is hard. It is useful if you have 100 billions rows per day and 400 servers. It helps with groupbys a lot. It's not helpful in filtering because in your case it does not work along with primary index. Yandex has designed sampling for themselves. They have enabled forced partitioningkeys/primarykeys usage (force_index_by_date/force_primary_key). So queries like yours are impossible in their system and thus sampling helps them even in disk reading.
That's why I don't use Sampling in my systems.
But
Also such
ORDER BY
is not useful at all. This whole table is misdesign. No sense to put(datestamp
in the index prefix because the table is partitioned bydatestamp
so each partition has only ONEdatestamp
value.timestamp
in the index prefix is an even bigger issue because it's very unwise to put a highly cardinal column in the beginning of a primary index.So. I can create a synthetic example and show how
sampling
works. But does it have any sense?