I'm hoping to trade accuracy for speed by using SAMPLE. While some queries do run faster, the query_log shows that all rows are being read and more bytes are read than when not using SAMPLE. I thought SAMPLE would result in fewer row reads.

My questions:

  1. Should using SAMPLE result in more rows and bytes being read?

  2. Is this increased reading due to a sub-optimal table definition that can be corrected?

I'm using ClickHouse version 20.5.3 revision 54435.
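
For reference, this is the kind of query_log check I mean (a sketch; in system.query_log the columns are read_rows and read_bytes):

SELECT
  query,
  query_duration_ms,
  read_rows,
  read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%table_one%'
ORDER BY event_time DESC
LIMIT 10;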

Table definition:

CREATE TABLE default.table_one
(
  `datestamp` Date,
  `timestamp` UInt64,
  `sample_hash` UInt64,
  `value` UInt32,
  ...
)
ENGINE = MergeTree()
PARTITION BY datestamp
ORDER BY (datestamp, timestamp, sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192

Query without SAMPLE

SELECT
  avg(value)
FROM default.table_one;

query_duration_ms: 166
rows_read: 100,000,000
read_bytes: 800,000,000

Query with SAMPLE

SELECT
  avg(value)
FROM default.table_one
SAMPLE 0.1;

query_duration_ms: 358
rows_read: 100,000,000
read_bytes: 1,600,000,000

There are 2 answers below.

Best answer:

Short answer: yes, because ClickHouse needs to read one more column, sample_hash.
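
As a rough illustration (a conceptual sketch, not the exact internal rewrite), SAMPLE 0.1 on a UInt64 sampling key behaves like a range condition on that key, so sample_hash has to be read for every candidate row:

SELECT avg(value)
FROM default.table_one
-- keep roughly the lowest 10% of the UInt64 hash range
WHERE sample_hash < toUInt64(0.1 * 18446744073709551615);

The extra 800,000,000 read_bytes in your measurements is consistent with reading one more UInt64 column: 8 bytes × 100,000,000 rows.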

Long answer: sampling is hard. It is useful if you have 100 billion rows per day and 400 servers. It helps a lot with GROUP BYs. It is not helpful for filtering because, in your case, it does not work together with the primary index. Yandex designed sampling for themselves. They force the use of partition keys and primary keys (force_index_by_date / force_primary_key), so queries like yours are impossible in their system, and thus sampling helps them even with disk reads.
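
For illustration, that forced-key setup looks roughly like this (a sketch against your table; the datestamp filter is just an example):

SET force_index_by_date = 1, force_primary_key = 1;

-- Rejected: no condition on the partition key or the primary-key prefix.
SELECT avg(value)
FROM default.table_one;

-- Allowed: datestamp is both the partition key and the first primary-key column.
SELECT avg(value)
FROM default.table_one
WHERE datestamp = today();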

That's why I don't use sampling in my systems.

But

ORDER BY (datestamp, timestamp, sample_hash)

Also, such an ORDER BY is not useful at all; this whole table is misdesigned. There is no sense in putting datestamp in the index prefix, because the table is partitioned by datestamp, so each partition has only ONE datestamp value.

timestamp in the index prefix is an even bigger issue, because it is very unwise to put a high-cardinality column at the beginning of a primary index.
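
Applying that advice to your table would look something like this (an illustrative sketch only; the _v2 name and the hourly key are my assumptions, mirroring the synthetic example below):

CREATE TABLE default.table_one_v2
(
  `datestamp` Date,
  `timestamp` UInt64,
  `sample_hash` UInt64,
  `value` UInt32
  -- other columns omitted
)
ENGINE = MergeTree()
PARTITION BY datestamp
ORDER BY (toStartOfHour(toDateTime(timestamp)), sample_hash)
SAMPLE BY sample_hash
SETTINGS index_granularity = 8192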

So, I can create a synthetic example and show how sampling works, but does it make sense?

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)),  cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192


insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);



select banner_id, sum(value), count(value), max(value)
from table_one 
group by banner_id format Null;

0 rows in set. Elapsed: 11.490 sec. Processed 10.00 billion rows, 60.00 GB (870.30 million rows/s., 5.22 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;

0 rows in set. Elapsed: 1.316 sec. Processed 452.67 million rows, 6.34 GB (343.85 million rows/s., 4.81 GB/s.)



select banner_id, sum(value), count(value), max(value)
from table_one 
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.020 sec. Processed 10.30 million rows, 61.78 MB (514.37 million rows/s., 3.09 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;

0 rows in set. Elapsed: 0.008 sec. Processed 696.32 thousand rows, 9.75 MB (92.49 million rows/s., 1.29 GB/s.)




select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one 
group by banner_id, hr format Null;
0 rows in set. Elapsed: 36.660 sec. Processed 10.00 billion rows, 140.00 GB (272.77 million rows/s., 3.82 GB/s.)

select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 3.741 sec. Processed 452.67 million rows, 9.96 GB (121.00 million rows/s., 2.66 GB/s.)




select count()
from table_one 
where value = 666 format Null;
1 rows in set. Elapsed: 6.056 sec. Processed 10.00 billion rows, 40.00 GB (1.65 billion rows/s., 6.61 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
1 rows in set. Elapsed: 1.214 sec. Processed 452.67 million rows, 5.43 GB (372.88 million rows/s., 4.47 GB/s.)
Second answer:

Hard part: here is an example of how a high-cardinality column in the primary index impacts performance. The same table, the same data, but instead of

ORDER BY (banner_id, toStartOfHour(toDateTime(timestamp)), cityHash64(transaction_id))

I used

ORDER BY (banner_id, timestamp, cityHash64(transaction_id))

CREATE TABLE table_one
( timestamp UInt64,
  transaction_id UInt64,
  banner_id UInt16,
  value UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(toDateTime(timestamp))
ORDER BY (banner_id, timestamp, cityHash64(transaction_id))
SAMPLE BY cityHash64(transaction_id)
SETTINGS index_granularity = 8192

insert into table_one select 1602809234+intDiv(number,100000), number, number%991, toUInt32(rand())
from numbers(10000000000);



select banner_id, sum(value), count(value), max(value)
from table_one 
group by banner_id format Null;
0 rows in set. Elapsed: 11.196 sec. Processed 10.00 billion rows, 60.00 GB (893.15 million rows/s., 5.36 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id format Null;
0 rows in set. Elapsed: 24.378 sec. Processed 10.00 billion rows, 140.00 GB (410.21 million rows/s., 5.74 GB/s.)



select banner_id, sum(value), count(value), max(value)
from table_one 
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.022 sec. Processed 10.27 million rows, 61.64 MB (459.28 million rows/s., 2.76 GB/s.)

select banner_id, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
WHERE banner_id = 42
group by banner_id format Null;
0 rows in set. Elapsed: 0.037 sec. Processed 10.27 million rows, 143.82 MB (275.16 million rows/s., 3.85 GB/s.)



select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one 
group by banner_id, hr format Null;
0 rows in set. Elapsed: 21.663 sec. Processed 10.00 billion rows, 140.00 GB (461.62 million rows/s., 6.46 GB/s.)


select banner_id, toStartOfHour(toDateTime(timestamp)) hr, sum(value), count(value), max(value)
from table_one SAMPLE 0.01
group by banner_id, hr format Null;
0 rows in set. Elapsed: 26.697 sec. Processed 10.00 billion rows, 220.00 GB (374.57 million rows/s., 8.24 GB/s.)



select count()
from table_one 
where value = 666 format Null;
0 rows in set. Elapsed: 7.679 sec. Processed 10.00 billion rows, 40.00 GB (1.30 billion rows/s., 5.21 GB/s.)

select count()
from table_one  SAMPLE 0.01
where value = 666 format Null;
0 rows in set. Elapsed: 21.668 sec. Processed 10.00 billion rows, 120.00 GB (461.51 million rows/s., 5.54 GB/s.)

The high-cardinality column timestamp makes it impossible to use range seeks in the index for cityHash64(transaction_id). ClickHouse reads every mark even for the 0.01 slice. This is the expected behavior, and the same is true for any database or any sorted list.

Now ClickHouse reads all rows, both with 0.01 sampling and without.