What is the error rate of DuckDB approx_count_distinct for small cardinalities?

115 Views Asked by At

I've been trying to understand how to measure the error rate for approx_count_distinct in DuckDB. The database doesn't provide a way to output it. As I read from the science paper the error increases if the cardinality is low or/and it depends on the number of buckets in hyperloglog implementation but there's no formula to calculate it. Is there a way to show the possible error for this function?

1

There are 1 best solutions below

0
egor10_4 On

I've used this approach to output the error rate for very small cardinalities.

import duckdb
duckdb.sql('create table a(a bigint)')

c = 1
c_p = 0
for x in range(10):
    c *= 10
    print ("c: %d" % c)
    duckdb.sql('INSERT INTO a SELECT * FROM range(%d, %d)' % (c_p, c))
    duckdb.sql('SELECT approx_count_distinct(a) count, abs(%d-count)::DOUBLE/%d error_rate FROM a' % (c, c)).show()
    c_p = c
c: 10
┌───────┬────────────┐
│ count │ error_rate │
│ int64 │   double   │
├───────┼────────────┤
│    10 │        0.0 │
└───────┴────────────┘

c: 100
┌───────┬────────────┐
│ count │ error_rate │
│ int64 │   double   │
├───────┼────────────┤
│    98 │       0.02 │
└───────┴────────────┘

c: 1000
┌───────┬────────────┐
│ count │ error_rate │
│ int64 │   double   │
├───────┼────────────┤
│  1011 │      0.011 │
└───────┴────────────┘

c: 10000
┌───────┬────────────┐
│ count │ error_rate │
│ int64 │   double   │
├───────┼────────────┤
│ 10066 │     0.0066 │
└───────┴────────────┘

c: 100000
┌───────┬────────────┐
│ count │ error_rate │
│ int64 │   double   │
├───────┼────────────┤
│ 99492 │    0.00508 │
└───────┴────────────┘

c: 1000000
┌─────────┬────────────┐
│  count  │ error_rate │
│  int64  │   double   │
├─────────┼────────────┤
│ 1007982 │   0.007982 │
└─────────┴────────────┘

c: 10000000
┌──────────┬────────────┐
│  count   │ error_rate │
│  int64   │   double   │
├──────────┼────────────┤
│ 10037331 │  0.0037331 │
└──────────┴────────────┘

c: 100000000
┌───────────┬────────────┐
│   count   │ error_rate │
│   int64   │   double   │
├───────────┼────────────┤
│ 100310782 │ 0.00310782 │
└───────────┴────────────┘

c: 1000000000
┌───────────┬─────────────┐
│   count   │ error_rate  │
│   int64   │   double    │
├───────────┼─────────────┤
│ 976524121 │ 0.023475879 │
└───────────┴─────────────┘

You can see that the error rate definitely goes down with the cardinality increase but not always (see the last test run).