I've got a table users
with fields id
and email
. id
is the primary key and email
is indexed as well.
database> \d users
+-----------------------------+-----------------------------+-----------------------------------------------------+
| Column | Type | Modifiers |
|-----------------------------+-----------------------------+-----------------------------------------------------|
| id | integer | not null default nextval('users_id_seq'::regclass) |
| email | character varying | |
+-----------------------------+-----------------------------+-----------------------------------------------------+
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"index_users_on_email" UNIQUE, btree (email)
If I query the table with a distinct on (email)
clause in a subquery I get a significant performance penalty.
database> explain (analyze, buffers)
select
id
from (
select distinct on (email)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Subquery Scan on t (cost=8898.69..10077.84 rows=337 width=4) (actual time=221.133..250.782 rows=1 loops=1) |
| Filter: (t.id = 123) |
| Rows Removed by Filter: 67379 |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Unique (cost=8898.69..9235.59 rows=67380 width=24) (actual time=221.121..247.582 rows=67380 loops=1) |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Sort (cost=8898.69..9067.14 rows=67380 width=24) (actual time=221.120..239.573 rows=67380 loops=1) |
| Sort Key: users.email |
| Sort Method: external merge Disk: 2304kB |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Seq Scan on users (cost=0.00..3494.80 rows=67380 width=24) (actual time=0.009..9.714 rows=67380 loops=1) |
| Buffers: shared hit=2821 |
| Planning Time: 0.243 ms |
| Execution Time: 251.258 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Compare this with distinct on (id)
whose cost is less than one thousandth of the previous query.
database> explain (analyze, buffers)
select
id
from (
select distinct on (id)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Unique (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) |
| Buffers: shared hit=3 |
| -> Index Only Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1) |
| Index Cond: (id = 123) |
| Heap Fetches: 1 |
| Buffers: shared hit=3 |
| Planning Time: 0.090 ms |
| Execution Time: 0.034 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Why is this?
The real problem I'm having is that I'm trying to create a view that does distinct on
an indexed column that isn't unique and the performance is very bad.
Logical difference
Both columns
id
andemail
areUNIQUE
. But onlyid
isNOT NULL
. (PRIMARY KEY
columns always are.)NULL
values are not considered equal, multipleNULL
values are allowed in a column withUNIQUE
constraint (or index). That's according to standard SQL. See:But
DISTINCT
orDISTINCT ON
consider NULL values equal. The manual:Bold emphasis mine. Further reading:
In your second query,
distinct on (id)
is a logical no-op: the result is guaranteed to be the same as withoutDISTINCT ON
. And since the outerSELECT
filters onid = 123
, Postgres can strip the noise and work off a very cheap index-only scan.In your first query, on the other hand,
distinct on (email)
might actually do something if there are more than one rows withemail IS NULL
. Then Postgres has to pick the firstid
according to the given sort order. Since there is noORDER BY
, it results in an arbitrary pick. But the outerSELECT
with the predicatewhere id = 123
may depend on the result. The whole query is different from the first on principle - and broken by design.Serendipity
All that aside, two "lucky" findings stick out:
Mention of "Disk" indicates insufficient work_mem. See:
In my tests, I always got an index scan here. Indicates a bloated index or some other problem with your setup.
Useful comparison?
The comparison is going nowhere. We might learn something from comparing the first query with this query - after switching roles of PK and UNIQUE column:
Or from comparing the second query with this one - trying the same with the UNIQUE column instead of the PK column:
We learn that PK and UNIQUE constraint show no different effect on query plans. Postgres does not use the meta-information to cut corners. The PK would actually make a difference with
GROUP BY
. See:So this works:
But the same does not work after switching
id
andemail
. I added some demos to the fiddle:db<>fiddle here
So?
Both queries are nonsense for different reasons. I don't see how they can help with your real query:
We'd need to see your real query - and all other relevant details of your setup. There may be solutions, but that may well beyond the scope of a question on SO. Consider hiring a consultant. Or consider one of these methods to optimize performance: