I've got a table users with fields id and email. id is the primary key and email is indexed as well.
database> \d users
+-----------------------------+-----------------------------+-----------------------------------------------------+
| Column | Type | Modifiers |
|-----------------------------+-----------------------------+-----------------------------------------------------|
| id | integer | not null default nextval('users_id_seq'::regclass) |
| email | character varying | |
+-----------------------------+-----------------------------+-----------------------------------------------------+
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"index_users_on_email" UNIQUE, btree (email)
If I query the table with a distinct on (email) clause in a subquery I get a significant performance penalty.
database> explain (analyze, buffers)
select
id
from (
select distinct on (email)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Subquery Scan on t (cost=8898.69..10077.84 rows=337 width=4) (actual time=221.133..250.782 rows=1 loops=1) |
| Filter: (t.id = 123) |
| Rows Removed by Filter: 67379 |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Unique (cost=8898.69..9235.59 rows=67380 width=24) (actual time=221.121..247.582 rows=67380 loops=1) |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Sort (cost=8898.69..9067.14 rows=67380 width=24) (actual time=221.120..239.573 rows=67380 loops=1) |
| Sort Key: users.email |
| Sort Method: external merge Disk: 2304kB |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Seq Scan on users (cost=0.00..3494.80 rows=67380 width=24) (actual time=0.009..9.714 rows=67380 loops=1) |
| Buffers: shared hit=2821 |
| Planning Time: 0.243 ms |
| Execution Time: 251.258 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Compare this with distinct on (id) whose cost is less than one thousandth of the previous query.
database> explain (analyze, buffers)
select
id
from (
select distinct on (id)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Unique (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) |
| Buffers: shared hit=3 |
| -> Index Only Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1) |
| Index Cond: (id = 123) |
| Heap Fetches: 1 |
| Buffers: shared hit=3 |
| Planning Time: 0.090 ms |
| Execution Time: 0.034 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Why is this?
The real problem I'm having is that I'm trying to create a view that does distinct on an indexed column that isn't unique and the performance is very bad.
Logical difference
Both columns
idandemailareUNIQUE. But onlyidisNOT NULL. (PRIMARY KEYcolumns always are.)NULLvalues are not considered equal, multipleNULLvalues are allowed in a column withUNIQUEconstraint (or index). That's according to standard SQL. See:But
DISTINCTorDISTINCT ONconsider NULL values equal. The manual:Bold emphasis mine. Further reading:
In your second query,
distinct on (id)is a logical no-op: the result is guaranteed to be the same as withoutDISTINCT ON. And since the outerSELECTfilters onid = 123, Postgres can strip the noise and work off a very cheap index-only scan.In your first query, on the other hand,
distinct on (email)might actually do something if there are more than one rows withemail IS NULL. Then Postgres has to pick the firstidaccording to the given sort order. Since there is noORDER BY, it results in an arbitrary pick. But the outerSELECTwith the predicatewhere id = 123may depend on the result. The whole query is different from the first on principle - and broken by design.Serendipity
All that aside, two "lucky" findings stick out:
Mention of "Disk" indicates insufficient work_mem. See:
In my tests, I always got an index scan here. Indicates a bloated index or some other problem with your setup.
Useful comparison?
The comparison is going nowhere. We might learn something from comparing the first query with this query - after switching roles of PK and UNIQUE column:
Or from comparing the second query with this one - trying the same with the UNIQUE column instead of the PK column:
We learn that PK and UNIQUE constraint show no different effect on query plans. Postgres does not use the meta-information to cut corners. The PK would actually make a difference with
GROUP BY. See:So this works:
But the same does not work after switching
idandemail. I added some demos to the fiddle:db<>fiddle here
So?
Both queries are nonsense for different reasons. I don't see how they can help with your real query:
We'd need to see your real query - and all other relevant details of your setup. There may be solutions, but that may well beyond the scope of a question on SO. Consider hiring a consultant. Or consider one of these methods to optimize performance: