I have a table called users that contains over 2 millions records with the following structure:
create table users {
id integer not null,
langs jsonb
}
The column langs contains the following data in rows:
{"default_lang": "en", "custom_langs": ["en", "de"]}
{"default_lang": "fr", "custom_langs": ["en", "de"]}
{"default_lang": "pt", "custom_langs": ["en", "de", "fr"]}
The idea is to select only rows whose default_lang is not contained in custom_langs. So, I am able to achieve it like so:
SELECT id, langs
FROM users
WHERE deleted=false and (langs->'default_lang' <@ (langs->'custom_langs') = false)
I have also created the following index:
CREATE INDEX IF NOT EXISTS users_langs ON users
USING btree ((langs->'default_lang'), (langs->'custom_langs'))
where langs->'default_lang' <@ (langs->'custom_langs') = false;
But the index users_langs is not being applied. Any ideas on how to fix this. thanks in advance.
Here the query plan result:
HashAggregate (cost=957627.32..1002810.29 rows=2536588 width=8) (actual time=17095.651..17096.210 rows=2233 loops=1)
Output: id
Group Key: users.id
Planned Partitions: 64 Batches: 1 Memory Usage: 913kB
Buffers: shared hit=47315 read=762287
I/O Timings: read=11079.509
-> Seq Scan on public.users (cost=0.00..821285.71 rows=2536588 width=8) (actual time=17.969..17086.833 rows=2233 loops=1)
Output: id
Filter: ((NOT users.deleted) AND (NOT ((users.langs -> 'default_langs'::text) <@ (users.langs -> 'custom_langs'::text))))
Rows Removed by Filter: 2774667
Buffers: shared hit=47315 read=762287
I/O Timings: read=11079.509
Settings: effective_cache_size = '8041608kB', jit = 'off'
Query Identifier: -9027923201169468774
Planning:
Buffers: shared hit=427 read=32
I/O Timings: read=4.276
Planning Time: 5.737 ms
Execution Time: 17096.749 ms
The partial index is usable, and will be used if you disable seq scans (
set enable_seqscan=off), but it won't be used normally just because it is believed to be slower than the full table scan would, because the row estimates are so wrong.Unfortunately, I can't see how to get the estimates correct, not even by using custom statistics on the expression. I don't know why the custom statistics don't work. You could use the 3rd party extension pg_hint_plan to force an index scan.
After playing with this a bit more, the problem is the
= false. The partial index code knows how to deal with that, but apparently the custom statistics code does not. If you create the stats and analyze the table:then the stats will be right but only if the query is written with
is falsenot= false. Then an index will be used (but only if it too is changed tois false).