I have a table in my database called "mytable," which contains a "data" field with JSONB data like this:
{
...
"myfield": "value1"
}
The "myfield" field can currently have values "value1," "value2," and "value3." Additionally, "myfield" can be null, or it might not exist at all.
In my queries, I need to find rows where "myfield" is not equal to "value1."
select * from mytable where data->>'myfield' <> 'value1'
I've created a partial index on the "mytable" table to speed up the query:
CREATE INDEX "idx_mytable_myfield" ON "mytable" USING btree (
    (data ->> 'myfield'::text) COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
)
WHERE ((data ->> 'myfield'::text) <> 'value1'::text);
However, the query does not use this index, even though it is highly selective and a sequential scan is significantly slower.
If I rewrite the query and the index to use IN instead of <>, the index is used. But this means I'll have to rewrite both the query and the index whenever new values are added in the future:
select * from mytable where (data->>'myfield'::text)::text in ('value2'::text,'value3'::text)
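For completeness, a sketch of what that IN-based partial index could look like (the index name is illustrative):
CREATE INDEX "idx_mytable_myfield_in" ON "mytable" USING btree (
    (data ->> 'myfield'::text)
)
-- the predicate must list every allowed value explicitly,
-- which is exactly the maintenance problem described above
WHERE (data ->> 'myfield'::text) IN ('value2', 'value3');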
Can you please help me understand what I'm doing wrong? Is there a way to rewrite the query and/or the index?
To address possible questions in advance:
- I've run ANALYZE mytable after each index change.
- The query has high selectivity.
- I asked ChatGPT about this, and it suggested exactly what I'm already trying to do.
- My PostgreSQL version is 12.14.
- I don't want to build a non-partial index on data->>'myfield' or a GIN index on the whole data field because of performance and index size.
P.S. I've tried adding additional conditions to both the index and the query, but it didn't help:
...
and data is not null
and data->'myfield'::text is not null
CREATE INDEX "mytable_myfield_idx" ON "set10"."mytable" USING btree (
("data" ->> 'myfield'::text) COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
)
WHERE ("data" ->> 'myfield'::text <> 'нет');
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS)
SELECT
*
FROM
mytable
WHERE
"data" ->> 'myfield'::text <> 'нет'
Seq Scan on set10.mytable (cost=0.00..2041.22 rows=86911 width=35) (actual time=19.098..35.258 rows=74 loops=1)
Output: "jiraKey", data, created
Filter: ((mytable.data ->> 'myfield'::text) <> 'нет'::text)
Rows Removed by Filter: 87274
Buffers: shared hit=731
Planning Time: 0.051 ms
Execution Time: 35.278 ms
My experiments show that the problem is the "<>" operator: the optimizer won't use the index in this case.
UPD
If I force-disable sequential scans, the optimizer uses my index properly. For some reason it thinks that a sequential scan is better, but it is wrong:
SET enable_seqscan = false;
EXPLAIN (ANALYZE,BUFFERS, COSTS, VERBOSE)
select
*
FROM
mytable
WHERE
"data"->>'myfield' <> 'нет'
Bitmap Heap Scan on set10.mytable (cost=30.20..2064.86 rows=86911 width=35) (actual time=0.024..0.098 rows=74 loops=1)
Output: "jiraKey", data, created
Recheck Cond: ((mytable.data ->> 'myfield'::text) <> 'нет'::text)
Heap Blocks: exact=52
Buffers: shared hit=53
-> Bitmap Index Scan on mytable_myfield_idx (cost=0.00..8.47 rows=86911 width=0) (actual time=0.013..0.014 rows=74 loops=1)
Buffers: shared hit=1
Planning Time: 0.059 ms
Execution Time: 0.128 ms
In order to let the planner get an accurate estimate of the row counts, you need a plain (non-partial) expression index like this (a minimal sketch; the index name is arbitrary):
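CREATE INDEX idx_mytable_myfield_expr ON mytable ((data ->> 'myfield'));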
And then do an ANALYZE after it is created. You already have that expression index in a partial form, but partial indexes are not used by the planner to derive row counts.
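For example (assuming the index above; PostgreSQL stores statistics for expression indexes under the index's name in pg_stats):
ANALYZE mytable;

-- check that the planner now has statistics for the expression
SELECT tablename, attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'idx_mytable_myfield_expr';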
If you were using a modern version of the software, you could instead create extended statistics on the expression:
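A sketch for v14+ (the statistics name is arbitrary):
CREATE STATISTICS mytable_myfield_stats ON ((data ->> 'myfield')) FROM mytable;
ANALYZE mytable;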
This will generate the same statistics as the index does, but doesn't have the same storage or maintenance needs as the index. But it requires v14 or above.