I am facing a strange and infuriating problem of my PostgreSQL 10 database not using indexes in specific situations. I have created an index of an md5 hash of one of the columns, and this index seems to work fine under normal circumstances:
EXPLAIN
SELECT * FROM variants_table WHERE md5(variant_id) = md5('1_9746683_A_G');
QUERY PLAN
Index Scan using idx_variant_id_md5 on variants_table (cost=0.71..50470.36 rows=12473 width=748)
Index Cond: (md5(variant_id) = '00fa85f0ea97a4aa2898f838f427e560'::text)
(2 rows)
However, when I try to delete duplicates using this query:
EXPLAIN
WITH Ranked AS (
SELECT
md5(variant_id) as variant_id_md5,
consscore,
ROW_NUMBER() OVER (PARTITION BY md5(variant_id) ORDER BY consscore DESC) as rn
FROM
variants_table
)
DELETE FROM variants_table
WHERE (md5(variant_id), consscore) IN (
SELECT variant_id_md5, consscore
FROM Ranked
WHERE rn > 1
);
QUERY PLAN
Delete on variants_table (cost=3647060725.61..4520666023.49 rows=2949470436 width=70)
CTE ranked
-> WindowAgg (cost=3096491910.93..3361944250.15 rows=11797881743 width=48)
-> Sort (cost=3096491910.93..3125986615.29 rows=11797881743 width=40)
Sort Key: (md5(variants_table_1.variant_id)), variants_table_1.consscore DESC
-> Seq Scan on variants_table variants_table_1 (cost=0.00..800237220.79 rows=11797881743 width=40)
-> Hash Join (cost=285116475.46..1158721773.34 rows=2949470436 width=70)
Hash Cond: ((md5(variants_table.variant_id) = ranked.variant_id_md5) AND (variants_table.consscore = ranked.consscore))
-> Seq Scan on variants_table (cost=0.00..770742516.43 rows=11797881743 width=46)
-> Hash (cost=285115875.46..285115875.46 rows=40000 width=104)
-> HashAggregate (cost=285115475.46..285115875.46 rows=40000 width=104)
Group Key: ranked.variant_id_md5, ranked.consscore
-> CTE Scan on ranked (cost=0.00..265452339.22 rows=3932627248 width=104)
Filter: (rn > 1)
(14 rows)
It defaults to a sequential scan.
I have tried everything to debug this including setting enable_seqscan = off, but no matter what, the query planner still uses the sequential scan, leading me to assume that it doesn't know how to use the indexes.
Problem
The window function
row_number()has to process the whole table, so it doesn't pay to use your index to begin with. Hence the query plan descends into a sequential scan.Better query
Postgres 15 brought some optimizations for
row_number(), and the upcoming Postgres 16 ships with more, but I am not sure how much yourDELETEquery can profit.Be that as it may, this simpler query should perform better in any case:
Read:
"Delete rows where another row with the same
md5(variant_id)and a greaterconsscoreexists in the same table!"If there can be complete duplicates on
(md5(variant_id), consscore), this query preserves all dupes with the highestconsscore- unlike your original query. But your original query keeps an arbitrary winner, which is typically bad design, and you should really add a deterministic tiebreaker in this case - which you can use in my query accordingly.If an arbitrary pick is good enough, you can use the system column
ctidin a row-value comparison for this:About row-value comparison:
Note that my direct comparison works differently for
nullvalues than your original. Typically, you don't want to use my query ifvariant_idorconsscorecan benull. See:If
nullvalues can be involved, you'll have to define the desired behavior exactly.If this is going to delete the majority of rows, other strategies may be more efficient. Like copying the few survivors into a pristine new table. See:
But I'd assume you are hunting for few duplicates. Consider making that expression index
UNIQUEafter you have cleaned up to prevent more dupes from creeping in. If possible.Aside
Your query should be faster, yet, after optimizing that expression index with the data type
uuidinstead of text. See:Or maybe a
biginthash is good enough? See: