Unable to use index in specific situations

Question

Unable to use index in specific situations

65 Views Asked by Thomas Li At 09 August 2023 at 20:27

I am facing a strange and infuriating problem of my PostgreSQL 10 database not using indexes in specific situations. I have created an index of an md5 hash of one of the columns, and this index seems to work fine under normal circumstances:

EXPLAIN
SELECT * FROM variants_table WHERE md5(variant_id) = md5('1_9746683_A_G');

QUERY PLAN
Index Scan using idx_variant_id_md5 on variants_table  (cost=0.71..50470.36 rows=12473 width=748)
   Index Cond: (md5(variant_id) = '00fa85f0ea97a4aa2898f838f427e560'::text)
(2 rows)

However, when I try to delete duplicates using this query:

EXPLAIN
WITH Ranked AS (
   SELECT
     md5(variant_id) as variant_id_md5,
     consscore,
     ROW_NUMBER() OVER (PARTITION BY md5(variant_id) ORDER BY consscore DESC) as rn
   FROM
     variants_table
 )
DELETE FROM variants_table
WHERE (md5(variant_id), consscore) IN (
   SELECT variant_id_md5, consscore
   FROM Ranked
   WHERE rn > 1
 );

QUERY PLAN
 Delete on variants_table  (cost=3647060725.61..4520666023.49 rows=2949470436 width=70)
   CTE ranked
     ->  WindowAgg  (cost=3096491910.93..3361944250.15 rows=11797881743 width=48)
           ->  Sort  (cost=3096491910.93..3125986615.29 rows=11797881743 width=40)
                 Sort Key: (md5(variants_table_1.variant_id)), variants_table_1.consscore DESC
                 ->  Seq Scan on variants_table variants_table_1  (cost=0.00..800237220.79 rows=11797881743 width=40)
   ->  Hash Join  (cost=285116475.46..1158721773.34 rows=2949470436 width=70)
         Hash Cond: ((md5(variants_table.variant_id) = ranked.variant_id_md5) AND (variants_table.consscore = ranked.consscore))
         ->  Seq Scan on variants_table  (cost=0.00..770742516.43 rows=11797881743 width=46)
         ->  Hash  (cost=285115875.46..285115875.46 rows=40000 width=104)
               ->  HashAggregate  (cost=285115475.46..285115875.46 rows=40000 width=104)
                     Group Key: ranked.variant_id_md5, ranked.consscore
                     ->  CTE Scan on ranked  (cost=0.00..265452339.22 rows=3932627248 width=104)
                           Filter: (rn > 1)
(14 rows)

It defaults to a sequential scan.

I have tried everything to debug this including setting enable_seqscan = off, but no matter what, the query planner still uses the sequential scan, leading me to assume that it doesn't know how to use the indexes.

Original Q&A

There are 1 best solutions below

**Erwin Brandstetter** · Answer 1 · 2023-08-10T02:09:04.510000

Problem

The window function row_number() has to process the whole table, so it doesn't pay to use your index to begin with. Hence the query plan descends into a sequential scan.

Better query

Postgres 15 brought some optimizations for row_number(), and the upcoming Postgres 16 ships with more, but I am not sure how much your DELETE query can profit.

Be that as it may, this simpler query should perform better in any case:

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND    v.consscore > d.consscore
   );

Read:
"Delete rows where another row with the same md5(variant_id) and a greater consscore exists in the same table!"

If there can be complete duplicates on (md5(variant_id), consscore), this query preserves all dupes with the highest consscore - unlike your original query. But your original query keeps an arbitrary winner, which is typically bad design, and you should really add a deterministic tiebreaker in this case - which you can use in my query accordingly.

If an arbitrary pick is good enough, you can use the system column ctid in a row-value comparison for this:

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND    (v.consscore, v.ctid) > (d.consscore, d.ctid)
   );

About row-value comparison:

SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

Note that my direct comparison works differently for null values than your original. Typically, you don't want to use my query if variant_id or consscore can be null. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

If null values can be involved, you'll have to define the desired behavior exactly.

If this is going to delete the majority of rows, other strategies may be more efficient. Like copying the few survivors into a pristine new table. See:

Best way to delete millions of rows by ID

But I'd assume you are hunting for few duplicates. Consider making that expression index UNIQUE after you have cleaned up to prevent more dupes from creeping in. If possible.

Aside

Your query should be faster, yet, after optimizing that expression index with the data type uuid instead of text. See:

Or maybe a bigint hash is good enough? See:

Why doesn't my UNIQUE constraint trigger?

Unable to use index in specific situations

There are 1 best solutions below

Problem

Better query

Aside

Related Questions in SQL

Related Questions in POSTGRESQL

Related Questions in INDEXING

Related Questions in POSTGRESQL-10

Related Questions in POSTGRESQL-PERFORMANCE

Trending Questions

Popular # Hahtags

Popular Questions