Simple WHERE EXISTS ... ORDER BY ... query very slow in PostgreSQL

I have this very simple query, generated by my ORM (Entity Framework Core):

SELECT *
 FROM "table1" AS "t1"
 WHERE EXISTS (
     SELECT 1
     FROM "table2" AS "t2"
     WHERE ("t2"."is_active" = TRUE) AND ("t1"."table2_id" = "t2"."id"))
 ORDER BY "t1"."table2_id"
  1. Only 2 rows in table2 have "is_active" = TRUE. The other involved columns ("id") are the primary keys. The query returns exactly 4 rows.
  2. Table 1 has 96 million rows.
  3. Table 2 has 30 million rows.
  4. The 3 columns involved in this query are indexed (is_active, id, table2_id).
  5. The C#/LINQ code that generates this simple query is: Table2.Where(t => t.IsActive).Include(t => t.Table1).ToList();
  6. The statistics target was raised with SET STATISTICS 10000 on all 3 columns.
  7. VACUUM FULL ANALYZE was run on both tables. (A rough sketch of this setup follows the list.)
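
For reference, a rough sketch of that setup. The exact DDL was not included; the index names are taken from the EXPLAIN plans below, everything else is spelled out as I would expect it to look, so treat this as illustrative only:

-- Indexes on the 3 involved columns (names as they appear in the plans below);
-- the primary keys on "id" already provide their own indexes:
CREATE INDEX table2_is_active_idx ON table2 (is_active);
CREATE INDEX table1_table2_id_fkey ON table1 (table2_id);

-- Raise the statistics target on the 3 columns:
ALTER TABLE table2 ALTER COLUMN is_active SET STATISTICS 10000;
ALTER TABLE table2 ALTER COLUMN id        SET STATISTICS 10000;
ALTER TABLE table1 ALTER COLUMN table2_id SET STATISTICS 10000;

-- Rewrite the tables and refresh all statistics:
VACUUM FULL ANALYZE table2;
VACUUM FULL ANALYZE table1;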

WITHOUT the ORDER BY clause, the query returns within a few milliseconds, which is exactly what I would expect for 4 records. EXPLAIN output:

Nested Loop  (cost=1.13..13.42 rows=103961024 width=121)
  ->  Index Scan using table2_is_active_idx on table2  (cost=0.56..4.58 rows=1 width=8)
        Index Cond: (is_active = true)
        Filter: is_active
  ->  Index Scan using table1_table2_id_fkey on table1 t1 (cost=0.57..8.74 rows=10 width=121)
        Index Cond: (table2_id = table2.id)

WITH the ORDER BY clause, the query takes 5 minutes to complete! EXPLAIN output:

Merge Semi Join  (cost=10.95..4822984.67 rows=103961040 width=121)
  Merge Cond: (t1.table2_id = t2.id)
  ->  Index Scan using table1_table2_id_fkey on table1 t1  (cost=0.57..4563070.61 rows=103961040 width=121)
  ->  Sort  (cost=4.59..4.59 rows=2 width=8)
        Sort Key: t2.id
        ->  Index Scan using table2_is_active_idx on table2 t2  (cost=0.56..4.58 rows=2 width=8)
              Index Cond: (is_active = true)
              Filter: is_active

The index scan on table2 should return no more than 2 rows. The full index scan on table1, with a cost of 4563070 and an estimated 103961040 rows, makes no sense: it only has to match 2 rows in table2 against 4 rows in table1!
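
To show how far off the estimates are, the plan could also be captured with actual row counts; this is purely an illustrative diagnostic (note that it executes the query, so with the bad plan it takes the full 5 minutes):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
 FROM "table1" AS "t1"
 WHERE EXISTS (
     SELECT 1
     FROM "table2" AS "t2"
     WHERE ("t2"."is_active" = TRUE) AND ("t1"."table2_id" = "t2"."id"))
 ORDER BY "t1"."table2_id";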

This is a very simple query with very few records to return. Why is Postgres failing to execute it properly?

There are 3 answers below.

BEST ANSWER

OK, I solved my problem in the most unexpected way: I upgraded PostgreSQL from 9.6.1 to 9.6.3, and that was it. After restarting the service, the explain plan looked good and the query ran just fine. I did not change anything else - no new index, nothing. The only explanation I can think of is that there was a query planner bug in 9.6.1 that was fixed by 9.6.3. Thank you all for your answers!
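
A quick way to confirm which point release is actually running after the restart (an illustrative snippet, not part of the original answer):

SHOW server_version;
-- or
SELECT version();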

ANSWER

Add an index:

CREATE INDEX _index 
ON table2 
USING btree (id) 
WHERE is_active IS TRUE;

And rewrite the query like this:

SELECT table1.*
FROM table2
INNER JOIN table1 ON (table1.table2_id = table2.id)
WHERE table2.is_active IS TRUE 
ORDER BY table2.id

It is necessary to take into account that "is_active IS TRUE" and "is_active = TRUE" are processed by PostgreSQL in different ways. So the expression in the index predicate and the expression in the query must match.
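
As an illustration of that rule (an assumed check, not part of the original answer): with the partial index "_index" from above in place, the plan of the rewritten query should pick it up, because the query repeats the exact predicate of the index:

EXPLAIN
SELECT table1.*
FROM table2
INNER JOIN table1 ON (table1.table2_id = table2.id)
WHERE table2.is_active IS TRUE
ORDER BY table2.id;
-- Expect a scan using "_index" on table2; with "is_active = TRUE" in the
-- query instead, the planner may not match the "IS TRUE" index predicate.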

If you can't rewrite the query, try adding this index instead:

CREATE INDEX _index 
ON table2 
USING btree (id) 
WHERE is_active = TRUE;

ANSWER

Your guess is right, there is a bug in Postgres 9.6.1 that fits your use case exactly. And upgrading was the right thing to do. Upgrading to the latest point-release is always the right thing to do.

Quoting the release notes for Postgres 9.6.2:

  • Fix foreign-key-based join selectivity estimation for semi-joins and anti-joins, as well as inheritance cases (Tom Lane)

    The new code for taking the existence of a foreign key relationship into account did the wrong thing in these cases, making the estimates worse not better than the pre-9.6 code.

You should still create that partial index like Dima advised. But keep it simple:

is_active = TRUE and is_active IS TRUE subtly differ in that the second returns FALSE instead of NULL for NULL input. But none of this matters in a WHERE clause where only TRUE qualifies. And both expressions are just noise. In Postgres you can use boolean values directly:

CREATE INDEX t2_id_idx ON table2 (id) WHERE is_active;  -- that's all

And do not rewrite your query with a LEFT JOIN. This would add rows consisting of NULL values to the result for "active" rows in table2 without any siblings in table1. To match your current logic it would have to be an [INNER] JOIN:

SELECT t1.*
FROM   table2 t2
JOIN   table1 t1 ON t1.table2_id = t2.id  -- and no parentheses needed
WHERE  t2.is_active  -- that's all
ORDER  BY t1.table2_id;

But there is no need to rewrite your query that way at all. The EXISTS semi-join you have is just as good. It results in the same query plan once you have the partial index:

SELECT *
FROM   table1 t1
WHERE  EXISTS (
   SELECT 1 FROM table2
   WHERE  is_active  -- that's all
   AND    id = t1.table2_id
   )
ORDER  BY table2_id;

BTW, since you fixed the bug by upgrading, and once you have created that partial index (and run ANALYZE or VACUUM ANALYZE on the table at least once - or autovacuum did that for you), you will never again get a bad query plan for this, since Postgres maintains separate estimates for the partial index, which are unambiguous for your numbers.
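
A minimal sketch of that follow-up step, reusing the index from above (illustrative only):

CREATE INDEX t2_id_idx ON table2 (id) WHERE is_active;
ANALYZE table2;  -- or VACUUM ANALYZE table2; autovacuum will do the same eventually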