When does PostgreSQL collapse subqueries to joins and when not?

747 Views Asked by tbz At 17 August 2025 at 18:07

Consider the following query:

select a.id from a
where
    a.id in (select b.a_id from b where b.x='x1' and b.y='y1') and
    a.id in (select b.a_id from b where b.x='x2' and b.y='y2')
order by a.date desc
limit 20

It should be rewritable into this faster one:

select a.id from a inner join b as b1 on (a.id=b1.a_id) inner join b as b2 on (a.id=b2.a_id)
where
    b1.x='x1' and b1.y='y1' and
    b2.x='x2' and b2.y='y2'
order by a.date desc
limit 20

We would prefer not to rewrite our queries by changing our source code, as that complicates things considerably (especially when using Django).

Thus, we wonder: when does PostgreSQL collapse subqueries into joins, and when does it not?

This is the simplified data model:

                                      Table "public.a"
      Column       |          Type          |                          Modifiers
-------------------+------------------------+-------------------------------------------------------------
 id                | integer                | not null default nextval('a_id_seq'::regclass)
 date              | date                   | 
 content           | character varying(256) | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "a_id_date" btree (id, date)
Referenced by:
    TABLE "b" CONSTRAINT "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED


       Table "public.b"
  Column  |   Type    | Modifiers 
----------+-----------+-----------
 a_id     | integer   | not null
 x        | text      | not null
 y        | text      | not null
Indexes:
    "b_x_y_a_id" UNIQUE CONSTRAINT, btree (x, y, a_id)
Foreign-key constraints:
    "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED

a has 7 million rows
b has 70 million rows
cardinality of b.x = ~100
cardinality of b.y = ~100000
cardinality of b.x, b.y = ~150000
Imagine additional tables c, d and e that have the same structure as b and could be used to further restrict the resulting a.id values.

PostgreSQL versions we tested the queries on:

PostgreSQL 9.2.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit
PostgreSQL 9.4beta1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit

Query Plans (with empty file cache and mem cache):

**Denis de Bernardy** · Answer 1

Your last comment nails the reason, I think: The two queries are not equivalent unless a unique constraint kicks in to make them equivalent.

Example of an equivalent schema:

denis=# \d a
                         Table "public.a"
 Column |  Type   |                   Modifiers                    
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 d      | date    | not null
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

denis=# \d b
       Table "public.b"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a_id   | integer | not null
 val    | integer | not null
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

Equivalent offending data using that schema:

denis=# select * from a order by d;
 id |     d      
----+------------
  1 | 2014-12-10
  2 | 2014-12-11
  3 | 2014-12-12
  4 | 2014-12-13
  5 | 2014-12-14
  6 | 2014-12-15
(6 rows)

denis=# select * from b order by a_id, val;
 a_id | val 
------+-----
    1 |   1
    1 |   1
    2 |   1
    2 |   1
    2 |   2
    3 |   1
    3 |   1
    3 |   2
(8 rows)

Rows using two IN clauses:

denis=# select a.id, a.d from a where a.id in (select b.a_id from b where b.val = 1) and a.id in (select b.a_id from b where b.val = 2) order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  3 | 2014-12-12
(2 rows)

Rows using two joins:

denis=# select a.id, a.d from a join b b1 on a.id = b1.a_id join b b2 on a.id = b2.a_id where b1.val = 1 and b2.val = 2 order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  2 | 2014-12-11
  3 | 2014-12-12
  3 | 2014-12-12
(4 rows)

I see you already have a unique constraint on b covering (x, y, a_id), though. It may be worth raising the issue on the pgsql-performance mailing list to find out why the subqueries are not collapsed in your particular case, or at least why the two queries do not generate the exact same plan.
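To illustrate the unique-constraint case, here is a minimal sketch using hypothetical temp tables (demo_a and demo_b are illustrative names, not from the question). It assumes a unique constraint on (a_id, val), so each a.id can match at most one row per value and the join form can no longer multiply rows:

```sql
-- Hypothetical demo tables; names and data are illustrative only.
create temp table demo_a (
    id integer primary key,
    d  date not null
);
create temp table demo_b (
    a_id integer not null references demo_a(id),
    val  integer not null,
    unique (a_id, val)   -- assumption: rules out duplicate (a_id, val) rows
);

insert into demo_a values (1, '2014-12-10'), (2, '2014-12-11'), (3, '2014-12-12');
insert into demo_b values (1, 1), (2, 1), (2, 2), (3, 1), (3, 2);

-- IN form: one row per matching a.id
select a.id, a.d
from demo_a a
where a.id in (select b.a_id from demo_b b where b.val = 1)
  and a.id in (select b.a_id from demo_b b where b.val = 2)
order by a.d;

-- Join form: with the unique constraint in place, each a.id matches at most
-- one b1 row and one b2 row, so this returns the same rows as the IN form
select a.id, a.d
from demo_a a
join demo_b b1 on a.id = b1.a_id
join demo_b b2 on a.id = b2.a_id
where b1.val = 1 and b2.val = 2
order by a.d;
```

Both statements should return ids 2 and 3 once each. This only shows that the results coincide under the constraint; it says nothing about whether the planner picks the same plan for both forms.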

**wildplasser** · Answer 2

        -- The table definitions
CREATE TABLE table_a (
        id     SERIAL NOT NULL PRIMARY KEY
        , d      DATE NOT NULL
        );

CREATE TABLE table_b (
        id     SERIAL NOT NULL PRIMARY KEY
        , a_id INTEGER NOT NULL REFERENCES table_a(id)
        , x VARCHAR NOT NULL
        , y VARCHAR NOT NULL
        );
        -- fake some data
INSERT INTO table_a(d)
SELECT gs
FROM generate_series( '1904-01-01'::timestamp ,'2015-01-01'::timestamp, '1 day'::interval) gs;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x1' , 'y1' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x2' , 'y2' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x3' , 'y3' FROM table_a a;
DELETE FROM table_b WHERE RANDOM() > 0.3;

CREATE UNIQUE INDEX ON table_a(d, id);  -- date first
CREATE INDEX ON table_b(a_id);          -- supporting the FK

        -- For initialising the statistics
VACUUM ANALYZE table_a;
VACUUM ANALYZE table_b;

        -- original query
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x1' AND b.y='y1')
  AND a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

        -- EXISTS() version
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x1' AND b.y='y1')
  AND EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

Resulting query plan:

 Limit  (cost=0.87..491.23 rows=20 width=8) (actual time=0.080..0.521 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=0.87..15741.40 rows=642 width=8) (actual time=0.080..0.518 rows=20 loops=1)
         ->  Nested Loop Semi Join  (cost=0.58..14380.54 rows=4043 width=12) (actual time=0.017..0.391 rows=74 loops=1)
               ->  Index Only Scan Backward using table_a_d_id_idx on table_a a  (cost=0.29..732.75 rows=40544 width=8) (actual time=0.008..0.048 rows=231 loops=1)
                     Heap Fetches: 0
               ->  Index Scan using table_b_a_id_idx on table_b b_1  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=231)
                     Index Cond: (a_id = a.id)
                     Filter: (((x)::text = 'x2'::text) AND ((y)::text = 'y2'::text))
                     Rows Removed by Filter: 0
         ->  Index Scan using table_b_a_id_idx on table_b b  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=74)
               Index Cond: (a_id = a.id)
               Filter: (((x)::text = 'x1'::text) AND ((y)::text = 'y1'::text))
               Rows Removed by Filter: 1
 Total runtime: 0.547 ms

The two queries produce exactly the same query plan and results (because of the NOT NULL on table_b.a_id).
The index on table_b(a_id) is absolutely necessary if you prefer index joins over hash joins (with 7M/70M tuples, I think you should prefer index scans).
The expensive sort in the outer query has been avoided by using the index on table_a(d, id).
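Applied to the schema from the question, the same idea might look like the sketch below. It assumes the tables a and b exist exactly as described there. The index name a_date_id_idx is hypothetical. Whether the planner actually chooses a backward index scan depends on your data and statistics, so check the EXPLAIN output rather than taking it for granted.

```sql
-- Assumption: tables a and b exist as in the question's data model.
-- The index name is hypothetical; date comes first, mirroring table_a(d, id).
CREATE INDEX a_date_id_idx ON a ("date", id);

-- b already has the unique index b_x_y_a_id on (x, y, a_id); a btree lookup
-- with equality conditions on all three columns can use it.

ANALYZE a;
ANALYZE b;

-- EXISTS() version of the original query; inspect the plan it gets
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.id
FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id AND b.x = 'x1' AND b.y = 'y1')
  AND EXISTS (SELECT 1 FROM b WHERE b.a_id = a.id AND b.x = 'x2' AND b.y = 'y2')
ORDER BY a."date" DESC
LIMIT 20;
```

Note that EXPLAIN with the ANALYZE option really executes the query, so the timings reflect the current cache state.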
