Postgres query involving joins and count too slow


The query below takes far too long to execute.

Number of records in each table:

  • stories - 416,355 records
  • event_relations - 24,050,862 records
  • districts - 10 records
  • industries - 100 records
  • industry_districts - 40 records
  • subsectors - 200 records
  • supply - 2,869 records
select s.id, s.description, count(distinct et.id) from 
"district s" s 
inner join 
(
    select id.id from "industry_districts" id
) as id2 on id2.id = s.id
inner join 
(
    select id, industry_districts_id from "industries" i
) as i2 on i2.industry_districts_id = id2.id
inner join 
(
    select id, industries_id from "subsectors" si
) as si2 on si2.industries_id = i2.id
inner join 
(
    select id, subsectors_id from "supplys" rs 
) as rs2 on rs2.subsectors_id = si2.id
inner join 
(
    select id, supplys_id, events_id from "event_relations" ers 
) as ers2 on ers2.supplys_id = rs2.id
left join (
    select id, events_id from "stories"  
) et on et.events_id = ers2.events_id
group by s.id
limit 10 offset 0 

**EXPLAIN ANALYZE gives the following result:**

Limit  (cost=0.99..20719301.15 rows=10 width=130) (actual time=2107.465..214891.582 rows=10 loops=1)
  ->  GroupAggregate  (cost=0.99..24863161.18 rows=12 width=130) (actual time=2107.462..214891.567 rows=10 loops=1)
        Group Key: s.id
        ->  Nested Loop Left Join  (cost=0.99..23466091.79 rows=279413855 width=126) (actual time=2.073..141429.178 rows=104217089 loops=1)
              ->  Nested Loop  (cost=0.57..607232.85 rows=11644034 width=126) (actual time=2.055..62770.206 rows=8174966 loops=1)
                    ->  Nested Loop  (cost=0.14..3598.24 rows=1389 width=126) (actual time=2.009..56.815 rows=927 loops=1)
                          Join Filter: (si.id = rs.subsectors_id)
                          Rows Removed by Join Filter: 226384
                          ->  Nested Loop  (cost=0.14..117.68 rows=77 width=126) (actual time=1.004..2.502 rows=80 loops=1)
                                Join Filter: (i.id = si.industries_id)
                                Rows Removed by Join Filter: 5309
                                ->  Nested Loop  (cost=0.14..32.60 rows=34 width=126) (actual time=0.513..0.775 rows=34 loops=1)
                                      Join Filter: (ig.id = i.industry_districts_id)
                                      Rows Removed by Join Filter: 671
                                      ->  Nested Loop  (cost=0.14..18.13 rows=12 width=126) (actual time=0.443..0.507 rows=11 loops=1)
                                            Join Filter: (s.id = ig.id)
                                            Rows Removed by Join Filter: 126
                                            ->  Index Scan using "PK_Sectors" on "Sectors" s  (cost=0.14..12.31 rows=12 width=122) (actual time=0.402..0.417 rows=11 loops=1)
                                            ->  Materialize  (cost=0.00..1.38 rows=25 width=4) (actual time=0.002..0.004 rows=12 loops=11)
                                                  ->  Seq Scan on "Industry_Groups" ig  (cost=0.00..1.25 rows=25 width=4) (actual time=0.011..0.019 rows=25 loops=1)
                                      ->  Materialize  (cost=0.00..2.05 rows=70 width=8) (actual time=0.001..0.013 rows=64 loops=11)
                                            ->  Seq Scan on "Industries" i  (cost=0.00..1.70 rows=70 width=8) (actual time=0.007..0.023 rows=70 loops=1)
                                ->  Materialize  (cost=0.00..4.38 rows=159 width=8) (actual time=0.010..0.028 rows=158 loops=34)
                                      ->  Seq Scan on "Subindustries" si  (cost=0.00..3.59 rows=159 width=8) (actual time=0.345..0.386 rows=159 loops=1)
                          ->  Materialize  (cost=0.00..174.03 rows=2869 width=8) (actual time=0.007..0.307 rows=2841 loops=80)
                                ->  Seq Scan on "supplys" rs  (cost=0.00..159.69 rows=2869 width=8) (actual time=0.578..4.098 rows=2869 loops=1)
                    ->  Index Scan using "IXFK_event_relations_supplys" on "event_relations" ers  (cost=0.44..350.07 rows=8451 width=8) (actual time=0.103..65.007 rows=8819 loops=927)
                          Index Cond: (supplys_id = rs.id)
              ->  Index Scan using "IXFK_Event_Tweets_Katalyst_Events" on "Event_Tweets"  (cost=0.42..1.40 rows=56 width=8) (actual time=0.003..0.006 rows=12 loops=8174966)
                    Index Cond: (katalyst_events_id = ers.katalyst_events_id)
Planning Time: 16.839 ms
Execution Time: 214891.868 ms

explain(analyze, verbose, buffers):

Limit  (cost=0.99..20719301.15 rows=10 width=130) (actual time=2751.614..222343.984 rows=10 loops=1)
  Output: s.id, s.description, (count(DISTINCT "stories".id))
  Buffers: shared hit=29275971 read=6562620, temp read=331820 written=332375
  ->  GroupAggregate  (cost=0.99..24863161.18 rows=12 width=130) (actual time=2751.601..222342.173 rows=10 loops=1)
        Output: s.id, s.description, count(DISTINCT "stories".id)
        Group Key: s.id
        Buffers: shared hit=29275971 read=6562620, temp read=331820 written=332375
        ->  Nested Loop Left Join  (cost=0.99..23466091.79 rows=279413855 width=126) (actual time=2.785..147622.650 rows=104217089 loops=1)
              Output: s.id, s.description, "stories".id
              Buffers: shared hit=29275971 read=6562620
              ->  Nested Loop  (cost=0.57..607232.85 rows=11644034 width=126) (actual time=2.755..65731.047 rows=8174966 loops=1)
                    Output: s.id, s.description, ers.events_id
                    Buffers: shared hit=2084415 read=6047838
                    ->  Nested Loop  (cost=0.14..3598.24 rows=1389 width=126) (actual time=2.013..60.815 rows=927 loops=1)
                          Output: s.id, s.description, rs.id
                          Join Filter: (si.id = rs.subsectors_id)
                          Rows Removed by Join Filter: 226384
                          Buffers: shared hit=5 read=132
                          ->  Nested Loop  (cost=0.14..117.68 rows=77 width=126) (actual time=0.984..2.532 rows=80 loops=1)
                                Output: s.id, s.description, si.id
                                Join Filter: (i.id = si.industries_id)
                                Rows Removed by Join Filter: 5309
                                Buffers: shared hit=5 read=1
                                ->  Nested Loop  (cost=0.14..32.60 rows=34 width=126) (actual time=0.840..1.093 rows=34 loops=1)
                                      Output: s.id, s.description, i.id
                                      Join Filter: (ig.id = i.industry_districts_id)
                                      Rows Removed by Join Filter: 671
                                      Buffers: shared hit=3 read=1
                                      ->  Nested Loop  (cost=0.14..18.13 rows=12 width=126) (actual time=0.759..0.820 rows=11 loops=1)
                                            Output: s.id, s.description, ig.id
                                            Inner Unique: true
                                            Join Filter: (s.id = ig.id)
                                            Rows Removed by Join Filter: 126
                                            Buffers: shared hit=2 read=1
                                            ->  Index Scan using "PK_districts" on public."districts" s  (cost=0.14..12.31 rows=12 width=122) (actual time=0.379..0.392 rows=11 loops=1)
                                                  Output: s.id, s.description, s.active_flag, s.shared_district_id, s.additional_text, s.assets
                                                  Buffers: shared hit=1 read=1
                                            ->  Materialize  (cost=0.00..1.38 rows=25 width=4) (actual time=0.031..0.034 rows=12 loops=11)
                                                  Output: ig.id
                                                  Buffers: shared hit=1
                                                  ->  Seq Scan on public."industry_districts" ig  (cost=0.00..1.25 rows=25 width=4) (actual time=0.043..0.048 rows=25 loops=1)
                                                        Output: ig.id
                                                        Buffers: shared hit=1
                                      ->  Materialize  (cost=0.00..2.05 rows=70 width=8) (actual time=0.003..0.013 rows=64 loops=11)
                                            Output: i.industry_districts_id, i.id
                                            Buffers: shared hit=1
                                            ->  Seq Scan on public."industries" i  (cost=0.00..1.70 rows=70 width=8) (actual time=0.021..0.033 rows=70 loops=1)
                                                  Output: i.industry_districts_id, i.id
                                                  Buffers: shared hit=1
                                ->  Materialize  (cost=0.00..4.38 rows=159 width=8) (actual time=0.001..0.017 rows=158 loops=34)
                                      Output: si.industries_id, si.id
                                      Buffers: shared hit=2
                                      ->  Seq Scan on public."subdistricts" si  (cost=0.00..3.59 rows=159 width=8) (actual time=0.018..0.054 rows=159 loops=1)
                                            Output: si.industries_id, si.id
                                            Buffers: shared hit=2
                          ->  Materialize  (cost=0.00..174.03 rows=2869 width=8) (actual time=0.007..0.315 rows=2841 loops=80)
                                Output: rs.subsectors_id, rs.id
                                Buffers: shared read=131
                                ->  Seq Scan on public."supplys" rs  (cost=0.00..159.69 rows=2869 width=8) (actual time=0.505..3.618 rows=2869 loops=1)
                                      Output: rs.subsectors_id, rs.id
                                      Buffers: shared read=131
                    ->  Index Scan using "IXFK_event_relations_supplys" on public."event_relations" ers  (cost=0.44..350.07 rows=8451 width=8) (actual time=0.112..67.905 rows=8819 loops=927)
                          Output: ers.id, ers.events_id, ers.supplys_id, ers.additional_information
                          Index Cond: (ers.supplys_id = rs.id)
                          Buffers: shared hit=2084410 read=6047706
              ->  Index Scan using "IXFK_stories_events" on public."stories"  (cost=0.42..1.40 rows=56 width=8) (actual time=0.003..0.006 rows=12 loops=8174966)
                    Output: "stories".id, "stories".events_id,"stories".published_at
                    Index Cond: ("stories".events_id = ers.events_id)
                    Buffers: shared hit=27191556 read=514782
Planning:
  Buffers: shared hit=57 read=23
Planning Time: 30.368 ms
Execution Time: 222346.279 ms

Indexes created

CREATE INDEX "IXFK_event_relations_events" ON public."event_relations" USING btree (events_id);

CREATE INDEX "IXFK_event_relations_supplys" ON public."event_relations" USING btree (supplys_id);

CREATE UNIQUE INDEX "PK_event_relations" ON public."event_relations" USING btree (id);

CREATE INDEX "IXFK_stories_events" ON public."stories" USING btree (events_id);

CREATE UNIQUE INDEX "PK_stories" ON public."stories" USING btree (id);

CREATE INDEX "IXFK_industries_industry_districts" ON public."industries" USING btree (industry_districts_id);

CREATE UNIQUE INDEX "PK_industries" ON public."industries" USING btree (id);

CREATE INDEX "IXFK_industry_districts_district s" ON public."industry_districts" USING btree (district s_id);

CREATE UNIQUE INDEX "PK_industry_districts" ON public."industry_districts" USING btree (id);

CREATE UNIQUE INDEX "PK_district s" ON public."district s" USING btree (id);

CREATE INDEX "IXFK_subsectors_industries" ON public."subsectors" USING btree (industries_id);

CREATE UNIQUE INDEX "PK_subsectors" ON public."subsectors" USING btree (id);


CREATE TABLE public."events" (
    id serial4 NOT NULL,
    updated_at timestamptz NULL,
    created_at timestamptz NOT NULL,
    title text NOT NULL,
    CONSTRAINT "PK_events" PRIMARY KEY (id)
);

CREATE TABLE public."districts" (
    id serial4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_districts" PRIMARY KEY (id)
);

CREATE TABLE public."stories" (
    id serial4 NOT NULL,
    events_id int4 NOT NULL,
    created_at timestamptz NULL,
    CONSTRAINT "PK_stories" PRIMARY KEY (id),
    CONSTRAINT "FK_stories_events" FOREIGN KEY (events_id) REFERENCES public."events"(id)
);
CREATE INDEX "IXFK_stories_events" ON public."stories" USING btree (events_id);


CREATE TABLE public."industry_districts" (
    id serial4 NOT NULL,
    districts_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_industry_districts" PRIMARY KEY (id),
    CONSTRAINT "FK_industry_districts_districts" FOREIGN KEY (districts_id) REFERENCES public."districts"(id)
);
CREATE INDEX "IXFK_industry_districts_districts" ON public."industry_districts" USING btree (districts_id);


CREATE TABLE public."industries" (
    id serial4 NOT NULL,
    industry_districts_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_industries" PRIMARY KEY (id),
    CONSTRAINT "FK_industries_industry_districts" FOREIGN KEY (industry_districts_id) REFERENCES public."industry_districts"(id)
);
CREATE INDEX "IXFK_industries_industry_districts" ON public."industries" USING btree (industry_districts_id);


CREATE TABLE public."subsector" (
    id serial4 NOT NULL,
    industries_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_subsector" PRIMARY KEY (id),
    CONSTRAINT "FK_subsector_industries" FOREIGN KEY (industries_id) REFERENCES public."industries"(id)
);
CREATE INDEX "IXFK_subsector_industries" ON public."subsector" USING btree (industries_id);


CREATE TABLE public."supplys" (
    id serial4 NOT NULL,
    subsector_id int4 NOT NULL,
    description text NOT NULL,
    CONSTRAINT "PK_supplys" PRIMARY KEY (id),
    CONSTRAINT "FK_supplys_subsector" FOREIGN KEY (subsector_id) REFERENCES public."subsector"(id)
);
CREATE INDEX "IXFK_supplys_subsector" ON public."supplys" USING btree (subsector_id);


CREATE TABLE public."event_relations" (
    id serial4 NOT NULL,
    events_id int4 NOT NULL,
    supplys_id int4 NOT NULL,
    CONSTRAINT "PK_event_relations" PRIMARY KEY (id),
    CONSTRAINT "FK_event_relations_events" FOREIGN KEY (events_id) REFERENCES public."events"(id),
    CONSTRAINT "FK_event_relations_supplys" FOREIGN KEY (supplys_id) REFERENCES public."supplys"(id)
);
CREATE INDEX "IXFK_event_relations_events" ON public."event_relations" USING btree (events_id);
CREATE INDEX "IXFK_event_relations_supplys" ON public."event_relations" USING btree (supplys_id);

I am using Postgres version 14.0.

How do I improve the performance of this query?

There are 2 answers below.

Answer 1

You could get rid of the subselects and allow PostgreSQL to use its internal optimizations while joining:

SELECT s.id, s.description, count(DISTINCT et.id)
  FROM "district s" s
  INNER JOIN "industry_districts" id2
    ON id2.id = s.id
  INNER JOIN "industries" i2
    ON i2.industry_districts_id = id2.id
  INNER JOIN "subsectors"  si2
    ON si2.industries_id = i2.id
  INNER JOIN "supplys" rs2
    ON rs2.subsectors_id = si2.id
  INNER JOIN "event_relations" ers2
    ON ers2.supplys_id = rs2.id
  LEFT JOIN "event_stories" et
    ON et.events_id = ers2.events_id
  GROUP BY s.id
  LIMIT 10 OFFSET 0;

I have no contextual knowledge of your schema, so excuse me if I have made a typo.

EDIT

You could run a cron job that periodically executes the query above (waiting for it to finish) and stores the results somewhere: a cache, server files, a table, or a materialized view, as you prefer.

Then, whenever users need these values, they read the stored results instead of waiting for the query to run on every access.
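For example, the caching idea could be sketched with a materialized view (the view name and column alias here are illustrative, not from the original post):

```sql
-- Hypothetical materialized view caching the per-district story counts.
CREATE MATERIALIZED VIEW district_story_counts AS
SELECT s.id, s.description, count(DISTINCT et.id) AS story_count
  FROM "districts" s
  INNER JOIN "industry_districts" id2 ON id2.id = s.id
  INNER JOIN "industries" i2 ON i2.industry_districts_id = id2.id
  INNER JOIN "subsectors" si2 ON si2.industries_id = i2.id
  INNER JOIN "supplys" rs2 ON rs2.subsectors_id = si2.id
  INNER JOIN "event_relations" ers2 ON ers2.supplys_id = rs2.id
  LEFT JOIN "stories" et ON et.events_id = ers2.events_id
  GROUP BY s.id;

-- REFRESH ... CONCURRENTLY (non-blocking for readers) requires a unique index on the view.
CREATE UNIQUE INDEX ON district_story_counts (id);

-- The cron job then only needs to run:
REFRESH MATERIALIZED VIEW CONCURRENTLY district_story_counts;
```

Readers would then query the precomputed view (`SELECT id, description, story_count FROM district_story_counts LIMIT 10 OFFSET 0;`), which is a cheap indexed scan instead of the 24-million-row join.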

Answer 2

The following equivalent query should run much faster:

SELECT s.id, s.description, count(DISTINCT et.id)
  FROM (SELECT id, description
          FROM "district s"
          GROUP BY id
          LIMIT 10 OFFSET 0) s
  INNER JOIN "industry_districts" id2
    ON id2.id = s.id
  INNER JOIN "industries" i2
    ON i2.industry_districts_id = id2.id
  INNER JOIN "subsectors"  si2
    ON si2.industries_id = i2.id
  INNER JOIN "supplys" rs2
    ON rs2.subsectors_id = si2.id
  INNER JOIN "event_relations" ers2
    ON ers2.supplys_id = rs2.id
  LEFT JOIN "event_stories" et
    ON et.events_id = ers2.events_id;

Note that the returned rows are indeterminate, both in this query and in the one from the original post, because there is no ORDER BY clause before the LIMIT and OFFSET.
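A minimal sketch of making the pagination deterministic, assuming ordering by the primary key is acceptable, is to add an ORDER BY inside the limited subquery:

```sql
-- With ORDER BY, LIMIT/OFFSET returns a stable, repeatable page of districts.
SELECT id, description
  FROM "districts"
  ORDER BY id
  LIMIT 10 OFFSET 0;
```

Without the ORDER BY, Postgres is free to return any 10 rows, so consecutive pages may overlap or skip rows.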