I want to speed up the following sql (cost is 19685.75). Can I index this sql which have multiple complex nested AND conditions combining with OR in the WHERE statement?
SELECT DISTINCT
ON ("crawler_url"."url") U0."id"
FROM "characteristics_text" U0 LEFT OUTER
JOIN "characteristics_text_urls"
ON (U0."id" = "characteristics_text_urls"."text_id") LEFT OUTER
JOIN "crawler_url"
ON ("characteristics_text_urls"."url_id" = "crawler_url"."id")
WHERE (
(
U0."publication_date" BETWEEN '2018-01-01' AND '2018-12-31'
AND EXTRACT('month' FROM U0."publication_date") = 10
)
OR
(
U0."publication_date" IS NULL
AND U0."lastmod" BETWEEN '2018-01-01' AND '2018-12-31'
AND EXTRACT('month' FROM U0."lastmod") = 10
)
OR
(
U0."publication_date" IS NULL
AND U0."lastmod" IS NULL
AND U0."created_at" BETWEEN '2018-01-01 00:00:00+08:00' AND '2018-12-31 23:59:59.999999+08:00'
AND EXTRACT('month' FROM U0."created_at" AT TIME ZONE 'Asia/Hong_Kong') = 10
)
OR
(
U0."publication_date" >= '2018-08-01'
AND U0."publication_date" < '2018-10-31'
)
OR
(
U0."publication_date" IS NULL
AND U0."lastmod" >='2018-08-01'
AND U0."lastmod" < '2018-10-31'
)
OR
(
U0."publication_date" IS NULL
AND U0."lastmod" IS NULL
AND U0."created_at" >= '2018-07-31 16:00:00+00:00'
AND U0."created_at" < '2018-10-30 16:00:00+00:00'
)
)
ORDER BY "crawler_url"."url" ASC, U0."created_at" DESC
The table text contains the following fields and indexs (some other fields are not shown)
Table "public.characteristics_text"
Column | Type | Modifiers
------------------------+--------------------------+-------------------------------------------------------------------
id | integer | not null default nextval('characteristics_text_id_seq'::regclass)
text | text |
created_at | timestamp with time zone | not null
lastmod | date |
publication_date | date |
Indexes:
"characteristics_text_pkey" PRIMARY KEY, btree (id)
"characteristics_text_fde81f11" btree (created_at)
"characteristics_text_lastmod_3bff34c2_uniq" btree (lastmod)
"characteristics_text_publication_date_772c1bda_uniq" btree (publication_date)
"characteristics_text_publication_date_c6311385_idx" btree (publication_date, lastmod, created_at)
I added three single indexes for created_at, lastmod and publication_date; and one multiple-column index for these fields.
But in postgres EXPAIN query, this where clause is still using Seq Scan but not Index Scan.
-> Seq Scan on characteristics_text u0 (cost=0.00..19685.75 rows=14535 width=12)
Filter: (
(
(publication_date >= '2018-01-01'::date) AND
(publication_date <= '2018-12-31'::date) AND
(
date_part(
'month'::text, (publication_date)::timestamp without time zone
) = 10::double precision)
) OR
((publication_date IS NULL) AND (lastmod >= '2018-01-01'::date) AND (lastmod <= '2018-12-31'::date) AND (date_part('month'::text, (lastmod)::timestamp without time zone) = 10::double precision)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2017-12-31 16:00:00+00'::timestamp with time zone) AND (created_at <= '2018-12-31 15:59:59.999999+00'::timestamp with time zone) AND (date_part('month'::text, timezone('Asia/Hong_Kong'::text, created_at)) = 10::double precision)) OR ((publication_date >= '2018-08-01'::date) AND (publication_date < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-07-31 16:00:00+00'::timestamp with time zone) AND (created_at < '2018-10-30 16:00:00+00'::timestamp with time zone))
)
My questions are:
1. Is it possible to make postgres to use Index scan for this complex SELECT clause?
2. Do I need to create one multiple-column index for each AND clauses? For example creating a index (publication_date, lastmod)
in this causes?
(
U0."publication_date" IS NULL
AND U0."lastmod" BETWEEN '2018-01-01' AND '2018-12-31'
AND EXTRACT('month' FROM U0."lastmod") = 10
)
- Does index works on searching IS NULL? Do field searching for IS NULL need to be index?
UPDATED 4nov2018
When I try minimize the query by testing the fields one by one, fields publication_date
and last_mod
trigger index scan individually, while created_at
can't:
Is it because created_at
is timestamp? But how come index not work for timestamp?
explain SELECT DISTINCT
ON ("crawler_url"."url") U0."id"
FROM "characteristics_text" U0 LEFT OUTER
JOIN "characteristics_text_urls"
ON (U0."id" = "characteristics_text_urls"."text_id") LEFT OUTER
JOIN "crawler_url"
ON ("characteristics_text_urls"."url_id" = "crawler_url"."id")
WHERE (
(
U0."created_at" BETWEEN '2018-01-01 00:00:00+08:00' AND '2018-12-31 23:59:59.999999+08:00'
AND EXTRACT('month' FROM U0."created_at" AT TIME ZONE 'Asia/Hong_Kong') = 10
)
)
ORDER BY "crawler_url"."url" ASC, U0."created_at" DESC
Unique (cost=18004.05..18006.01 rows=393 width=86)
-> Sort (cost=18004.05..18005.03 rows=393 width=86)
Sort Key: crawler_url.url, u0.created_at
-> Nested Loop Left Join (cost=0.71..17987.11 rows=393 width=86)
-> Nested Loop Left Join (cost=0.42..17842.25 rows=393 width=16)
-> Seq Scan on characteristics_text u0 (cost=0.00..15467.37 rows=393 width=12)
Filter: ((created_at >= '2017-12-31 16:00:00+00'::timestamp with time zone) AND (created_at <= '2018-12-31 15:59:59.999999+00'::timestamp with time zone) AND (date_part('month'::text, timezone('Asia/Hong_Kong'::text, created_at)) = 10::double precision))
-> Index Scan using characteristics_text_urls_65eb77fe on characteristics_text_urls (cost=0.42..6.03 rows=1 width=8)
Index Cond: (u0.id = text_id)
-> Index Scan using crawler_url_pkey on crawler_url (cost=0.29..0.36 rows=1 width=78)
Index Cond: (characteristics_text_urls.url_id = id)
publication_date
seems trigger index scan:
(
U0."publication_date" IS NULL
AND U0."lastmod" >='2018-08-01'
AND U0."lastmod" < '2018-10-31'
)
Unique (cost=17053.26..17085.63 rows=6473 width=86)
-> Sort (cost=17053.26..17069.44 rows=6473 width=86)
Sort Key: crawler_url.url, u0.created_at
-> Nested Loop Left Join (cost=11130.73..16643.51 rows=6473 width=86)
-> Hash Right Join (cost=11130.44..14257.63 rows=6473 width=16)
Hash Cond: (characteristics_text_urls.text_id = u0.id)
-> Seq Scan on characteristics_text_urls (cost=0.00..1858.01 rows=120601 width=8)
-> Hash (cost=11049.53..11049.53 rows=6473 width=12)
-> Bitmap Heap Scan on characteristics_text u0 (cost=186.95..11049.53 rows=6473 width=12)
Recheck Cond: ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date))
-> Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx (cost=0.00..185.33 rows=6473 width=0)
Index Cond: ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date))
-> Index Scan using crawler_url_pkey on crawler_url (cost=0.29..0.36 rows=1 width=78)
Index Cond: (characteristics_text_urls.url_id = id)
OK, a Full Table Scan (seq_scan) can actually be faster than multiple Index Scans. It depends on the specific "selectivity" of your filtering conditions.
First of all your
WHERE
clause has six filtering conditions that areOR
ed. That means that if you wanted to use indexes, PostgreSQL would need to use it 6 times, and then perform a "Index OR" to merge the results. That may not be cheap.So first, you need to know what's the expected selectivity of each one of the 6 filtering conditions. This is, how many rows are selected, compared to the total number of rows the table has. Do it; a few simple SQL queries will give you the answer. Post the answer here.
Now, if the sum of all six selectivities is more than 5%, then a Full Table Scan (the algorithm you have now) is faster. Don't bother with indexes.
Otherwise, the following index can help: