How to Index SQL with multiple AND conditions nested inside OR

928 Views Asked by At

I want to speed up the following sql (cost is 19685.75). Can I index this sql which have multiple complex nested AND conditions combining with OR in the WHERE statement?

SELECT DISTINCT
    ON ("crawler_url"."url") U0."id"
FROM "characteristics_text" U0 LEFT OUTER
JOIN "characteristics_text_urls"
    ON (U0."id" = "characteristics_text_urls"."text_id") LEFT OUTER
JOIN "crawler_url"
    ON ("characteristics_text_urls"."url_id" = "crawler_url"."id")
WHERE ( 
    (
        U0."publication_date" BETWEEN '2018-01-01' AND '2018-12-31'
        AND EXTRACT('month' FROM U0."publication_date") = 10
    )
        OR 
    (
        U0."publication_date" IS NULL
        AND U0."lastmod" BETWEEN '2018-01-01' AND '2018-12-31'
        AND EXTRACT('month' FROM U0."lastmod") = 10
    )
        OR 
    (
        U0."publication_date" IS NULL
        AND U0."lastmod" IS NULL
        AND U0."created_at" BETWEEN '2018-01-01 00:00:00+08:00' AND '2018-12-31 23:59:59.999999+08:00'
        AND EXTRACT('month' FROM U0."created_at" AT TIME ZONE 'Asia/Hong_Kong') = 10
    )
        OR 
    (
        U0."publication_date" >= '2018-08-01'
        AND U0."publication_date" < '2018-10-31'
    )
        OR 
    (
        U0."publication_date" IS NULL
        AND U0."lastmod" >='2018-08-01'
        AND U0."lastmod" < '2018-10-31'
    )
        OR 
    (
        U0."publication_date" IS NULL
        AND U0."lastmod" IS NULL
        AND U0."created_at" >= '2018-07-31 16:00:00+00:00'
        AND U0."created_at" < '2018-10-30 16:00:00+00:00'
    ) 
)
ORDER BY  "crawler_url"."url" ASC, U0."created_at" DESC

The table text contains the following fields and indexs (some other fields are not shown)

                                            Table "public.characteristics_text"                                                                                                     
            Column         |           Type           |                             Modifiers                                                                                        
    ------------------------+--------------------------+-------------------------------------------------------------------                                                           
    id                     | integer                  | not null default nextval('characteristics_text_id_seq'::regclass)
    text                   | text                     | 
    created_at             | timestamp with time zone | not null
    lastmod                | date                     | 
    publication_date       | date                     | 
    Indexes:
        "characteristics_text_pkey" PRIMARY KEY, btree (id)
        "characteristics_text_fde81f11" btree (created_at)
        "characteristics_text_lastmod_3bff34c2_uniq" btree (lastmod)
        "characteristics_text_publication_date_772c1bda_uniq" btree (publication_date)
        "characteristics_text_publication_date_c6311385_idx" btree (publication_date, lastmod, created_at)

I added three single indexes for created_at, lastmod and publication_date; and one multiple-column index for these fields.

But in postgres EXPAIN query, this where clause is still using Seq Scan but not Index Scan.

->  Seq Scan on characteristics_text u0  (cost=0.00..19685.75 rows=14535 width=12)
    Filter: (
            (
                (publication_date >= '2018-01-01'::date) AND 
                (publication_date <= '2018-12-31'::date) AND 
                (
                        date_part(
                            'month'::text, (publication_date)::timestamp without time zone
                ) = 10::double precision)
            ) OR 

                ((publication_date IS NULL) AND (lastmod >= '2018-01-01'::date) AND (lastmod <= '2018-12-31'::date) AND (date_part('month'::text, (lastmod)::timestamp without time zone) = 10::double precision)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2017-12-31 16:00:00+00'::timestamp with time zone) AND (created_at <= '2018-12-31 15:59:59.999999+00'::timestamp with time zone) AND (date_part('month'::text, timezone('Asia/Hong_Kong'::text, created_at)) = 10::double precision)) OR ((publication_date >= '2018-08-01'::date) AND (publication_date < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-07-31 16:00:00+00'::timestamp with time zone) AND (created_at < '2018-10-30 16:00:00+00'::timestamp with time zone))
)

My questions are:
1. Is it possible to make postgres to use Index scan for this complex SELECT clause?
2. Do I need to create one multiple-column index for each AND clauses? For example creating a index (publication_date, lastmod) in this causes?

    (
        U0."publication_date" IS NULL
        AND U0."lastmod" BETWEEN '2018-01-01' AND '2018-12-31'
        AND EXTRACT('month' FROM U0."lastmod") = 10
    )
  1. Does index works on searching IS NULL? Do field searching for IS NULL need to be index?

UPDATED 4nov2018

When I try minimize the query by testing the fields one by one, fields publication_date and last_mod trigger index scan individually, while created_at can't:

Is it because created_at is timestamp? But how come index not work for timestamp?

explain SELECT DISTINCT
    ON ("crawler_url"."url") U0."id"
FROM "characteristics_text" U0 LEFT OUTER
JOIN "characteristics_text_urls"
    ON (U0."id" = "characteristics_text_urls"."text_id") LEFT OUTER
JOIN "crawler_url"
    ON ("characteristics_text_urls"."url_id" = "crawler_url"."id")
WHERE ( 
(
        U0."created_at" BETWEEN '2018-01-01 00:00:00+08:00' AND '2018-12-31 23:59:59.999999+08:00'
        AND EXTRACT('month' FROM U0."created_at" AT TIME ZONE 'Asia/Hong_Kong') = 10
    )   
)
ORDER BY  "crawler_url"."url" ASC, U0."created_at" DESC



Unique  (cost=18004.05..18006.01 rows=393 width=86)
->  Sort  (cost=18004.05..18005.03 rows=393 width=86)
        Sort Key: crawler_url.url, u0.created_at
        ->  Nested Loop Left Join  (cost=0.71..17987.11 rows=393 width=86)
            ->  Nested Loop Left Join  (cost=0.42..17842.25 rows=393 width=16)
                    ->  Seq Scan on characteristics_text u0  (cost=0.00..15467.37 rows=393 width=12)
                        Filter: ((created_at >= '2017-12-31 16:00:00+00'::timestamp with time zone) AND (created_at <= '2018-12-31 15:59:59.999999+00'::timestamp with time zone) AND (date_part('month'::text, timezone('Asia/Hong_Kong'::text, created_at)) = 10::double precision))
                    ->  Index Scan using characteristics_text_urls_65eb77fe on characteristics_text_urls  (cost=0.42..6.03 rows=1 width=8)
                        Index Cond: (u0.id = text_id)
            ->  Index Scan using crawler_url_pkey on crawler_url  (cost=0.29..0.36 rows=1 width=78)
                    Index Cond: (characteristics_text_urls.url_id = id)

publication_date seems trigger index scan:

(
    U0."publication_date" IS NULL
    AND U0."lastmod" >='2018-08-01'
    AND U0."lastmod" < '2018-10-31'
)


Unique  (cost=17053.26..17085.63 rows=6473 width=86)
->  Sort  (cost=17053.26..17069.44 rows=6473 width=86)
        Sort Key: crawler_url.url, u0.created_at
        ->  Nested Loop Left Join  (cost=11130.73..16643.51 rows=6473 width=86)
            ->  Hash Right Join  (cost=11130.44..14257.63 rows=6473 width=16)
                    Hash Cond: (characteristics_text_urls.text_id = u0.id)
                    ->  Seq Scan on characteristics_text_urls  (cost=0.00..1858.01 rows=120601 width=8)
                    ->  Hash  (cost=11049.53..11049.53 rows=6473 width=12)
                        ->  Bitmap Heap Scan on characteristics_text u0  (cost=186.95..11049.53 rows=6473 width=12)
                                Recheck Cond: ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date))
                                ->  Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx  (cost=0.00..185.33 rows=6473 width=0)
                                    Index Cond: ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date))
            ->  Index Scan using crawler_url_pkey on crawler_url  (cost=0.29..0.36 rows=1 width=78)
                    Index Cond: (characteristics_text_urls.url_id = id)
3

There are 3 best solutions below

1
On

OK, a Full Table Scan (seq_scan) can actually be faster than multiple Index Scans. It depends on the specific "selectivity" of your filtering conditions.

First of all your WHERE clause has six filtering conditions that are ORed. That means that if you wanted to use indexes, PostgreSQL would need to use it 6 times, and then perform a "Index OR" to merge the results. That may not be cheap.

So first, you need to know what's the expected selectivity of each one of the 6 filtering conditions. This is, how many rows are selected, compared to the total number of rows the table has. Do it; a few simple SQL queries will give you the answer. Post the answer here.

Now, if the sum of all six selectivities is more than 5%, then a Full Table Scan (the algorithm you have now) is faster. Don't bother with indexes.

Otherwise, the following index can help:

create index ix1 on characteristics_text (
  publication_date, 
  lastmod,
  created_at,
  1);
0
On

I doubt that you are going to get an index that is useful here. What you might consider doing is breaking this query into 4 or 5 parts, and then using UNION to stick the results together again. (UNION will remove dupes, WHILE UNION ALL will return all rows).

UNION is a fairly expensive operation, so there is some consideration in how many rows this returns. Using the index may buy more efficiency than the UNION loses if it removes a sufficient number of rows. If many rows are returned, your current form is about as good as it is going to get.

0
On

The whole year of 2018 has 60% out of 100,000 records, which made the database use seq scan. Changing from BEWEEN the whole year to just one month gives index scan.

  AND U0."created_at" >= '2018-10-01 00:00:00+00:00'
    AND U0."created_at" <= '2018-10-31 23:59:59.999999+00:00')

Full SQL:

SELECT DISTINCT
    ON ("crawler_url"."url") U0."id"
FROM "characteristics_text" U0 LEFT OUTER
JOIN "characteristics_text_urls"
    ON (U0."id" = "characteristics_text_urls"."text_id") LEFT OUTER
JOIN "crawler_url"
    ON ("characteristics_text_urls"."url_id" = "crawler_url"."id")
WHERE (
        (U0."publication_date" >= '2018-10-01'
        AND U0."publication_date" <= '2018-11-01')

        OR (U0."publication_date" IS NULL
        AND U0."lastmod" >= '2018-10-01'
        AND U0."lastmod" <= '2018-11-01'
        )

        OR 

        (U0."publication_date" IS NULL
        AND U0."lastmod" IS NULL
        AND U0."created_at" >= '2018-10-01 00:00:00+00:00'
        AND U0."created_at" <= '2018-10-31 23:59:59.999999+00:00')

        OR 

        (U0."publication_date" >= '2018-08-01'
        AND U0."publication_date" < '2018-10-31')

        OR 

        (U0."publication_date" IS NULL
        AND U0."lastmod" >= '2018-08-01'
        AND U0."lastmod" < '2018-10-31')

        OR 

        (U0."publication_date" IS NULL
        AND U0."lastmod" IS NULL
        AND U0."created_at" >= '2018-07-31 16:00:00+00:00'
        AND U0."created_at" < '2018-10-30 16:00:00+00:00')
    )
ORDER BY  "crawler_url"."url" ASC

EXPLAIN statement shows index scan for each AND conditions, so total have 6 index scans.

Unique  (cost=22885.16..22962.39 rows=15446 width=88)
->  Sort  (cost=22885.16..22923.77 rows=15446 width=88)
        Sort Key: crawler_url.url
        ->  Hash Right Join  (cost=18669.29..21068.51 rows=15446 width=88)
            Hash Cond: (crawler_url.id = characteristics_text_urls.url_id)
            ->  Seq Scan on crawler_url  (cost=0.00..1691.88 rows=55288 width=88)
            ->  Hash  (cost=18476.21..18476.21 rows=15446 width=8)
                    ->  Hash Right Join  (cost=14982.09..18476.21 rows=15446 width=8)
                        Hash Cond: (characteristics_text_urls.text_id = u0.id)
                        ->  Seq Scan on characteristics_text_urls  (cost=0.00..1907.25 rows=115525 width=8)
                        ->  Hash  (cost=14789.01..14789.01 rows=15446 width=4)
                                ->  Bitmap Heap Scan on characteristics_text u0  (cost=516.57..14789.01 rows=15446 width=4)
                                    Recheck Cond: (((publication_date >= '2018-10-01'::date) AND (publication_date <= '2018-11-01'::date)) OR ((publication_date IS NULL) AND (lastmod >= '2018-10-01'::date) AND (lastmod <= '2018-11-01'::date)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND (created_at <= '2018-10-31 23:59:59.999999+00'::timestamp with time zone)) OR ((publication_date >= '2018-08-01'::date) AND (publication_date < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date)) OR ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-07-31 16:00:00+00'::timestamp with time zone) AND (created_at < '2018-10-30 16:00:00+00'::timestamp with time zone)))
                                    ->  BitmapOr  (cost=516.57..516.57 rows=16081 width=0)
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_772c1bda_uniq  (cost=0.00..4.53 rows=11 width=0)
                                                Index Cond: ((publication_date >= '2018-10-01'::date) AND (publication_date <= '2018-11-01'::date))
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx  (cost=0.00..6.49 rows=166 width=0)
                                                Index Cond: ((publication_date IS NULL) AND (lastmod >= '2018-10-01'::date) AND (lastmod <= '2018-11-01'::date))
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx  (cost=0.00..14.61 rows=413 width=0)
                                                Index Cond: ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND (created_at <= '2018-10-31 23:59:59.999999+00'::timestamp with time zone))
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_772c1bda_uniq  (cost=0.00..74.61 rows=3419 width=0)
                                                Index Cond: ((publication_date >= '2018-08-01'::date) AND (publication_date < '2018-10-31'::date))
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx  (cost=0.00..108.20 rows=3503 width=0)
                                                Index Cond: ((publication_date IS NULL) AND (lastmod >= '2018-08-01'::date) AND (lastmod < '2018-10-31'::date))
                                            ->  Bitmap Index Scan on characteristics_text_publication_date_c6311385_idx  (cost=0.00..284.95 rows=8569 width=0)
                                                Index Cond: ((publication_date IS NULL) AND (lastmod IS NULL) AND (created_at >= '2018-07-31 16:00:00+00'::timestamp with time zone) AND (created_at < '2018-10-30 16:00:00+00'::timestamp with time zone))