My application needs to fetch rows from a table by id within a specific date range. The ids are not ordered and not predictable, so the query uses IN. The list can contain a lot of ids, say 100,000.
The query looks like this and takes approx. 10 seconds:
SELECT col1, col2, col3, ...
FROM table
WHERE created_at BETWEEN date1 AND date2
AND table_id IN (1, ..., 100000)
When I EXPLAIN ANALYZE it, this is the output:
Gather (cost=1002.79..3322893.01 rows=823578 width=24) (actual time=761.500..5453.013 rows=219908 loops=1)
Workers Planned: 2
Workers Launched: 2
Planning Time: 2.433 ms
Execution Time: 5463.276 ms
Rewritten like this, the query takes approx. 4 seconds:
SELECT col1, col2, col3, ...
FROM table
WHERE created_at BETWEEN date1 AND date2
AND table_id = ANY (
    SELECT value
    FROM UNNEST(ARRAY[1, ..., 100000]) AS value
)
Here's the EXPLAIN ANALYZE output:
Nested Loop (cost=14.52..2879232.45 rows=872265 width=24) (actual time=0.485..548.387 rows=219908 loops=1)
Planning Time: 0.938 ms
Execution Time: 558.358 ms
I have indexed table_id and created_at with BTREE indexes.
My questions: how can I make the query faster? Is my second query already the most efficient form? And if so, is there any trade-off I'm not aware of, or any pitfall in using ANY (SELECT value FROM UNNEST(ARRAY[...]))?
UPDATED
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) of the first query:
QUERY PLAN
Gather (cost=1002.79..3324157.83 rows=823891 width=24) (actual time=2059.074..5390.528 rows=219908 loops=1)
Output: col1, col2, col3
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=315848 read=522241 dirtied=35
I/O Timings: read=1103.008
-> Parallel Seq Scan on table_name (cost=2.79..3240768.73 rows=343288 width=24) (actual time=2056.255..5363.859 rows=73303 loops=3)
Output: col1, col2, col3
Filter: (table_id = ANY ('{1,......, 100000}'::integer[])) AND (created_at >= date1::timestamp with time zone) AND (created_at <= date2::timestamp with time zone)
Rows Removed by Filter: 33444492
Buffers: shared hit=315848 read=522241 dirtied=35
I/O Timings: read=1103.008
Worker 0: actual time=2054.371..5372.919 rows=69788 loops=1
Buffers: shared hit=105747 read=174198 dirtied=3
I/O Timings: read=369.176
Worker 1: actual time=2055.867..5369.968 rows=78986 loops=1
Buffers: shared hit=105794 read=173495
I/O Timings: read=367.726
Settings: effective_cache_size = '32568248kB', jit = 'off', search_path = 'public'
Query Identifier: -2771525684851677483
Planning Time: 1.892 ms
Execution Time: 5403.505 ms
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) of the second query:
QUERY PLAN
Nested Loop (cost=14.52..2880414.99 rows=872605 width=24) (actual time=0.432..553.665 rows=219908 loops=1)
Output: col1, col2, col3
Buffers: shared hit=392247 dirtied=8
-> HashAggregate (cost=13.95..15.95 rows=200 width=4) (actual time=0.407..0.849 rows=1116 loops=1)
Output: value.value
Group Key: value.value
Batches: 1 Memory Usage: 145kB
-> Function Scan on pg_catalog.unnest value (cost=0.00..11.16 rows=1116 width=4) (actual time=0.071..0.143 rows=1116 loops=1)
Output: value.value
Function Call: unnest('{1,....,100000}'::integer[])
-> Index Scan using table_name_table_id_created_at_idx on table_name (cost=0.57..14328.73 rows=782 width=16) (actual time=0.047..0.409 rows=197 loops=1116)
Output: col1, col2, col3
Index Cond: (ss.vehicle_id = value.value)
Filter: (created_at >= date1::timestamp with time zone) AND (created_at <= date2::timestamp with time zone)
Rows Removed by Filter: 229
Buffers: shared hit=392247 dirtied=8
Settings: effective_cache_size = '32568248kB', jit = 'off', search_path = 'public'
Query Identifier: 328163023760902820
Planning Time: 0.374 ms
Execution Time: 566.648 ms
Create test data:
Due to the large number of ids in the query I'll use Python:
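A minimal sketch of such a generator, assuming psycopg2 and a scratch database; the DSN is hypothetical, and the table and column names mirror the anonymized names from the question, with separate single-column indexes as described there:

import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS table_name")
    cur.execute("""
        CREATE TABLE table_name (
            table_id   integer NOT NULL,
            created_at timestamptz NOT NULL,
            col1 integer, col2 integer, col3 integer
        )
    """)
    # ~10M rows spread over 100k ids and one year of timestamps, so the
    # BETWEEN filter rejects a realistic share of the rows for each id.
    cur.execute("""
        INSERT INTO table_name (table_id, created_at, col1, col2, col3)
        SELECT n % 100000, now() - random() * interval '365 days', n, n, n
        FROM generate_series(1, 10000000) AS n
    """)
    # Mirror the question's setup: separate BTREE indexes on each column.
    cur.execute("CREATE INDEX ON table_name (table_id)")
    cur.execute("CREATE INDEX ON table_name (created_at)")
    cur.execute("VACUUM ANALYZE table_name")

conn.close()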
Planning is slow due to the large array.
It uses the index on id to fetch rows, then filters them on created_at. Thus rows that do not satisfy the condition on created_at still require heap fetches. Including created_at in the index would be useful.
An index on (created_at, id) would allow scanning the requested range of created_at, but it cannot also narrow by id, so the ids would have to be pulled out of the index and filtered. This would only be useful if the condition on created_at were very narrow and the most selective in the query. Looking at the row counts in your EXPLAIN, I don't think that is the case.
An index with id as the first column allows fetching the rows for each id directly; created_at then only has to be compared with the requested range. I feel this is more useful.
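For example, a minimal sketch of creating such an index (the index name and DSN are hypothetical):

import psycopg2

# Composite index: id first for the direct lookups, created_at second so
# the date range can be checked inside the index.
with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS table_name_table_id_created_at_idx
        ON table_name (table_id, created_at)
    """)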
This pulls created_at from the index, avoiding heap fetches for rows that will be rejected, so it is slightly faster.
This would be useful if there were many rows for each id, each having a different created_at value, which is not the case here.
This query may cause lots of random IOs, so if the table is on spinning disk and not SSD, it will take a lot longer.
Using IN() instead of =ANY() does not change anything.
Besides including created_at in the index to avoid extra IO, there's not much opportunity to make it faster. This needs one index scan per id; with 100k ids that comes down to about 3µs per id, which is pretty fast. Transferring that many rows to the client will also take time.
If you really need it faster, I'd recommend splitting the list of ids into smaller chunks and executing the query in parallel over several connections. This has the advantage of also parallelizing data encoding and decoding, and processing on the client.
The following parallel Python code runs in 100 ms, which is quite a bit faster.
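Here is a minimal sketch of that approach, assuming psycopg2 and a thread pool with one connection per worker; the DSN, date bounds, and id list are placeholders:

from concurrent.futures import ThreadPoolExecutor
import psycopg2

IDS = list(range(100000))                   # placeholder for the real id list
DATE1, DATE2 = "2024-01-01", "2024-02-01"   # placeholder date bounds
N_WORKERS = 8
CHUNK = -(-len(IDS) // N_WORKERS)           # ceiling division: ids per batch

def fetch_chunk(chunk):
    # One connection and one query per batch; psycopg2 adapts the Python
    # list bound to = ANY(%s) into a PostgreSQL array.
    with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT col1, col2, col3
            FROM table_name
            WHERE created_at BETWEEN %s AND %s
              AND table_id = ANY (%s)
            """,
            (DATE1, DATE2, chunk),
        )
        return cur.fetchall()

chunks = [IDS[i:i + CHUNK] for i in range(0, len(IDS), CHUNK)]
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    rows = [row for part in pool.map(fetch_chunk, chunks) for row in part]

Threads are enough here because psycopg2 releases the GIL while waiting on the server, but the same idea works with multiple processes.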