I have a very slow query that scans through millions of records. The query counts how many numbers fall into each of a set of ranges.
I have two tables, numbers_in_ranges and person:
```sql
create table numbers_in_ranges (
  range_id    number(9,0),
  begin_range number(9,0),
  end_range   number(9,0)
);

create table person (
  id         integer,
  a_number   varchar(9),
  first_name varchar(25),
  last_name  varchar(25)
);
```
Data for numbers_in_ranges
range_id| begin_range | end_range
--------|------------------------
101 | 100000000 | 200000000
102 | 210000000 | 290000000
103 | 350000000 | 459999999
104 | 461000000 | 569999999
106 | 241000000 | 241999999
etc.
Data for person
id | a_number | first_name | last_name
---|------------|------------|-----------
1 | 100000001 | Maria | Doe
2 | 100000999 | Emily | Davis
3 | 150000000 | Dave | Smith
4 | 461000000 | Jane | Jones
6 | 241000001 | John | Doe
7 | 100000002 | Maria | Doe
8 | 100009999 | Emily | Davis
9 | 150000010 | Dave | Smith
10 | 210000001 | Jane | Jones
11 | 210000010 | John | Doe
12 | 281000000 | Jane | Jones
13 | 241000000 | John | Doe
14 | 460000001 | Maria | Doe
15 | 500000999 | Emily | Davis
16 | 550000010 | Dave | Smith
17 | 461000010 | Jane | Jones
18 | 241000020 | John | Doe
etc.
We are getting the range data from a remote database via a database link and storing it in a materialized view.
The query:
```sql
select nums.range_id, count(p.a_number) as a_count
from numbers_in_ranges nums
left join person p
  on to_number(p.a_number) between nums.begin_range and nums.end_range
group by nums.range_id;
```
The result looks like:
range_id| a_count
--------|------------------------
101 | 6
102 | 5
103 | 2
104 | 3
etc.
As I said, this query is very slow.
Here is the explain plan:
```text
Plan hash value: 3785994407
---------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9352 | 264K| | 42601 (31)| 00:00:02 | | | |
| 1 | PX COORDINATOR | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | P->P | HASH |
| 6 | HASH GROUP BY | | 9352 | 264K| | 42601 (31)| 00:00:02 | Q1,01 | PCWP | |
| 7 | MERGE JOIN OUTER | | 2084M| 56G| | 37793 (23)| 00:00:02 | Q1,01 | PCWP | |
| 8 | SORT JOIN | | 9352 | 173K| | 3 (34)| 00:00:01 | Q1,01 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 10 | MAT_VIEW ACCESS FULL | NUMBERS_IN_RANGES | 9352 | 173K| | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
|* 11 | FILTER | | | | | | | Q1,01 | PCWP | |
|* 12 | SORT JOIN | | 89M| 850M| 2732M| 29681 (1)| 00:00:02 | Q1,01 | PCWP | |
| 13 | BUFFER SORT | | | | | | | Q1,01 | PCWC | |
| 14 | PX RECEIVE | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,01 | PCWP | |
| 15 | PX SEND BROADCAST | :TQ10000 | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | P->P | BROADCAST |
| 16 | PX BLOCK ITERATOR | | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWC | |
| 17 | INDEX FAST FULL SCAN| PERSON_AN_IDX | 89M| 850M| | 4944 (1)| 00:00:01 | Q1,00 | PCWP | |
---------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 11 - filter("NUMS"."END_RANGE">=TO_NUMBER("P"."A_NUMBER"(+)))
 12 - access("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
      filter("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))

Note
-----
- automatic DOP: Computed Degree of Parallelism is 16 because of degree limit
```
I tried computing the deltas for the month and applying them to the table:
if a new range_id is found, insert it;
if the range_id already exists, update it.
That way we don't have to scan the whole table.
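In SQL terms, the delta approach looks roughly like this (a sketch only; the staging table ranges_delta and the precomputed-counts table range_counts are hypothetical names for illustration):

```sql
merge into range_counts rc          -- hypothetical table of precomputed per-range counts
using (
  -- recount only the ranges that changed this month
  select nums.range_id, count(p.a_number) as a_count
  from ranges_delta nums            -- hypothetical staging table of new/changed ranges
  left join person p
    on to_number(p.a_number) between nums.begin_range and nums.end_range
  group by nums.range_id
) d
on (rc.range_id = d.range_id)
when matched then
  update set rc.a_count = d.a_count
when not matched then
  insert (range_id, a_count) values (d.range_id, d.a_count);
```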
But this solution didn't work, because some ranges get updated and splicing happens. For example:
we create a new range_id = 110 covering 100110000 to 210000001;
then range_id = 101 is spliced to 100000000 through 100110000,
and range_id = 102 is spliced to 100110001 through 210000000.
Next I thought of creating a trigger to update the table whenever a range is created or updated. However, that is impossible: the data comes from a remote database into a materialized view, and you cannot put a trigger on a read-only materialized view.
My question: is there any other way to do this, or to optimize this query?
Thank you!
The issue is that Oracle broadcasts the table with all the IDs, which looks quite strange for this case.
However, since you only need to count rows and (it looks like) the intervals do not overlap, you can improve performance and avoid joining the two datasets with a trick: transform the data into an event stream, where each start and end value marks the beginning and end of a series, and then count the number of events inside each series. This lets you use match_recognize, which is dramatically faster than the join.
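A sketch of such a match_recognize query (a reconstruction of the idea rather than a definitive statement; it assumes the ranges do not overlap and every a_number is numeric):

```sql
with events as (
  -- pos orders events that share the same value: range start (1) sorts before
  -- a person number (2), which sorts before a range end (3), so numbers equal
  -- to either boundary are counted inside the range
  select begin_range as val, 1 as pos, range_id from numbers_in_ranges
  union all
  select to_number(a_number), 2, null from person
  union all
  select end_range, 3, range_id from numbers_in_ranges
)
select range_id, a_count
from events
match_recognize (
  order by val, pos
  measures b.range_id as range_id,   -- the range this series belongs to
           count(p.val) as a_count   -- how many person numbers fell inside it
  pattern (b p* e)                   -- a start, any number of hits, then an end
  define b as pos = 1,
         p as pos = 2,
         e as pos = 3
);
```

The whole job then reduces to one sort of the combined event stream instead of a join that multiplies the two row sources together.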
The resulting execution plan is much cheaper than the plan of the join-based query shown in the question above.
See db<>fiddle.