Oracle: How can I make "GROUP BY" with regExp terminate more quickly?

32 Views Asked by At

I hope there is someone to help me with the following problem.

  • What I have:

I have a table with search queries terms that contain IPTC category references. Basically the table contains IDs of search queries and for every query there can be n lines with terms as CLOB. The table contains more than 2500 entries.

e.g.:

| search_query_id | query_term                                   |
| --------------- | -------------------------------------------- |
| 1               | '(jackson or houston) AND NOT iptc=MUSIC'    |
| 1               | '(clinton or washington) AND NOT IPTC=MUSIC' |
| 1               | '(owens or bolt) AND IPTC=SPORT'             |
  • What I need:

I need to extract a list of distinct categories for each search query, regardless whether the category is included (IPTC=) or excluded (NOT IPTC=).

search_query_id category
1 MUSIC
1 SPORT
  • What I tried:
SELECT 
  search_query_id, 
  TO_CHAR(SUBSTR(REGEXP_SUBSTR(query_term, 'IPTC=\w+', 1, LEVEL), 6, 100)) AS category
FROM 
(
   select search_query_id, upper(query_term) as query_term 
   from search_query_term sqt 
   where instr(upper(query_term), 'IPTC=') > 0
)
CONNECT BY LEVEL <= REGEXP_count(query_term, 'IPTC=')
group by 
  search_query_id, 
  TO_CHAR(SUBSTR(REGEXP_SUBSTR(query_term, 'IPTC=\w+', 1, LEVEL), 6, 100))
;
  • What I got: When running the above statement, it runs for more than 10 minutes; actually I terminated it after 10 min. Omitting the GROUP BY makes it run within 15 seconds. The same effect can be seen when trying to get a count(*) - it seems to be running forever so I don't even know how many lines have to be processed.

Executing an explain plan provides the following output:

Plan hash value: 1016184565
 
------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                   |   700 | 96600 |   143   (2)| 00:00:01 |
|   1 |  HASH GROUP BY                         |                   |   700 | 96600 |   143   (2)| 00:00:01 |
|*  2 |   CONNECT BY WITHOUT FILTERING (UNIQUE)|                   |       |       |            |          |
|*  3 |    TABLE ACCESS FULL                   | SEARCH_QUERY_TERM |  1004 |   135K|   142   (1)| 00:00:01 |
------------------------------------------------------------------------------------------------------------
 
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
 
   1 - SEL$4FDC73A7
   3 - SEL$4FDC73A7 / SQT@SEL$2
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LEVEL<= REGEXP_COUNT (UPPER("QUERY_TERM" /*+ LOB_BY_VALUE */ ),'IPTC='))
   3 - filter(INSTR(UPPER("QUERY_TERM" /*+ LOB_BY_VALUE */ ),'IPTC=')>0)
 
Column Projection Information (identified by operation id):
-----------------------------------------------------------
 
   1 - (#keys=2) "SEARCH_QUERY_ID"[NUMBER,22], TO_CHAR(SUBSTR( REGEXP_SUBSTR (UPPER("QUERY_TERM" 
       /*+ LOB_BY_VALUE */ ),'IPTC=\w+',1,LEVEL),6,100))[400]
   2 - "QUERY_TERM" /*+ LOB_BY_VALUE */ [LOB,466], "SEARCH_QUERY_ID"[NUMBER,22], LEVEL[4]
   3 - "QUERY_TERM" /*+ LOB_BY_VALUE */ [LOB,466], "SEARCH_QUERY_ID"[NUMBER,22]
  • What I want to know: What am I doing wrong? Is there a better / another way to group the lines?
0

There are 0 best solutions below