I hope there is someone to help me with the following problem.
- What I have:
I have a table with search queries terms that contain IPTC category references. Basically the table contains IDs of search queries and for every query there can be n lines with terms as CLOB. The table contains more than 2500 entries.
e.g.:
| search_query_id | query_term |
| --------------- | -------------------------------------------- |
| 1 | '(jackson or houston) AND NOT iptc=MUSIC' |
| 1 | '(clinton or washington) AND NOT IPTC=MUSIC' |
| 1 | '(owens or bolt) AND IPTC=SPORT' |
- What I need:
I need to extract a list of distinct categories for each search query, regardless whether the category is included (IPTC=) or excluded (NOT IPTC=).
| search_query_id | category |
|---|---|
| 1 | MUSIC |
| 1 | SPORT |
- What I tried:
SELECT
search_query_id,
TO_CHAR(SUBSTR(REGEXP_SUBSTR(query_term, 'IPTC=\w+', 1, LEVEL), 6, 100)) AS category
FROM
(
select search_query_id, upper(query_term) as query_term
from search_query_term sqt
where instr(upper(query_term), 'IPTC=') > 0
)
CONNECT BY LEVEL <= REGEXP_count(query_term, 'IPTC=')
group by
search_query_id,
TO_CHAR(SUBSTR(REGEXP_SUBSTR(query_term, 'IPTC=\w+', 1, LEVEL), 6, 100))
;
- What I got: When running the above statement, it runs for more than 10 minutes; actually I terminated it after 10 min. Omitting the GROUP BY makes it run within 15 seconds. The same effect can be seen when trying to get a count(*) - it seems to be running forever so I don't even know how many lines have to be processed.
Executing an explain plan provides the following output:
Plan hash value: 1016184565
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 700 | 96600 | 143 (2)| 00:00:01 |
| 1 | HASH GROUP BY | | 700 | 96600 | 143 (2)| 00:00:01 |
|* 2 | CONNECT BY WITHOUT FILTERING (UNIQUE)| | | | | |
|* 3 | TABLE ACCESS FULL | SEARCH_QUERY_TERM | 1004 | 135K| 142 (1)| 00:00:01 |
------------------------------------------------------------------------------------------------------------
Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
1 - SEL$4FDC73A7
3 - SEL$4FDC73A7 / SQT@SEL$2
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(LEVEL<= REGEXP_COUNT (UPPER("QUERY_TERM" /*+ LOB_BY_VALUE */ ),'IPTC='))
3 - filter(INSTR(UPPER("QUERY_TERM" /*+ LOB_BY_VALUE */ ),'IPTC=')>0)
Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (#keys=2) "SEARCH_QUERY_ID"[NUMBER,22], TO_CHAR(SUBSTR( REGEXP_SUBSTR (UPPER("QUERY_TERM"
/*+ LOB_BY_VALUE */ ),'IPTC=\w+',1,LEVEL),6,100))[400]
2 - "QUERY_TERM" /*+ LOB_BY_VALUE */ [LOB,466], "SEARCH_QUERY_ID"[NUMBER,22], LEVEL[4]
3 - "QUERY_TERM" /*+ LOB_BY_VALUE */ [LOB,466], "SEARCH_QUERY_ID"[NUMBER,22]
- What I want to know: What am I doing wrong? Is there a better / another way to group the lines?