How to strip a string to get the exact expected output regardless of how the string is formatted in Oracle 11g

87 Views Asked by At

I have a set of strings on which I am performing some operations to retrieve only the part of the string that I want for further operations.

I have used various substr, instr, regexp_replace etc and achieved for some of the strings in a huge list. I initially found that the string is separated using '-'(underscore) characters and the part of string I want starts after third instance of the delimeter.

But, when I researched further the format became clumsy and the operations I did are not giving the expected output for all the type of strings in the columns.

Please find the attached Oracle(11g) SQL code below and help me in optimizing it to retrieve the expected output every time regardless of the string format.

SELECT 'ABCD-012345-EFG-10vXRI47HU-1' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10vXRI47HU-1',INSTR('ABCD-012345-EFG-10vXRI47HU-1','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10vXRI47HU' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSD0U/2' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSD0U/2',INSTR('ABCD-012345-EFG-10zSD0U/2','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSD0U' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10ZsE8h -1' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10ZsE8h -1',INSTR('ABCD-012345-EFG-10ZsE8h -1','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10ZsE8h' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG- 10zSe9K ' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG- 10zSe9K ',INSTR('ABCD-012345-EFG- 10zSe9K ','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-.10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-.10zSe9K',INSTR('ABCD-012345-EFG-.10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K_2' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K_2',INSTR('ABCD-012345-EFG-10zSe9K_2','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K.' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K.',INSTR('ABCD-012345-EFG-10zSe9K.','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345--EFG-10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345--EFG-10zSe9K',INSTR('ABCD-012345--EFG-10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K-' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K-',INSTR('ABCD-012345-EFG-10zSe9K-','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-012345-EFG-10zSe9K-//1/23h' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-012345-EFG-10zSe9K-//1/23h',INSTR('ABCD-012345-EFG-10zSe9K-//1/23h','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL
UNION ALL
SELECT 'ABCD-EFGH-HIJK-012345-10zSe9K' AS MAIN_STRING,
(TRIM(REGEXP_REPLACE(SUBSTR('ABCD-EFGH-HIJK-012345-10zSe9K',INSTR('ABCD-EFGH-HIJK-012345-10zSe9K','-',1,3)+1),'[^0-9A-Za-z]',''))) AS FUNCTION_OUTPUT,
'10zSe9K' AS EXPECTED_OUTPUT FROM DUAL

The output of the above query looks as below:

MAIN_STRING                     FUNCTION_OUTPUT   EXPECTED_OUTPUT
ABCD-012345-EFG-10vXRI47HU-1    10vXRI47HU1       10vXRI47HU
ABCD-012345-EFG-10zSD0U/2       10zSD0U2          10zSD0U   
ABCD-012345-EFG-10ZsE8h -1      10ZsE8h1          10ZsE8h   
ABCD-012345-EFG- 10zSe9K        10zSe9K           10zSe9K   
ABCD-012345-EFG-.10zSe9K        10zSe9K           10zSe9K   
ABCD-012345-EFG-10zSe9K_2       10zSe9K2          10zSe9K   
ABCD-012345-EFG-10zSe9K.        10zSe9K           10zSe9K   
ABCD-012345--EFG-10zSe9K        EFG10zSe9K        10zSe9K   
ABCD-012345-EFG-10zSe9K-        10zSe9K           10zSe9K   
ABCD-012345-EFG-10zSe9K-//1/23h 10zSe9K123h       10zSe9K   
ABCD-EFGH-HIJK-012345-10zSe9K   01234510zSe9K     10zSe9K   

You can observe there are differences between function_output and expected_output in some cases.

Note: The expected output might not always start with -10. It can be any alpha numeric value.

The only rule for this assignment is to extract only the expected output from the strings. The string formats can get too clumsy. I am trying to find a way to get the expected output everytime irrespective of the string format and delimeters available.

I have asked if the expected output can be of only certain number of characters everytime but I still did not get the answer for that.

But based on the data I have seen, it can start with 7 alpha numeric characters and go upto 12 alpha numeric characters. For that reason, I could not use a range of characters in the substring function.

So, I could not limit the substr function to certain length.

I would like to remove the differences and make the function_output match with expected_output every time.

Any help to achieve the expected output every time is highly appreciated.

Thanks In Advance.

Please let me know if you need anymore details.

2

There are 2 best solutions below

2
d r On

Not sure that there is no more rules than in the sample data and expected result. With data provided - this should work:

Select 
    MAIN_STRING,
    CASE 
        WHEN INSTR(E_4_A, '-', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '-', 1, 1) - 1 ) 
        WHEN INSTR(E_4_A, '/', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '/', 1, 1) - 1 )
        WHEN INSTR(E_4_A, '_', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '_', 1, 1) - 1 )
    ELSE E_4_A END "E_4"
From
    ( SELECT 
        REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', '') "MAIN_STRING", 
        SubStr(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), InStr(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '-10', 1, 1) + 1) "E_4_A" 
      FROM  strings  )
/* R e s u l t :
MAIN_STRING                     E_4                           
------------------------------- -------------------------------
ABCD-012345-EFG-10vXRI47HU-1    10vXRI47HU                      
ABCD-012345-EFG-10zSD0U/2       10zSD0U                         
ABCD-012345-EFG-10ZsE8h-1       10ZsE8h                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K_2       10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K         10zSe9K                         
ABCD-012345-EFG-10zSe9K-        10zSe9K                         
ABCD-012345-EFG-10zSe9K-//1/23h 10zSe9K                         
ABCD-EFGH-HIJK-012345-10zSe9K   10zSe9K                     
*/

.. the above code key to look for is '-10' and, when found, get ridd of what you don't want to keep.
... if you want to split it just by '-' then try it this way.

Select 
    MAIN_STRING,
    RTRIM(
            CASE 
                WHEN INSTR(E_4_A, '-', 1, 1) > 0 And Length(SubStr(E_4_A, INSTR(E_4_A, '-', 1, 1) + 1)) <= 5 THEN SubStr(E_4_A, 1, InStr(E_4_A, '-', 1, 1) - 1 ) 
                WHEN INSTR(E_4_A, '-', 1, 1) > 0 And Length(SubStr(E_4_A, INSTR(E_4_A, '-', 1, 1) + 1)) > 5 THEN SubStr(E_4_A, InStr(E_4_A, '-', 1, 1) + 1 ) 
                WHEN INSTR(E_4_A, '/', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '/', 1, 1) - 1 )
                WHEN INSTR(E_4_A, '_', 1, 1) > 0 THEN SubStr(E_4_A, 1, InStr(E_4_A, '_', 1, 1) - 1 )
            ELSE E_4_A 
            END, '-') "E_4"
From
    ( SELECT 
        REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', '') "MAIN_STRING", 
        SubStr(REPLACE(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '//', ''), InStr(REPLACE(REPLACE(REPLACE(REPLACE(MAIN_STRING, '--', '-'), ' ', ''), '.', ''), '//', ''), '-', 1, 3) + 1) "E_4_A" 
      FROM  strings  )
0
William Robertson On

How about this:

with test_data (main_string, expected_output) as (
     select 'ABCD-012345-EFG-10vXRI47HU-1',   '10vXRI47HU'from dual union all
     select 'ABCD-012345-EFG-10zSD0U/2',       '10zSD0U'from dual union all
     select 'ABCD-012345-EFG-10ZsE8h -1',      '10ZsE8h'from dual union all
     select 'ABCD-012345-EFG- 10zSe9K ',       '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-.10zSe9K',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K_2',       '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K.',        '10zSe9K'from dual union all
     select 'ABCD-012345--EFG-10zSe9K',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K-',        '10zSe9K'from dual union all
     select 'ABCD-012345-EFG-10zSe9K-//1/23h', '10zSe9K'from dual union all
     select 'ABCD-EFGH-HIJK-012345-10zSe9K',   '10zSe9K'from dual
)
select main_string
     , expected_output
     , regexp_replace
       ( regexp_replace(main_string,'^(([A-Z]+[ -.]+)|([0-9]+[ -.]+))+')
       , '(\W|_).*$' ) as my_output
from   test_data;

The inner regex_replace strips any runs of all-characters or all-numbers followed by spaces, hyphens or dots from the start of each string.

The outer regex_replace removes anything starting with a non-word character (\W) or underscore from the end. (You could also do it in one pass by constructing the pattern as three subexpressions and picking the second one, which might be fractionally more efficient for a large data set.)

DBFiddle
regex101 for inner replace
regex101 for outer replace example