Regex: How to Implement Negative Lookbehind in PL/SQL

620 Views Asked by At

How do I match all the strings that begin with loockup. and end with _id but not prefixed by msg? Here below are some examples:

lookup.asset_id -> should match
lookup.msg_id -> shouldn't match
lookup.whateverelse_id -> should match

I know Oracle does not support negative lookbehind (i.e. (?<!))... so I've tried to explicitly enumerate the possibilities using alternation:

regexp_count('i_asset := lookup.asset_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
    dbms_output.put_line('match'); -- this matches as expected
end if;

regexp_count('i_msg := lookup.msg_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
    dbms_output.put_line('match'); -- this shouldn’t match
                                   -- but it does like the previous example... why?
end if;

The second regexp_count expression should't match... but it does like the first one. Am I missing something?

EDIT

In the real use case, I've a string that contains PL/SQL code that might contains more than one lookup.xxx_id instances:

declare
    l_source_code varchar2(2048) := '
        ...
        curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');
        asset : = lookup.asset_id(key_val => ''UBSN''); -- this is wrong since it does
                                                        -- not specify key_type
        ...
        msg := lookup.msg_id(key_val => ''hello''); -- this is fine since msg_id does
                                                    -- not require key_type
    ';
    ...
 end;

I need to determine whether there is at least one wrong lookup, i.e. all occurrences, except lookup.msg_id, must also specify the key_type parameter.

2

There are 2 best solutions below

5
On

With lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id, you are basically asking to check for a string

  1. starting with lookup. denoted by lookup\.,
  2. followed by at least one character different from ( denoted by [^\(]+,
  3. followed by either -- ( | | )
    • one character different from m -- [^m], or
    • two characters: m plus no s -- m[^s], or
    • three characters: ms and no g -- ms[^g], and
  4. ending in _id denoted by _id.

So, for lookup.msg_id, the first part matches obviously, the second consumes ms, and leaves the g for the first alternative of the third.

This could be fixed by patching up the third part to be always three characters long like lookup\.[^\(]+([^m]..|m[^s.]|ms[^g])_id. This, however, would fail everything, where the part between lookup. and _id is not at least four characters long:

WITH
Input (s, r) AS (
  SELECT 'lookup.asset_id', 'should match' FROM DUAL UNION ALL
  SELECT 'lookup.msg_id', 'shouldn''t match' FROM DUAL UNION ALL
  SELECT 'lookup.whateverelse_id', 'should match' FROM DUAL UNION ALL
  SELECT 'lookup.a_id', 'should match' FROM DUAL UNION ALL
  SELECT 'lookup.ab_id', 'should match' FROM DUAL UNION ALL
  SELECT 'lookup.abc_id', 'should match' FROM DUAL
)
SELECT
  r, s, INSTR(s, 'lookup.msg_id') has_msg, REGEXP_COUNT(s , 'lookup\.[^\(]+([^m]..|m[^s]|ms[^g])_id') matched FROM Input
;

|               R |                      S | HAS_MSG | MATCHED |
|-----------------|------------------------|---------|---------|
|    should match |        lookup.asset_id |       0 |       1 |
| shouldn't match |          lookup.msg_id |       1 |       0 |
|    should match | lookup.whateverelse_id |       0 |       1 |
|    should match |            lookup.a_id |       0 |       0 |
|    should match |           lookup.ab_id |       0 |       0 |
|    should match |          lookup.abc_id |       0 |       0 |

If you have just to make sure, there is no msg in the position in question, you might want to go for (INSTR(s, 'lookup.msg_id') = 0) AND REGEXP_COUNT(s, 'lookup\.[^\(]+_id') <> 0

For code clarity REGEXP_INSTR(s, 'lookup\.[^\(]+_id') > 0 might be preferable…

@j3d Just comment if further detail is required.

0
On

With the requirements still being kind of vague…

  1. Split the string at the semicolon.
  2. Check each substring s to comply:

    WITH Input (s) AS (
      SELECT '   curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
      SELECT 'curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
      SELECT 'asset := lookup.asset_id(key_val => ''UBSN'');' FROM DUAL UNION ALL
      SELECT 'msg := lookup.msg_id(key_val => ''hello'');' FROM DUAL
    )
    SELECT
      s
    FROM Input
    WHERE REGEXP_LIKE(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id\(key_val => ''[a-zA-Z0-9]+''\);$')
     OR
     ((REGEXP_INSTR(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id') = 0)
      AND (REGEXP_INSTR(s, '[(,]\s*key_type') > 0)
      AND (REGEXP_INSTR(s,
        '^\s*[a-z]+\s+:=\s+lookup\.[a-z]+_id\(( ?key_[a-z]+ => ''[a-zA-Z_]+?'',?)+\);$') > 0)) 
    ;
    
    
    |                                                                        S |
    |--------------------------------------------------------------------------|
    |[tab] curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
    |      curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
    |                                 msg := lookup.msg_id(key_val => 'hello');|
    

This would tolerate a superfluous comma right before the closing parenthesis. But if the input is syntactically correct, such a comma won't exist.