Case-insensitive matching in LPeg.re (Lua)


I'm new to Lua's "LPeg" and "re" modules. I want to write a pattern based on the following rules:

  1. Match a string that starts with "gv_$/gv$/v$/v_$/x$/xv$/dba_/all_/cdb_"; the prefix "SYS.%s*" or "PUBLIC.%s*" is optional
  2. The string must not follow an alphanumeric character, i.e., the pattern should not match "XSYS.DBA_OBJECTS" because the name follows "X"
  3. The pattern is case-insensitive

For example, the strings below should match the pattern:

,sys.dba_objects,       --should return  "sys.dba_objects"    
SyS.Dba_OBJECTS
cdb_objects
dba_hist_snapshot)      --should return  "dba_hist_snapshot"   

My current pattern is below; it only matches a non-alphanumeric character followed by the string in upper case:

p=re.compile[[
         pattern <- %W {owner* name}
         owner   <- 'SYS.'/ 'PUBLIC.'
         name    <- {prefix %a%a (%w/"_"/"$"/"#")+}
         prefix  <- "GV_$"/"GV$"/"V_$"/"V$"/"DBA_"/"ALL_"/"CDB_"
      ]]
print(p:match(",SYS.DBA_OBJECTS")) 

My questions are:

  1. How can I achieve case-insensitive matching? There are some topics about a solution, but I'm too new to understand them
  2. How can I return exactly the matched string only, without also having to include the %W? Something like "(?=...)" in Java

I would highly appreciate it if you could provide the pattern or a related function.

There are 2 best solutions below

On BEST ANSWER

You can try to tweak this grammar

local re = require're'

local p = re.compile[[
    pattern <- ((s? { <name> }) / s / .)* !.
    name    <- (<owner> s? '.' s?)? <prefix> <ident>
    owner   <- (S Y S) / (P U B L I C)
    prefix  <- (G V '_'? '$') / (V '_'? '$') / (D B A '_') / (C D B '_')
    ident   <- [_$#%w]+
    s       <- (<comment> / %s)+
    comment <- '--' (!%nl .)*
    A       <- [aA]
    B       <- [bB]
    C       <- [cC]
    D       <- [dD]
    G       <- [gG]
    I       <- [iI]
    L       <- [lL]
    P       <- [pP]
    S       <- [sS]
    U       <- [uU]
    V       <- [vV]
    Y       <- [yY]
    ]]
local m = { p:match[[
,sys.dba_objects,       --should return  "sys.dba_objects"
SyS.Dba_OBJECTS
cdb_objects
dba_hist_snapshot)      --should return  "dba_hist_snapshot"
]] }
print((table.unpack or unpack)(m))  -- table.unpack in Lua 5.2+, unpack in 5.1

. . . prints match table m:

sys.dba_objects SyS.Dba_OBJECTS cdb_objects     dba_hist_snapshot

Note that case-insensitivity is quite hard to achieve outside of a lexer, so each letter has to get its own rule; you will eventually need more of these.
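As an alternative to one rule per letter, the case-insensitive literals can be built with plain LPeg and handed to re through its definitions table. This is only a sketch: the helper name ci and the definition names SYS, PUBLIC and DBA are made up for illustration, and it only handles ASCII letters.

```lua
local lpeg = require "lpeg"
local re = require "re"

-- Hypothetical helper: build an LPeg pattern that matches `s`
-- while ignoring ASCII case.
local function ci(s)
  local p = lpeg.P(true)
  for ch in s:gmatch(".") do
    local lo, up = ch:lower(), ch:upper()
    if lo == up then
      p = p * lpeg.P(ch)        -- not a letter: match it literally
    else
      p = p * lpeg.S(lo .. up)  -- letter: accept either case
    end
  end
  return p
end

-- Entries of this table are referenced from the grammar as %NAME.
local defs = { SYS = ci "sys", PUBLIC = ci "public", DBA = ci "dba_" }

local p = re.compile([[
    name  <- { (owner '.')? %DBA [_$#%w]+ }
    owner <- %SYS / %PUBLIC
]], defs)

print(p:match("SyS.Dba_OBJECTS"))  --> SyS.Dba_OBJECTS
```

Adding another prefix then means adding one more entry to defs rather than more single-letter rules.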

This grammar takes care of the comments in your sample: it skips them along with whitespace, so the matches after "should return" are not present in the output.

You can fiddle with the prefix and ident rules to specify additional prefixes and allowed characters in object names.

Note: !. means end-of-file, and !%nl means "not end-of-line". ! p and & p construct non-consuming patterns, i.e. the current input position is not advanced on a match (the input is only tested).
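A small sketch of these predicates in action. The last part goes one step further and uses lpeg.B (look-behind) for the "must not follow an alphanumeric" rule from the question; the definition name boundary and the simplified name rule are assumptions made for this example, not part of the grammar above.

```lua
local lpeg = require "lpeg"
local re = require "re"

-- & only tests the input: the capture still starts at 'a'.
print(re.match("abc", "&'a' {.}"))   --> a
-- !. succeeds only when no input is left.
print(re.match("a", "'a' !."))       --> 2
print(re.match("ab", "'a' !."))      --> nil

-- Hypothetical definition: succeeds when the previous character is not
-- alphanumeric (or when there is no previous character at all).
local defs = { boundary = -lpeg.B(lpeg.R("az", "AZ", "09")) }

local p = re.compile([[
    pattern <- (%boundary {name} / .)*
    name    <- 'DBA_' %w+
]], defs)

-- "XDBA_A" is skipped because the name follows "X".
print(p:match("XDBA_A,DBA_B"))       --> DBA_B
```

Because the check is non-consuming, only the text inside { } is returned, so the separator in front of the name never becomes part of the result.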

Note 2: printing with unpack is a gross hack.

Note 3: Here is a traceable LPeg re that can be used to debug grammars. Pass true as the third parameter of re.compile to get an execution trace with the test/match/skip action on each rule and position visited.

On

I finally got a solution, though not a very graceful one: add an additional parameter case_insensitive to the re.compile, re.find, re.match and re.gsub functions. When the parameter is true, case_insensitive_pattern is invoked to rewrite the pattern:

...
local fmt="[%s%s]"
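-- Rewrite the contents of one quoted literal: every letter becomes a class
-- such as [sS], while runs of non-letters stay quoted literals.
local function case_insensitive_pattern(quote,pattern)
    local stack={}        -- pieces of the rewritten pattern, in order
    local is_letter=nil   -- false while the last piece is an unquoted non-letter run
    -- find an optional '%' (group 1) followed by any character (group 2)
    local p = pattern:gsub("(%%?)(.)",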
        function(percent, letter)
            if percent ~= "" or not letter:match("%a") then
                -- if the '%' matched, or `letter` is not a letter, return "as is"
                if is_letter==false then
                    stack[#stack]=stack[#stack]..percent .. letter
                else
                    stack[#stack+1]=percent .. letter
                    is_letter=false
                end
            else
                if is_letter==false then
                    stack[#stack]=quote..stack[#stack]..quote
                    is_letter=true
                end
                -- else, return a case-insensitive character class of the matched letter
                stack[#stack+1]=fmt:format(letter:lower(), letter:upper())
            end
            return ""
        end)
    if is_letter==false then
        stack[#stack]=quote..stack[#stack]..quote
    end
    if #stack<2 then return stack[1] or (quote..pattern..quote) end
    return '('..table.concat(stack,' ')..')'
end

local function compile (p, defs, case_insensitive)
  if mm.type(p) == "pattern" then return p end   -- already compiled
  if case_insensitive==true then
    p=p:gsub([[(['"'])([^\n]-)(%1)]],case_insensitive_pattern):gsub("%(%s*%((.-)%)%s*%)","(%1)")
  end
  local cp = pattern:match(p, 1, defs)
  if not cp then error("incorrect pattern", 3) end
  return cp
end
...
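The same idea can be tried without patching re.lua by generating the re source text for a single literal up front. This is a simplified sketch, not the code above: ci_literal is a hypothetical name, it assumes a non-empty string of ASCII characters, and unlike the patch it does not treat '%' specially.

```lua
-- Hypothetical helper (not part of re.lua): turn a plain string into re
-- source text that matches it case-insensitively.
local function ci_literal(s)
  local parts = {}
  for ch in s:gmatch(".") do
    if ch:match("%a") then
      -- letter: a class holding both cases, e.g. [dD]
      parts[#parts + 1] = "[" .. ch:lower() .. ch:upper() .. "]"
    else
      -- anything else: a one-character quoted literal
      local q = (ch == "'") and '"' or "'"
      parts[#parts + 1] = q .. ch .. q
    end
  end
  return "(" .. table.concat(parts, " ") .. ")"
end

print(ci_literal("dba_"))  --> ([dD] [bB] [aA] '_')
```

The returned text can be spliced into a grammar string before it is passed to re.compile, which keeps the stock re module untouched.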