Matching Unicode punctuation using LPeg

Question

407 views · Asked by Witiko on 28 July 2025 at 02:13

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- any single byte (assumed definition, not shown in the original post)
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

This appears to work, but I see three problems. First, it will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I am reading at most 4 bytes ahead. Second, it probably kills the performance of the parser. Third, it is undefined what the library's match function will do when I feed it a string that ends in a truncated UTF-8 character (although it appears to work now).

I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.


**frangio** · Accepted Answer

The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes belong to it:

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
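
The byte ranges used in the pattern above can also be checked without LPeg. The following is a minimal sketch in plain Lua; the helper name utf8_char_length is hypothetical and exists only to illustrate how the lead byte fixes the length of the sequence. Like the pattern, it checks byte ranges only and does not reject overlong or surrogate encodings.

```lua
-- Hypothetical helper: returns the byte length of the UTF-8 character that
-- starts at position i in s, or nil if the bytes there do not fit the same
-- ranges as the LPeg pattern above.
local function utf8_char_length(s, i)
  local lead = s:byte(i)
  if not lead then return nil end          -- i is past the end of s
  local len
  if lead <= 127 then
    len = 1                                -- ASCII
  elseif lead >= 194 and lead <= 223 then
    len = 2
  elseif lead >= 224 and lead <= 239 then
    len = 3
  elseif lead >= 240 and lead <= 244 then
    len = 4
  else
    return nil                             -- continuation byte or invalid lead byte
  end
  -- Every following byte must be a continuation byte (128..191).
  for k = 1, len - 1 do
    local b = s:byte(i + k)
    if not b or b < 128 or b > 191 then return nil end
  end
  return len
end
```

For example, utf8_char_length("\194\161Hola", 1) yields 2, because U+00A1 is encoded as two bytes, while a truncated sequence such as "\226\128" yields nil. This is also why the LPeg pattern simply fails on a truncated character instead of handing it to the library's match function.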

Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, much as you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

Note that we return i. This is in accordance with what Cmt expects, as the LPeg documentation describes:

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

This means we should return the same number the function receives, that is the position immediately after the UTF-8 character.
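
To see the returned position in action, here is a small sketch that needs only LPeg. Two things in it are assumptions made for illustration: the table is_punctuation is a tiny, incomplete stand-in for unicode.utf8.match(c, "%p"), and the pattern is named utf8_char rather than utf8 so that it does not shadow the utf8 library of Lua 5.3 and later.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte

local utf8_char = lpeg.R("\0\127")
                + lpeg.R("\194\223") * cont
                + lpeg.R("\224\239") * cont * cont
                + lpeg.R("\240\244") * cont * cont * cont

-- Stand-in for unicode.utf8.match(c, "%p"): a deliberately incomplete set,
-- used only so that this sketch runs without Selene Unicode.
local is_punctuation = {
  ["!"] = true, [","] = true, ["."] = true,
  ["\194\161"] = true,      -- U+00A1 INVERTED EXCLAMATION MARK
  ["\226\128\148"] = true,  -- U+2014 EM DASH
}

local punctuation = lpeg.Cmt(lpeg.C(utf8_char), function (s, i, c)
  if is_punctuation[c] then
    return i -- i already points just past the matched character
  end
  -- returning no value makes the match fail
end)

print(punctuation:match("!"))             -- position after a 1-byte character
print(punctuation:match("\194\161Hola"))  -- position after a 2-byte character
print(punctuation:match("Hola"))          -- nil: "H" is not punctuation
```

Because the function returns i unchanged, a successful match advances by exactly one UTF-8 character, whatever its byte length. Returning i + #c instead would skip past input that was never examined.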
