Splitting a multibyte string in Lua


I have a multibyte string in Lua.

local s = "あいうえお"

How do I take the string and split it into a table of strings?

For English text I can use the code below, but it does not work with multibyte characters.

local s = "foo bar 123"
local words = {}
for word in s:gmatch("%w+") do
    table.insert( words, word )
end

There are 3 best solutions below

1
On

As a starting point, RBerteig's answer to the SO question "How to write a unicode symbol in lua" points to the slnunicode library.

The same library is also referred to in the SO question "Is there any lua library that converts a string to bytes using utf8 encoding".

0
On

As others have noted, it's hard to tell what you want to do: where do you want to split for non-ASCII characters, if splitting at spaces doesn't suffice?

If you just want to split non-ASCII text into individual characters, something like the following may suffice:

local s = "oink barf 頑張っています"
-- A run of ASCII non-space bytes or UTF-8 lead bytes, then any trailing bytes
for word in s:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
   print (word)
end

produces:

oink
barf
頑
張
っ
て
い
ま
す

The trick here is that in UTF-8, each multi-byte character consists of a "lead byte" with the top two bits equal to 11 (so \192-\255 in Lua; remember, character escapes in Lua are decimal), followed by one or more "trailing bytes" with the top two bits equal to 10 (\128-\191 in Lua). The pattern uses * rather than + after the trailing-byte class so that runs of ASCII, which have no trailing bytes, also match.
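If you want every character in its own table entry, ASCII included, the same lead-byte/trailing-byte rule can be applied one character at a time. This is only a minimal sketch and assumes the input is valid UTF-8; split_utf8_chars is a hypothetical helper name, not part of any library.

```lua
-- Hypothetical helper: split a UTF-8 string into a table of characters.
-- Assumes valid UTF-8 input; malformed bytes are not detected.
local function split_utf8_chars(s)
  local chars = {}
  -- One byte that is not a trailing byte (ASCII or a lead byte),
  -- followed by any trailing bytes (10xxxxxx, i.e. \128-\191).
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

local chars = split_utf8_chars("あいうえお")
print(#chars)                    -- 5
print(table.concat(chars, " "))  -- あ い う え お
```

Unlike the pattern above, this does not keep ASCII words together: "foo" becomes three entries. It uses only byte ranges, so it should not depend on the utf8 library.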

0
On

If the string is UTF-8 and you are on Lua 5.3, you can use the built-in utf8 library like this:

local s = "あいうえお"
local words = {}
for _, c in utf8.codes(s) do
  table.insert(words, utf8.char(c))
end
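A possible variation, sketched here under the assumption that you are on Lua 5.3 or later, is to iterate with utf8.charpattern instead. That takes each character's bytes straight from the string rather than decoding to a code point and re-encoding it. The utf8.len check is an optional addition of mine to skip invalid input.

```lua
-- Requires Lua 5.3 or later (built-in utf8 library).
local s = "あいうえお"

-- utf8.len returns nil (plus the position of the first invalid byte)
-- when s is not valid UTF-8, so it doubles as a validity check.
local n = utf8.len(s)

local chars = {}
if n then
  -- utf8.charpattern matches exactly one UTF-8 byte sequence,
  -- assuming the subject is valid UTF-8.
  for ch in s:gmatch(utf8.charpattern) do
    chars[#chars + 1] = ch
  end
end

print(n, #chars)
```

For valid input both approaches should give the same table; this one just avoids the utf8.char call per character.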