I'm trying to match kanji compounds in a Japanese sentence using regex.
Right now, I'm using / ((.)*) / to match a space-delimited compound in, for example, 彼はそこに ひと人 でいた。
The problem is that in some sentences the word is at the beginning, or is followed by a punctuation character, e.g. いっ瞬 の間が生まれた。 or 一昨じつ、彼らはそこを出発した。
I've tried something like / ((.)*) |^((.)*) | ((.)*)、 etc., but this matches 彼はそこに ひと人 instead of ひと人 in 彼はそこに ひと人 でいた。
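To show what I mean, here is a minimal PHP sketch that reproduces the over-match. The variable names are just for illustration, and I'm assuming the pattern is closed right after the third alternative and run with the /u modifier, since the text is UTF-8.

```php
<?php
// Sketch only: reproduces the over-match described above.
$sentence = '彼はそこに ひと人 でいた。';

// The combined pattern, closed after the third alternative (an assumption).
// The /u modifier makes PCRE treat pattern and subject as UTF-8,
// so "." matches one whole character instead of one byte.
$pattern = '/ ((.)*) |^((.)*) | ((.)*)、/u';

preg_match($pattern, $sentence, $m);

// The engine tries every alternative at offset 0 before moving right.
// " ((.)*) " fails there (no leading space), but "^((.)*) " succeeds,
// and the greedy (.)* backtracks only as far as the LAST space.
$whole    = $m[0]; // full match, including the trailing space
$captured = $m[3]; // group belonging to the "^((.)*) " alternative

echo $captured, PHP_EOL; // 彼はそこに ひと人
```

So the alternative anchored at the start of the sentence wins before the space-delimited one ever gets a chance to match.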
Is there any way to pack all this in a single regex, or do I have to use one, check whether it returned anything, then try another one if not?
Thanks!
P.S.: I'm using PHP to parse the sentences.
After thinking about it for a long time, I believe there's no way to parse the compounds without delimiting them all with spaces or some other character, which is what I'm doing now :)
For example, if the sentence is 私は ノート、ペンなどが必要だ。, there is no way for the computer to know whether it's 私は (start of sentence and space delimited) or ノート (space and comma delimited) that it should choose.
Thanks everyone for your suggestions...
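To make the ambiguity concrete, here is a rough PHP sketch. The function name candidate_compounds and the delimiter set (start of sentence or space on the left; space, 、 or 。 on the right) are my own assumptions for illustration, not a real solution.

```php
<?php
// Sketch only: collects every substring that satisfies the "relaxed" rules.
// candidate_compounds is a hypothetical helper name.
function candidate_compounds(string $sentence): array
{
    // Left side:  start of sentence or a space.
    // Right side: a space, 、 or 。 (checked with a lookahead so the
    //             delimiter can also serve as the left side of the next match).
    $pattern = '/(?:^| )([^ 、。]+)(?=[ 、。])/u';

    preg_match_all($pattern, $sentence, $m);

    return $m[1]; // captured candidates, in order of appearance
}

$candidates = candidate_compounds('私は ノート、ペンなどが必要だ。');

print_r($candidates); // both 私は and ノート come back
```

Both candidates satisfy the rules, so a single regex cannot tell which one is the compound I meant. That is why I delimit the compound explicitly on both sides.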