The Moses tokenizer is widely used in machine translation and natural language processing experiments.
It contains a condition built from several regex matches:
if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) ||
($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) ||
($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
Please correct me if I'm wrong, but I believe the 2nd and 3rd conditions check:
- whether the prefix is in a list of nonbreaking prefixes
- whether the word is not the last token and there is still a lowercased token as the next word.
My question is about the first condition, which checks:
($pre =~ /\./ && $pre =~ /\p{IsAlpha}/)
Is
$pre =~ /\./
checking whether the prefix is a single full stop? And is
$pre =~ /\p{IsAlpha}/
checking whether the prefix is an alpha from the list of alphabet in the perluniprop? One related question is whether the full stop is already inside the perluniprop alphabet. If so, wouldn't this condition never be true?
On the second condition: I can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.
On the third condition: assuming the code is iterating over @words, and $i is the index of the current word, it checks whether the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).
Is $pre =~ /\./ checking whether the prefix is a single full stop? Not quite. It checks whether any of the characters in the string in $pre is a FULL STOP. Perl first tries to find a match at position 0, then at position 1, and so on, until it finds a match.
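To make the "any position" behaviour concrete, here is a minimal sketch. The sample prefixes and the variable names are hypothetical, chosen only for illustration; they are not taken from the Moses source.

```perl
use strict;
use warnings;

# Hypothetical sample prefixes, for illustration only.
my @samples = ('.', 'e.g', 'U.S.A', 'Mr', '...');

my (%contains_full_stop, %is_single_full_stop);
for my $pre (@samples) {
    # Unanchored: true if a FULL STOP occurs at any position in $pre.
    $contains_full_stop{$pre} = ($pre =~ /\./) ? 1 : 0;

    # Anchored: true only if $pre is exactly one FULL STOP.
    $is_single_full_stop{$pre} = ($pre =~ /\A\.\z/) ? 1 : 0;

    printf "%-6s contains=%d single=%d\n",
        $pre, $contains_full_stop{$pre}, $is_single_full_stop{$pre};
}
```

Only the anchored pattern /\A\.\z/ corresponds to "the prefix is a single full stop"; the unanchored /\./ used in the condition is also true for strings such as e.g and U.S.A.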
\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] According to that page, \p{IsAlpha} is an alias for \p{Alphabetic=Y} [1]. Unicode defines which characters are Alphabetic [2], and there are quite a few.
So, back to the question: $pre =~ /\p{IsAlpha}/ checks whether any of the characters in the string in $pre is an alphabetic character.
Is the full stop already among the alphabetic characters? No.
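The following sketch checks both points: that the full stop is not alphabetic, and what the first condition accepts as a result. The helper name first_condition and the sample strings are assumptions made for this example; only the two regex matches come from the code in the question.

```perl
use strict;
use warnings;

# The full stop is punctuation, so it does not match \p{IsAlpha}.
my $dot_is_alpha = ('.' =~ /\p{IsAlpha}/) ? 1 : 0;

# Hypothetical helper that mirrors only the first condition from the question:
# $pre must contain a full stop somewhere AND an alphabetic character somewhere.
sub first_condition {
    my ($pre) = @_;
    return 1 if $pre =~ /\./ && $pre =~ /\p{IsAlpha}/;
    return 0;
}

# Hypothetical sample prefixes.
for my $pre ('e.g', 'U.S.A', '3.14', 'Mr', '.') {
    printf "%-6s => %d\n", $pre, first_condition($pre);
}
```

Because the two matches are independent and unanchored, the condition holds for strings like e.g and U.S.A, but not for 3.14 (no alphabetic character), Mr (no full stop) or a lone full stop.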
[1] Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.
[2] The list of alphabetic characters is slightly different from the list of letter characters. All letters are alphabetic, but so are some alphabetic marks, Roman numerals, etc.
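As a small illustration of the two notes above, this sketch compares a few spellings of the property and tests one character that is alphabetic without being a letter. The choice of U+2167 (ROMAN NUMERAL EIGHT) and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Different spellings of the same property should agree on any input.
my @spellings = (qr/\p{IsAlpha}/, qr/\p{Is_Alpha}/, qr/\p{Alphabetic=Y}/);
my @on_letter = map { ('a' =~ $_) ? 1 : 0 } @spellings;
my @on_dot    = map { ('.' =~ $_) ? 1 : 0 } @spellings;

# U+2167 ROMAN NUMERAL EIGHT has general category Nl (letter number):
# it is Alphabetic, but it is not in the Letter (L) category.
my $roman_eight = "\x{2167}";
my $is_alpha    = ($roman_eight =~ /\p{IsAlpha}/) ? 1 : 0;
my $is_letter   = ($roman_eight =~ /\p{L}/)       ? 1 : 0;

printf "spellings on 'a': %s\n", join(',', @on_letter);
printf "spellings on '.': %s\n", join(',', @on_dot);
printf "U+%04X alphabetic=%d letter=%d\n", ord($roman_eight), $is_alpha, $is_letter;
```

So a pattern written with \p{L} would be slightly stricter than the \p{IsAlpha} used in the tokenizer's condition.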