I have a sentence:
$text = "word word, dr. word: a.sh. word a.k word?!..";
The special words are: "dr.", "a.sh" and "a.k".
With this code:
$text = "word word, dr. word: a.sh. word a.k word?!..";
$split = preg_split("/[^\w]([\s]+[^\w]|$)/", $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($split);
the regular expression gives me this:
Array (
[0] => word
[1] => word
[2] => dr
[3] => word
[4] => a.sh
[5] => word
[6] => a.k
[7] => word )
and I need:
Array (
[0] => word
[1] => word
[2] => dr. #<----- point must be here because "dr." is a special word
[3] => word
[4] => a.sh. #<----- point must be here because "a.sh" is a special word
[5] => word
[6] => a.k
[7] => word)
I think you're going about this backwards. Instead of trying to define a regular expression for what is not a word, define what is a word, and capture all character sequences that match that.
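Here is a minimal sketch of that idea. The variable names are my own, and I'm assuming the special-word list is "dr.", "a.sh." (with its trailing dot, so the result matches the array you asked for) and "a.k":

```php
<?php
// Assumed input and special-word list; "a.sh." carries its trailing dot
// so that the output matches the array requested in the question.
$text = "word word, dr. word: a.sh. word a.k word?!..";
$specialWords = array("dr.", "a.sh.", "a.k");

// Escape regex metacharacters (and the "~" delimiter) in every special word.
array_walk($specialWords, function (&$word) {
    $word = preg_quote($word, '~');
});

// Special words come first, then the generic word definition \w+.
// The lookbehind and lookahead keep a match from starting or ending mid-word.
$regex = '~(?<!\w)(' . implode('|', $specialWords) . '|\w+)(?!\w)~';

// Collect every match instead of splitting on what is not a word.
preg_match_all($regex, $text, $matches);
$words = $matches[0];
print_r($words);
```

With the sample sentence this should give word, word, dr., word, a.sh., word, a.k, word, in that order.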
The keys here are an array of special words, the array_walk, and the regular expression.
array_walk
This line, right after your array definition, walks through each of your special words and escapes all of the regex special characters (like "." and "?"), including the delimiter we're going to use later. That way, you can define whatever words you like and you don't have to worry about how it will affect the regular expression.
Regular Expression
The regex is actually pretty simple. Implode the special words using a "|" as glue, then add another pipe and your standard word definition (I chose \w+ because it makes the most sense to me). Surround that giant alternation with parentheses to group it, and I added a lookbehind and a lookahead to ensure we weren't stealing from the middle of a word. Because regex works left to right, the a in a.sh. won't be split off into its own word, because the a.sh. special word will capture it. Unless it says a.sh.e, in which case each part of the three-part expression will match as three separate words.
Check it out.
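To see the lookarounds at work, here is a small sketch that runs the same kind of pattern on both cases. The tokenize() helper is a hypothetical name of mine, not a built-in, and the special-word list is again assumed to be "dr.", "a.sh." and "a.k":

```php
<?php
// Hypothetical helper: returns special words and plain \w+ words found in $text.
function tokenize(string $text, array $specialWords): array
{
    // $specialWords is a local copy, so escaping in place is safe here.
    array_walk($specialWords, function (&$word) {
        $word = preg_quote($word, '~');
    });

    // Special words are tried before \w+; the lookarounds reject any
    // candidate that would begin or end in the middle of a word.
    $regex = '~(?<!\w)(' . implode('|', $specialWords) . '|\w+)(?!\w)~';

    preg_match_all($regex, $text, $matches);
    return $matches[0];
}

$specialWords = array("dr.", "a.sh.", "a.k");

// "a.sh." is followed by a space, so the special word wins as a whole.
$whole = tokenize("a.sh. word", $specialWords);

// In "a.sh.e" the special word would end right before "e", so the lookahead
// rejects it and the engine falls back to \w+ for each piece.
$pieces = tokenize("a.sh.e", $specialWords);

print_r($whole);
print_r($pieces);
```

The first call should return a.sh. and word; the second should return a, sh and e as three separate words.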