Split sentence into words (with special word list)


I have this sentence:

$text = "word word, dr. word: a.sh. word a.k word?!..";

The special words are: "dr.", "a.sh" and "a.k".

This code:

$text = "word word, dr. word: a.sh. word a.k word?!..";
$split = preg_split("/[^\w]([\s]+[^\w]|$)/", $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($split);

The regular expression gives me this:

Array
(
    [0] => word
    [1] => word
    [2] => dr
    [3] => word
    [4] => a.sh
    [5] => word
    [6] => a.k
    [7] => word
)

and I need:

Array
(
    [0] => word
    [1] => word
    [2] => dr.    #<----- the period must be kept here because "dr." is a special word
    [3] => word
    [4] => a.sh.  #<----- the period must be kept here because "a.sh" is a special word
    [5] => word
    [6] => a.k
    [7] => word
)


There is 1 solution below


I think you're going about this backwards. Instead of trying to define a regular expression for what is not a word, define what is a word and capture every character sequence that matches it.

$special_words = array("dr.", "a.sh.", "a.k");
// Escape regex metacharacters (and the ~ delimiter) in each special word.
array_walk($special_words, function (&$item) { $item = preg_quote($item, '~'); });

// Try the special words first, then fall back to a plain \w+ word.
$regex = '~(?<!\w)(' . implode('|', $special_words) . '|\w+)(?!\w)~';
$str = 'word word, dr. word: a.sh. word a.k word?!..';
preg_match_all($regex, $str, $matches);
var_dump($matches[0]);

The key pieces here are the array of special words, the array_walk call, and the regular expression.

array_walk

This line, right after your array definition, walks through each of your special words and escapes every regex metacharacter in it (such as . and ?), as well as the ~ delimiter we're going to use later. That way, you can define whatever words you like without worrying about how they will affect the regular expression.
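To make the escaping step concrete, here is a minimal sketch that applies preg_quote to the same special words. It uses array_map rather than array_walk purely for illustration, and the variable name $escaped is an assumption, not part of the code above.

```php
<?php
// Same special words as in the answer.
$special_words = array("dr.", "a.sh.", "a.k");

// Escape each word for use inside a pattern delimited by ~.
// $escaped is an illustrative name; preg_quote and array_map are standard PHP.
$escaped = array_map(function ($word) {
    return preg_quote($word, '~');
}, $special_words);

// Each literal dot is now preceded by a backslash, so it no longer
// means "any character" once the word is placed in the pattern.
print_r($escaped);
```

array_map returns a new array and leaves $special_words untouched, which can be handy if the unescaped words are needed elsewhere.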

Regular Expression

The regex is actually pretty simple. Implode the special words using | as glue, then add another pipe and your standard word definition (I chose \w+ because it makes the most sense to me). Surround that alternation with parentheses to group it; I also added a lookbehind and a lookahead to ensure we aren't stealing from the middle of a word. Because the regex engine scans left to right and tries the alternatives in order, the a in a.sh. won't be split off into its own word: the a.sh. special word captures it first. The exception is something like a.sh.e, where the lookahead rejects the special word and the three parts (a, sh and e) match as three separate words.
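The behaviour described above, including the a.sh.e edge case, can be checked with a small sketch. The split_words helper and the $special variable are hypothetical names introduced here; the pattern itself is the one from the answer. The sketch assumes a non-empty list of special words.

```php
<?php
// Hypothetical helper (not from the original answer) wrapping the same pattern.
// Assumes $special_words is non-empty; an empty list would leave an empty
// alternative in the pattern.
function split_words($text, array $special_words)
{
    $escaped = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $special_words);

    // Special words come first in the alternation so they are tried before \w+.
    $regex = '~(?<!\w)(' . implode('|', $escaped) . '|\w+)(?!\w)~';
    preg_match_all($regex, $text, $matches);

    return $matches[0];
}

$special = array("dr.", "a.sh.", "a.k");

// The special word keeps its trailing period: word, a.sh., word
print_r(split_words('word a.sh. word', $special));

// The lookahead rejects "a.sh." before "e", so the parts match separately: a, sh, e
print_r(split_words('a.sh.e', $special));
```

If a.sh.e should also be treated as a single word, it would have to be added to the special word list itself.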
