Normalizing strings for text matching with preg_replace


I'm performing a pretty simple text matching between a set of names from my MySQL db and a set of strings from a CSV file. Before the actual comparison, I run preg_replace with an array of options to normalize the strings. One of the important replacements is changing irregular abbreviations into regular full words. But I can't seem to capture abbreviations like "Inc." and "Inc", "Corp." and "Corp" that may or may not have a trailing period.

Here is the code:

$patterns = array();
$patterns[0] = '/\s+/';
$patterns[1] = '/&/';
$patterns[2] = '/\bAssoc\.{0,1}\b/';
$patterns[3] = '/\bInc(?!\.)\b/';
$patterns[4] = '/\b(L\.?){2}P\.?/';
$patterns[5] = '/\bUniv(\s|\.)+\b/';
$patterns[6] = '/\bCorp\.?/';
$patterns[7] = '/\bAssn\.?/';
$patterns[8] = '/\bUnivesity\b/';
$patterns[9] = '/\bIntl.\b/';

$replacement = array();
$replacement[0] = ' ';
$replacement[1] = 'and';
$replacement[2] = 'Association';
$replacement[3] = 'Inc.';
$replacement[4] = '';
$replacement[5] = 'University';
$replacement[6] = 'Corporation';
$replacement[7] = 'Association';
$replacement[8] = 'University';
$replacement[9] = 'International';

$name = trim(preg_replace($patterns,$replacement,$name));
if(stristr($name,trim(preg_replace($patterns,$replacement,$org->org_name)))) return $org->org_id;
// code here
}

Here are some matches that aren't working (more to come):

Haystack => Needle

  • "Aries International Inc." => "Aries Intl. Inc."
  • "Phelps Dodge Corporation" => "Phelps Dodge Corp."
  • "McDermott Incorporated" => "McDermott Inc."

As far as I can tell, it's not catching "Inc." and "Corp.", at least not consistently. Any help?

Best answer

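Put the \b right after the abbreviation and make the dot optional after it. In '/\bAssoc\.{0,1}\b/' the \b comes after the dot; a dot is a non-word character, so \b at that position only matches when a word character follows the dot. That fails for "Assoc." followed by a space or the end of the string. Patterns with no trailing \b at all, such as '/\bCorp\.?/', have the opposite problem: they also match the "Corp" inside "Corporation" and turn it into "Corporationoration". Write it like so: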

$patterns[2] = '/\bAssoc\b\.?/';
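The same change can be applied to the other abbreviation patterns. Here is a minimal sketch of that, not a drop-in replacement: the function name normalize_org_name() is hypothetical, the '/\bIncorporated\b/' rule is my own assumption (the original pattern list never rewrites "Incorporated", so the third example could not match without something like it), and the Univ, Univesity and L.L.P. rules are left out for brevity.

```php
<?php
// Hypothetical helper, not from the original post.
function normalize_org_name(string $name): string
{
    // pattern => replacement; preg_replace() applies the rules in this order
    $map = [
        '/\s+/'              => ' ',
        '/&/'                => 'and',
        // \b sits right after the abbreviation, then the optional dot
        '/\bAssoc\b\.?/'     => 'Association',
        '/\bAssn\b\.?/'      => 'Association',
        '/\bCorp\b\.?/'      => 'Corporation',
        '/\bIntl\b\.?/'      => 'International',
        // Assumption: fold the long form into the short one so both sides agree
        '/\bIncorporated\b/' => 'Inc.',
        '/\bInc\b\.?/'       => 'Inc.',
    ];

    return trim(preg_replace(array_keys($map), array_values($map), $name));
}

// Each haystack/needle pair from the question, normalized on both sides
$pairs = [
    'Aries International Inc.' => 'Aries Intl. Inc.',
    'Phelps Dodge Corporation' => 'Phelps Dodge Corp.',
    'McDermott Incorporated'   => 'McDermott Inc.',
];

foreach ($pairs as $haystack => $needle) {
    $found = stripos(normalize_org_name($haystack), normalize_org_name($needle)) !== false;
    echo $haystack, ' / ', $needle, ': ', $found ? 'match' : 'no match', PHP_EOL;
}
```

Because the \b now follows the letters rather than the dot, full words such as "Corporation" and "Association" are left alone, while "Corp", "Corp.", "Inc" and "Inc." all end up in one canonical form.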