Yet another Unicode preg_replace() question


I've read many posts that explain how to deal with Unicode characters, but none of the suggestions are working for me.

My PHP page reads a file that contains strings with non-ASCII (accented) characters, e.g., "Mötor". I want to convert them to their plain ASCII equivalents, e.g., "Motor".

This is what I have tried:

$source = "Mötor";
$test = preg_replace('/[^\w\d\p{L}]/u', "", $source); // Returns null.
$test = preg_replace('/[^\w\d\p{L}]/u', "", htmlentities($source)); // Returns "".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", $source); // Returns "Mötor".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities($source)); // Returns "".
$test = iconv('utf-8', 'ascii//TRANSLIT', $source); // Returns false.

I am stumped. Thanks!

Sammitch On

This is called "transliteration" and intl's Transliterator will work far better than bodging together regular expressions.

$tests = [ "Mötor" ];

$tl = Transliterator::create('Latin-ASCII;');
foreach($tests as $str) {
    var_dump(
        $tl->transliterate($str)
    );
}

Output:

string(5) "Motor"
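
As a hedged sketch building on the snippet above: Transliterator::create() returns null when the ID cannot be resolved, and the class only exists when the intl extension is loaded, so it can be worth guarding both cases. The helper name toAscii() is hypothetical and not part of intl; returning null to signal "not available" is just one possible convention.

```php
<?php
// Hypothetical helper: transliterate to ASCII when the intl extension is available.
function toAscii(string $text): ?string
{
    // Transliterator is provided by the intl extension.
    if (!class_exists('Transliterator')) {
        return null; // the caller decides on a fallback
    }

    $tl = Transliterator::create('Latin-ASCII');
    if ($tl === null) {
        return null; // the transliterator ID could not be resolved
    }

    // transliterate() returns false on failure (e.g. malformed UTF-8 input).
    $result = $tl->transliterate($text);
    return $result === false ? null : $result;
}

// "\u{00F6}" is ö, written as an escape so the script's own encoding does not matter.
var_dump(toAscii("M\u{00F6}tor"));
```

With intl loaded this should print the same "Motor" as above; without it, the call yields NULL instead of a fatal error.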
JosefZ On

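A well-proven way: decompose the string with Normalizer (form D), then strip the combining marks that the decomposition exposes (\p{Mn}):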

<?php
$source = "Mötor, šeřík, Προϊστορία, Україна";
var_dump( $source);
var_dump( preg_replace("/\p{Mn}/u", '',
            Normalizer::normalize( $source, Normalizer::FORM_D )));
?>

Output: .\SO\76446827.php

string(54) "Mötor, šeřík, Προϊστορία, Україна"
string(50) "Motor, serik, Προιστορια, Украіна"

Resources (required reading):

  • Unicode Normalization Forms

  • Regular expressions: Unicode Categories

    • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
      • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
      • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
      • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • PHP manual: Unicode character properties (note /u option for Unicode support in regex)

Note: the test string contains accented characters from several scripts (Western and Eastern European Latin, Greek, and Cyrillic) to demonstrate that the regex is script-independent:

  • ö (U+00F6, Latin Small Letter O With Diaeresis)
  • š (U+0161, Latin Small Letter S With Caron)
  • ř (U+0159, Latin Small Letter R With Caron)
  • í (U+00ED, Latin Small Letter I With Acute)
  • ϊ (U+03CA, Greek Small Letter Iota With Dialytika)
  • ί (U+03AF, Greek Small Letter Iota With Tonos)
  • ї (U+0457, Cyrillic Small Letter Yi)
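
To make the mechanism concrete, here is a minimal sketch of what FORM_D does to a single character, and of one limitation. It assumes the intl extension is loaded (Normalizer lives there) and that the input is valid UTF-8. The helper name stripMarks() is hypothetical. Letters without a canonical decomposition, such as ø (U+00F8), come through unchanged.

```php
<?php
// Requires the intl extension, which provides Normalizer.

// Hypothetical helper: decompose (NFD), then drop non-spacing marks.
// Assumes $text is valid UTF-8.
function stripMarks(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

$composed   = "\u{00F6}"; // ö as a single code point
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_D);

// bin2hex() exposes the raw UTF-8 bytes.
echo bin2hex($composed), "\n";   // c3b6
echo bin2hex($decomposed), "\n"; // 6fcc88, i.e. "o" followed by U+0308

echo stripMarks("M\u{00F6}tor"), "\n"; // Motor
echo stripMarks("\u{00F8}"), "\n";     // ø (no canonical decomposition, so unchanged)
```

If characters like ø, ł, or ß also need ASCII replacements, this approach alone will not cover them.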
Steve A On

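The solution that worked for me was to re-save the file (using Notepad) while specifying UTF-8 as the encoding. The file had apparently been saved in a different encoding, so the strings PHP read were not valid UTF-8, which would explain why the /u patterns returned null and iconv() returned false.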

As others suggested in the comments, another solution would be to use Transliterator. However, it is part of PHP's intl extension, which isn't installed on the (shared) server I am using.
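
A minimal sketch of doing the same repair in code rather than in an editor. It assumes the file was originally saved as Windows-1252 (what Notepad calls "ANSI" on Western European systems); that source encoding is a guess, not something established above. The helper name ensureUtf8() is hypothetical. It relies on preg_match() with the /u modifier failing on malformed UTF-8, and on iconv() being available, as it was in the question.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting only when needed.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function ensureUtf8(string $text, string $assumedSource = 'Windows-1252'): string
{
    // An empty pattern with /u matches any well-formed UTF-8 string;
    // on malformed UTF-8, preg_match() returns false instead of 1.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    $converted = iconv($assumedSource, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }
    return $converted;
}

$fromFile = "M\xF6tor";             // "Mötor" as Windows-1252 bytes
$utf8     = ensureUtf8($fromFile);
var_dump($utf8 === "M\u{00F6}tor"); // bool(true)
```

Once the string is valid UTF-8, the /u patterns and iconv() calls from the question at least receive well-formed input.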