Yet another Unicode preg_replace() question

Question

46 Views Asked by Steve A At 10 June 2023 at 15:17

I've read many posts that explain how to deal with Unicode characters, but none of the suggestions are working for me.

My PHP page reads a file that contains strings with high-order characters, e.g., "Mötor". I want to convert the strings to "normal" characters, e.g., "Motor".

This is what I have tried:

$source = "Mötor";
$test = preg_replace('/[^\w\d\p{L}]/u', "", $source); // Returns null.
$test = preg_replace('/[^\w\d\p{L}]/u', "", htmlentities($source)); // Returns "".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", $source); // Returns "Mötor".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities($source)); // Returns "".
$test = iconv('utf-8', 'ascii//TRANSLIT', $source); // Returns false.

I am stumped. Thanks!

**Sammitch** · Answer 1 · 2023-06-10T19:40:39.537000

This is called "transliteration" and intl's Transliterator will work far better than bodging together regular expressions.

$tests = [ "Mötor" ];

$tl = Transliterator::create('Latin-ASCII;');
foreach($tests as $str) {
    var_dump(
        $tl->transliterate($str)
    );
}

Output:

string(5) "Motor"
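
Where the intl extension may not be loaded (as on some shared hosts), a guarded variant is one way to avoid a fatal error. This is only a sketch: the helper name `to_ascii` is hypothetical, and it assumes the input is already valid UTF-8. `Transliterator::create()` returns null when it cannot build the transliterator, and `transliterate()` returns false on failure, so both results are checked.

```php
<?php
// Hypothetical helper (not part of the answer above).
// Returns the ASCII transliteration, or null when it cannot be produced.
function to_ascii(string $text): ?string
{
    if (!class_exists('Transliterator')) {
        return null; // intl extension is not loaded
    }

    $tl = Transliterator::create('Latin-ASCII');
    if ($tl === null) {
        return null; // the transliterator could not be created
    }

    $result = $tl->transliterate($text);
    return $result === false ? null : $result;
}

// "\u{00F6}" is ö written as an escape, so the example does not depend
// on the encoding this script file happens to be saved in.
var_dump(to_ascii("M\u{00F6}tor"));
```

A null result tells the caller to fall back to another approach rather than silently passing the original string through.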

**JosefZ** · Answer 2 · 2023-06-11T19:22:02.003000

A well-proven way:

<?php
$source = "Mötor, šeřík, Προϊστορία, Україна";
var_dump( $source);
var_dump( preg_replace("/\p{Mn}/u", '',
            Normalizer::normalize( $source, Normalizer::FORM_D )));
?>

Output: .\SO\76446827.php

string(54) "Mötor, šeřík, Προϊστορία, Україна"
string(50) "Motor, serik, Προιστορια, Украіна"

Resources (required reading):

Unicode Normalization Forms
Regular expressions: Unicode Categories
- \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  - \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  - \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  - \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
PHP manual: Unicode character properties (note /u option for Unicode support in regex)

Note: the test string contains accented characters from several scripts (both Western and Eastern Latin, Greek, and Cyrillic) to show that the regex is script-independent:

ö (U+00F6, Latin Small Letter O With Diaeresis)
š (U+0161, Latin Small Letter S With Caron)
ř (U+0159, Latin Small Letter R With Caron)
í (U+00ED, Latin Small Letter I With Acute)
ϊ (U+03CA, Greek Small Letter Iota With Dialytika)
ί (U+03AF, Greek Small Letter Iota With Tonos)
ї (U+0457, Cyrillic Small Letter Yi)
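
To see why the approach works, it can help to look at the bytes. Form D splits a precomposed letter such as ö (U+00F6) into a base letter plus a combining mark (U+0308), and `\p{Mn}` then matches only the mark. The sketch below wraps the two steps in a hypothetical helper, `strip_marks`, which is not part of the answer; it assumes the intl extension for `Normalizer` and returns null when that assumption fails.

```php
<?php
// Hypothetical helper: decompose to NFD, then drop non-spacing marks.
// Returns null if intl's Normalizer is missing or the input cannot be normalized.
function strip_marks(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null;
    }

    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return null; // typically means the input is not valid UTF-8
    }

    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    $o = "\u{00F6}"; // ö as a single precomposed code point

    var_dump(bin2hex($o));                                            // "c3b6"
    var_dump(bin2hex(Normalizer::normalize($o, Normalizer::FORM_D))); // "6fcc88": 'o' + U+0308
    var_dump(strip_marks("M\u{00F6}tor"));                            // "Motor"
}
```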

**Steve A** · Answer 3 · 2023-06-14T14:17:55.137000

The solution that worked for me was to resave the file (using Notepad) while specifying UTF-8.
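
The null and false results in the question are consistent with input that is not valid UTF-8: with the /u modifier, preg_replace() returns null for such a subject. If re-saving the file is not an option, converting at read time is a possible alternative. The sketch below is built on an assumption: that the legacy file is Windows-1252, which may not match your data. The helper name `ensure_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: return $raw as UTF-8.
// ASSUMPTION: anything that is not valid UTF-8 is encoded as $assumed.
function ensure_utf8(string $raw, string $assumed = 'Windows-1252'): string
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }

    $converted = iconv($assumed, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert input from $assumed");
    }
    return $converted;
}

$legacy = "M\xF6tor"; // "Mötor" as Windows-1252 bytes

var_dump(preg_replace('/[^\w\d\p{L}]/u', '', $legacy)); // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);    // bool(true)

var_dump(bin2hex(ensure_utf8($legacy)));                // "4dc3b6746f72"
```

Checking preg_last_error() after a null result is a quick way to confirm that encoding, not the pattern, is the problem.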

Per comments by others, another solution would be to use Transliterator. However, that is a PHP extension which isn't installed on the (shared) server I am using.
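
Without the extension, one possible fallback is a hand-written replacement map passed to strtr(). This is a different technique from Transliterator, not a substitute for it: the map below is illustrative and deliberately incomplete, the names `ASCII_MAP` and `ascii_fallback` are hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Illustrative, incomplete map: extend it with the characters your data contains.
const ASCII_MAP = [
    "\u{00F6}" => 'o', // ö
    "\u{00E4}" => 'a', // ä
    "\u{00FC}" => 'u', // ü
    "\u{00E9}" => 'e', // é
];

// Hypothetical helper: replaces only the characters listed in ASCII_MAP.
function ascii_fallback(string $utf8): string
{
    return strtr($utf8, ASCII_MAP);
}

var_dump(ascii_fallback("M\u{00F6}tor")); // string(5) "Motor"
```

Characters missing from the map pass through unchanged, so this only suits data with a small, known set of accented letters.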
