PHP: allow all accented characters in a person's name, but disallow Chinese/Russian characters

Question

Asked by CopperRabbit on 24 January 2024 at 13:40 · 99 views

I am having trouble allowing all Latin-based characters (including accented ones) while disallowing Chinese/Russian characters.

The first version I had was as follows:

strlen($values['person_name']) != mb_strlen($values['person_name'], 'utf-8')

This worked fine initially, but it stopped working once Icelandic/Czech names came into play.
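The reason this check cannot separate accented Latin letters from Chinese characters is that in UTF-8 every non-ASCII character takes more than one byte, so the byte count and the character count differ in both cases. A minimal sketch, assuming the mbstring extension is loaded; hasMultibyteChars() is a hypothetical helper name used only for illustration:

```php
<?php
// Hypothetical helper: true when the string contains at least one
// multi-byte (non-ASCII) UTF-8 character. Requires ext/mbstring.
function hasMultibyteChars(string $s): bool
{
    return strlen($s) !== mb_strlen($s, 'UTF-8');
}

// Accented Latin names and CJK text are flagged alike,
// because both contain multi-byte characters.
foreach (['John', 'Eliška', 'Þór', '書'] as $name) {
    printf(
        "%s: %d bytes, %d chars, flagged=%s\n",
        $name,
        strlen($name),
        mb_strlen($name, 'UTF-8'),
        hasMultibyteChars($name) ? 'yes' : 'no'
    );
}
```

So the comparison only answers "is this pure ASCII?", which is too coarse for names.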

The second version I had was as follows:

preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i", $values['person_name'])

This seemed to work fine for the majority of cases, but it rejects the following test name:

Eliška Koňaříková
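A likely explanation: read as code points, the À-ÿ range only spans U+00C0 to U+00FF, while š, ň and ř sit in the Latin Extended-A block above U+00FF. Without the u modifier the pattern is also interpreted byte by byte rather than per character. The sketch below only illustrates the code-point side; it assumes mbstring (for mb_ord(), PHP 7.2+), and inLatin1AccentRange() is a hypothetical name:

```php
<?php
// Hypothetical helper: is the character's code point inside the
// À-ÿ range (U+00C0..U+00FF) used by the character class?
function inLatin1AccentRange(string $char): bool
{
    $cp = mb_ord($char, 'UTF-8');
    return $cp !== false && $cp >= 0xC0 && $cp <= 0xFF;
}

// é and ÿ fall inside the range; the Czech letters do not.
foreach (['é', 'ÿ', 'š', 'ň', 'ř'] as $c) {
    printf(
        "%s U+%04X %s\n",
        $c,
        mb_ord($c, 'UTF-8'),
        inLatin1AccentRange($c) ? 'inside' : 'outside'
    );
}
```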

I have tried the following as well without any luck:

preg_match("/[^\w ]/u", $values['person_name'])      //does not allow š
preg_match("/\PL/u", $values['person_name'])      //does not allow š
preg_match("/^[a-zA-Z\s,.'\-\pL]+$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/^[\s,.'-]*\p{L}[\p{L}\s,.'-]*$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý,. ]/u", $values['person_name'])      //allows š, but also allows 書
preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~iu", $values['person_name'])      //does not allow š
preg_match("/^[\p{L}-]*$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/([\w ]{2,})/u", $values['person_name'])      //allows š, but also allows 書
preg_match('/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u', $values['person_name'])      //allows š, but also allows 書

All of the above either failed with the name provided, or it allowed Chinese characters.

I believe the best route for me would be to revert back to the check that was working for most characters (except with the Czech names that are giving an error):

preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i", $values['person_name'])

And manually add the Czech characters that are not accepted such as š, ň, ř, etc.

Is there a cleaner solution than manually having to specify each of these characters?

**Milad Elyasi** · Answer 1 · 2024-01-24T14:41:08.043000

Maybe it's better to replace the characters instead. The following is only an example of doing that, not a complete function:

<?php
function replace($str, $options = array())
    {

        // Make sure string is in UTF-8 and strip invalid UTF-8 characters
        $str = mb_convert_encoding((string)$str, 'UTF-8', mb_list_encodings());

        $defaults = array(
            'delimiter' => '',
            'limit' => null,
            'lowercase' => true,
            'replacements' => array(),
            'transliterate' => false,
        );

        // Merge options
        $options = array_merge($defaults, $options);

        $char_map = array(
            // Latin
            'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C',
            'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
            'Ð' => 'D', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ő' => 'O',
            'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ű' => 'U', 'Ý' => 'Y', 'Þ' => 'TH',
            'ß' => 'ss',
            'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c',
            'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
            'ð' => 'd', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ő' => 'o',
            'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ű' => 'u', 'ý' => 'y', 'þ' => 'th',
            'ÿ' => 'y',
            // Latin symbols
            '©' => '(c)',
            // Greek
            'Α' => 'A', 'Β' => 'B', 'Γ' => 'G', 'Δ' => 'D', 'Ε' => 'E', 'Ζ' => 'Z', 'Η' => 'H', 'Θ' => '8',
            'Ι' => 'I', 'Κ' => 'K', 'Λ' => 'L', 'Μ' => 'M', 'Ν' => 'N', 'Ξ' => '3', 'Ο' => 'O', 'Π' => 'P',
            'Ρ' => 'R', 'Σ' => 'S', 'Τ' => 'T', 'Υ' => 'Y', 'Φ' => 'F', 'Χ' => 'X', 'Ψ' => 'PS', 'Ω' => 'W',
            'Ά' => 'A', 'Έ' => 'E', 'Ί' => 'I', 'Ό' => 'O', 'Ύ' => 'Y', 'Ή' => 'H', 'Ώ' => 'W', 'Ϊ' => 'I',
            'Ϋ' => 'Y',
            'α' => 'a', 'β' => 'b', 'γ' => 'g', 'δ' => 'd', 'ε' => 'e', 'ζ' => 'z', 'η' => 'h', 'θ' => '8',
            'ι' => 'i', 'κ' => 'k', 'λ' => 'l', 'μ' => 'm', 'ν' => 'n', 'ξ' => '3', 'ο' => 'o', 'π' => 'p',
            'ρ' => 'r', 'σ' => 's', 'τ' => 't', 'υ' => 'y', 'φ' => 'f', 'χ' => 'x', 'ψ' => 'ps', 'ω' => 'w',
            'ά' => 'a', 'έ' => 'e', 'ί' => 'i', 'ό' => 'o', 'ύ' => 'y', 'ή' => 'h', 'ώ' => 'w', 'ς' => 's',
            'ϊ' => 'i', 'ΰ' => 'y', 'ϋ' => 'y', 'ΐ' => 'i',
            // Turkish
            'Ş' => 'S', 'İ' => 'I', 'Ç' => 'C', 'Ü' => 'U', 'Ö' => 'O', 'Ğ' => 'G',
            'ş' => 's', 'ı' => 'i', 'ç' => 'c', 'ü' => 'u', 'ö' => 'o', 'ğ' => 'g',
            // Russian
            'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D', 'Е' => 'E', 'Ё' => 'Yo', 'Ж' => 'Zh',
            'З' => 'Z', 'И' => 'I', 'Й' => 'J', 'К' => 'K', 'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O',
            'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F', 'Х' => 'H', 'Ц' => 'C',
            'Ч' => 'Ch', 'Ш' => 'Sh', 'Щ' => 'Sh', 'Ъ' => '', 'Ы' => 'Y', 'Ь' => '', 'Э' => 'E', 'Ю' => 'Yu',
            'Я' => 'Ya',
            'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ё' => 'yo', 'ж' => 'zh',
            'з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o',
            'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c',
            'ч' => 'ch', 'ш' => 'sh', 'щ' => 'sh', 'ъ' => '', 'ы' => 'y', 'ь' => '', 'э' => 'e', 'ю' => 'yu',
            'я' => 'ya',
            // Ukrainian
            'Є' => 'Ye', 'І' => 'I', 'Ї' => 'Yi', 'Ґ' => 'G',
            'є' => 'ye', 'і' => 'i', 'ї' => 'yi', 'ґ' => 'g',
            // Czech
            'Č' => 'C', 'Ď' => 'D', 'Ě' => 'E', 'Ň' => 'N', 'Ř' => 'R', 'Š' => 'S', 'Ť' => 'T', 'Ů' => 'U',
            'Ž' => 'Z',
            'č' => 'c', 'ď' => 'd', 'ě' => 'e', 'ň' => 'n', 'ř' => 'r', 'š' => 's', 'ť' => 't', 'ů' => 'u',
            'ž' => 'z',
            // Polish
            'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N', 'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z',
            'Ż' => 'Z',
            'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n', 'ó' => 'o', 'ś' => 's', 'ź' => 'z',
            'ż' => 'z',
            // Latvian
            'Ā' => 'A', 'Č' => 'C', 'Ē' => 'E', 'Ģ' => 'G', 'Ī' => 'I', 'Ķ' => 'K', 'Ļ' => 'L', 'Ņ' => 'N',
            'Š' => 'S', 'Ū' => 'U', 'Ž' => 'Z',
            'ā' => 'a', 'č' => 'c', 'ē' => 'e', 'ģ' => 'g', 'ī' => 'i', 'ķ' => 'k', 'ļ' => 'l', 'ņ' => 'n',
            'š' => 's', 'ū' => 'u', 'ž' => 'z'
        );

        // Make custom replacements
        $str = preg_replace(array_keys($options['replacements']), $options['replacements'], $str);

        // Transliterate characters to ASCII
        if ($options['transliterate']) {
            $str = str_replace(array_keys($char_map), $char_map, $str);
        }

        // Replace non-alphanumeric characters with our delimiter
        $str = preg_replace('/[^\p{L}\p{Nd}]+/u', $options['delimiter'], $str);

        // Remove duplicate delimiters
        $str = preg_replace('/(' . preg_quote($options['delimiter'], '/') . '){2,}/', '$1', $str);

        // Truncate slug to max. characters
        $str = mb_substr($str, 0, ($options['limit'] ? $options['limit'] : mb_strlen($str, 'UTF-8')), 'UTF-8');

        // Remove delimiter from ends
        $str = trim($str, $options['delimiter']);

        return $options['lowercase'] ? mb_strtolower($str, 'UTF-8') : $str;
    }
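For validating a name rather than building a slug, the same idea can be reduced to "fold known accented letters to ASCII, then check the result with a plain ASCII pattern". This is only a rough sketch with a deliberately tiny map; foldToAscii() and isPlainName() are hypothetical names, and a real map would need the full table above:

```php
<?php
// Hypothetical helper: replace a few accented letters with ASCII ones.
// strtr() with an array works on UTF-8 because it matches whole byte sequences.
function foldToAscii(string $name): string
{
    $map = [
        'š' => 's', 'ň' => 'n', 'ř' => 'r', 'í' => 'i', 'á' => 'a',
        'Š' => 'S', 'Ň' => 'N', 'Ř' => 'R', 'Í' => 'I', 'Á' => 'A',
    ];
    return strtr($name, $map);
}

// Hypothetical helper: accept the name only if the folded form is
// made of ASCII letters, spaces, apostrophes and hyphens.
function isPlainName(string $name): bool
{
    return preg_match("/^[a-z][a-z '-]*$/Di", foldToAscii($name)) === 1;
}

var_dump(foldToAscii('Eliška Koňaříková')); // folded ASCII form
var_dump(isPlainName('Eliška Koňaříková')); // accepted
var_dump(isPlainName('書'));                 // not in the map, so rejected
```

Anything that is not in the map stays non-ASCII and is rejected, which is also the weakness of this approach: every missing letter has to be added by hand.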

**ThW** · Answer 2 · 2024-01-24T16:33:33.037000

preg_match() lets you match on Unicode scripts. Building the pattern step by step:

Latin script: \p{Latin}
At least one char: \p{Latin}+
Anchor to string start/end: ^\p{Latin}+$
Pattern delimiters: (^\p{Latin}+$)
Disallow linefeed at string end: (^\p{Latin}+$)D
Unicode (UTF-8) mode: (^\p{Latin}+$)Du

$values = ['English', 'አማርኛ', 'Anarâškielâ', 'अंगिका', 'Аԥсшәа', 'Aragonés', 'অসমীয়া'];

foreach ($values as $value) {
  $matched = preg_match('(^\\p{Latin}+$)Du', $value);
  echo $value, ' ', ($matched ? '✔️' : '❌'), "\n";
}

Output:

English ✔️
አማርኛ ❌
Anarâškielâ ✔️
अंगिका ❌
Аԥсшәа ❌
Aragonés ✔️
অসমীয়া ❌
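Note that the pattern above matches single words only, so a full name such as "Eliška Koňaříková" would fail on the space. One possible extension, as a sketch: require a Latin letter first and then allow Latin letters, spaces, apostrophes and hyphens. isLatinName() is a hypothetical name, the set of allowed separators is an assumption, and the input is assumed to use precomposed characters (no separate combining marks):

```php
<?php
// Hypothetical helper: a name must start with a Latin-script letter and may
// continue with Latin letters, spaces, apostrophes and hyphens.
// D = "$" matches only at the very end, u = UTF-8 mode.
function isLatinName(string $name): bool
{
    return preg_match('/^\p{Latin}[\p{Latin}\' -]*$/Du', $name) === 1;
}

$names = ['Eliška Koňaříková', "O'Connor", 'Anne-Marie', '書', 'Добре', 'Eliška 123'];

foreach ($names as $name) {
    echo $name, ' ', (isLatinName($name) ? 'accepted' : 'rejected'), "\n";
}
```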

For transliteration, check the Transliterator class. It is part of PHP's standard Unicode extension, ext/intl, and allows extensive transformations of Unicode strings.

$transliterator = \Transliterator::create('Any-Latin'); 
var_dump($transliterator->transliterate('አማርኛ Anarâškielâ अंगिका Аԥсшәа Aragonés অসমীয়া'));
$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII'); 
var_dump($transliterator->transliterate('አማርኛ Anarâškielâ अंगिका Аԥсшәа Aragonés অসমীয়া'));

Output:

string(69) "አማርኛ Anarâškielâ aṅgikā Aԥsšəa Aragonés asamīẏā"
string(57) "አማርኛ Anaraskiela angika Aԥssəa Aragones asamiya"

The first word in the example, which is left untransformed, is Amharic. Even ICU has limits, depending on the version.

More about the ICU Script Transliterations: https://unicode-org.github.io/icu/userguide/transforms/general/#scriptlanguage

**Casimir et Hippolyte** · Answer 3 · 2024-01-25T20:47:54.540000

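You can use a script run, written (*asr:...) for "atomic script run" in PCRE2 (this needs a PCRE2 build with script-run support), combined with a lookahead that requires the first character to be Latin: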

$pattern = '~
  \A
    (?=\p{Latin}) 
    (*asr: [\pL\pM ]+)
  \z
~xu';

$tests = [
    'Das Wohltemperierte Klavier',
    'Добре темперирано пиано',
    'Le Clavier bien tempére'."\xcc\x81", // <-- combining acute
    'Il clavicembalo ben temperato',
    '平均律クラヴィーア曲集',
    'El clavecí ben temprat',
    'เดอะเวลล์-เทมเพิร์ดคลาเวียร์',
    'Eliška Koňaříková',
    'Eliška Koňaříková 1 2 3',
    
];

print_r(preg_grep($pattern, $tests));
