PHP allow all accented characters in person name, but don't allow Chinese/Russian characters

I am having trouble allowing all English/Latin-based characters (including accented ones) while disallowing Chinese/Russian characters.

The first version I had was as follows:

strlen($values['person_name']) != mb_strlen($values['person_name'], 'utf-8')

This worked fine initially, but it stopped working once Icelandic/Czech names came into play.
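
To illustrate why this check breaks down: comparing the byte length with the character length is true for any string that contains a non-ASCII character, so it cannot tell an accented Latin letter from a Chinese one. A minimal sketch, assuming the mbstring extension is loaded (has_multibyte is a hypothetical helper name, not part of the original code):

```php
<?php
// Hypothetical helper: true when the string contains at least one
// multibyte (non-ASCII) character in UTF-8.
function has_multibyte(string $s): bool
{
    return strlen($s) !== mb_strlen($s, 'UTF-8');
}

var_dump(has_multibyte('John Smith')); // bool(false)
var_dump(has_multibyte('José'));       // bool(true), é takes two bytes
var_dump(has_multibyte('Eliška'));     // bool(true), š takes two bytes
var_dump(has_multibyte('書'));         // bool(true), 書 takes three bytes
```

The check answers "is this pure ASCII?", not "is this Latin script?", which is why it rejects Czech and Icelandic names along with Chinese ones.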

The second version I had was as follows:

preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i", $values['person_name'])

This seemed to work fine for the majority of cases, but it fails on this test name:

Eliška Koňaříková
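
A likely reason, sketched below: in UTF-8 mode (the u modifier) the range À-ÿ covers only the code points U+00C0 to U+00FF, and the Czech letters š, ň and ř lie above that range. Without the u modifier the class is interpreted byte by byte, so the range no longer means what it appears to. The helper name in_a_grave_to_y_range is hypothetical; mb_ord() assumes PHP 7.2+ with mbstring:

```php
<?php
// Hypothetical helper: is the character's code point inside À-ÿ (U+00C0..U+00FF)?
function in_a_grave_to_y_range(string $ch): bool
{
    $cp = mb_ord($ch, 'UTF-8');
    return $cp !== false && $cp >= 0xC0 && $cp <= 0xFF;
}

foreach (['é', 'ÿ', 'š', 'ň', 'ř'] as $ch) {
    printf(
        "%s U+%04X %s\n",
        $ch,
        mb_ord($ch, 'UTF-8'),
        in_a_grave_to_y_range($ch) ? 'inside' : 'outside'
    );
}
// é U+00E9 inside
// ÿ U+00FF inside
// š U+0161 outside
// ň U+0148 outside
// ř U+0159 outside
```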

I have tried the following as well without any luck:

preg_match("/[^\w ]/u", $values['person_name'])      //does not allow š
preg_match("/\PL/u", $values['person_name'])      //does not allow š
preg_match("/^[a-zA-Z\s,.'\-\pL]+$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/^[\s,.'-]*\p{L}[\p{L}\s,.'-]*$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý,. ]/u", $values['person_name'])      //allows š, but also allows 書
preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~iu", $values['person_name'])      //does not allow š
preg_match("/^[\p{L}-]*$/u", $values['person_name'])      //allows š, but also allows 書
preg_match("/([\w ]{2,})/u", $values['person_name'])      //allows š, but also allows 書
preg_match('/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u', $values['person_name'])      //allows š, but also allows 書

All of the above either failed with the name provided, or it allowed Chinese characters.

I believe the best route for me would be to revert to the check that worked for most characters (all except the Czech names that fail):

preg_match("~^[a-zÀ-ÿ][\'a-zÀ-ÿ \-]*$~i", $values['person_name'])

And manually add the Czech characters that are not accepted such as š, ň, ř, etc.

Is there a cleaner solution than manually having to specify each of these characters?

Milad Elyasi On

It may be better to transliterate the characters instead. The following is only an example of that approach, not a complete function:

<?php
function replace($str, $options = array())
    {

        // Make sure string is in UTF-8 and strip invalid UTF-8 characters
        $str = mb_convert_encoding((string)$str, 'UTF-8', mb_list_encodings());

        $defaults = array(
            'delimiter' => '',
            'limit' => null,
            'lowercase' => true,
            'replacements' => array(),
            'transliterate' => false,
        );

        // Merge options
        $options = array_merge($defaults, $options);

        $char_map = array(
            // Latin
            'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C',
            'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
            'Ð' => 'D', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ő' => 'O',
            'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ű' => 'U', 'Ý' => 'Y', 'Þ' => 'TH',
            'ß' => 'ss',
            'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c',
            'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
            'ð' => 'd', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ő' => 'o',
            'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ű' => 'u', 'ý' => 'y', 'þ' => 'th',
            'ÿ' => 'y',
            // Latin symbols
            '©' => '(c)',
            // Greek
            'Α' => 'A', 'Β' => 'B', 'Γ' => 'G', 'Δ' => 'D', 'Ε' => 'E', 'Ζ' => 'Z', 'Η' => 'H', 'Θ' => '8',
            'Ι' => 'I', 'Κ' => 'K', 'Λ' => 'L', 'Μ' => 'M', 'Ν' => 'N', 'Ξ' => '3', 'Ο' => 'O', 'Π' => 'P',
            'Ρ' => 'R', 'Σ' => 'S', 'Τ' => 'T', 'Υ' => 'Y', 'Φ' => 'F', 'Χ' => 'X', 'Ψ' => 'PS', 'Ω' => 'W',
            'Ά' => 'A', 'Έ' => 'E', 'Ί' => 'I', 'Ό' => 'O', 'Ύ' => 'Y', 'Ή' => 'H', 'Ώ' => 'W', 'Ϊ' => 'I',
            'Ϋ' => 'Y',
            'α' => 'a', 'β' => 'b', 'γ' => 'g', 'δ' => 'd', 'ε' => 'e', 'ζ' => 'z', 'η' => 'h', 'θ' => '8',
            'ι' => 'i', 'κ' => 'k', 'λ' => 'l', 'μ' => 'm', 'ν' => 'n', 'ξ' => '3', 'ο' => 'o', 'π' => 'p',
            'ρ' => 'r', 'σ' => 's', 'τ' => 't', 'υ' => 'y', 'φ' => 'f', 'χ' => 'x', 'ψ' => 'ps', 'ω' => 'w',
            'ά' => 'a', 'έ' => 'e', 'ί' => 'i', 'ό' => 'o', 'ύ' => 'y', 'ή' => 'h', 'ώ' => 'w', 'ς' => 's',
            'ϊ' => 'i', 'ΰ' => 'y', 'ϋ' => 'y', 'ΐ' => 'i',
            // Turkish
            'Ş' => 'S', 'İ' => 'I', 'Ç' => 'C', 'Ü' => 'U', 'Ö' => 'O', 'Ğ' => 'G',
            'ş' => 's', 'ı' => 'i', 'ç' => 'c', 'ü' => 'u', 'ö' => 'o', 'ğ' => 'g',
            // Russian
            'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D', 'Е' => 'E', 'Ё' => 'Yo', 'Ж' => 'Zh',
            'З' => 'Z', 'И' => 'I', 'Й' => 'J', 'К' => 'K', 'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O',
            'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F', 'Х' => 'H', 'Ц' => 'C',
            'Ч' => 'Ch', 'Ш' => 'Sh', 'Щ' => 'Sh', 'Ъ' => '', 'Ы' => 'Y', 'Ь' => '', 'Э' => 'E', 'Ю' => 'Yu',
            'Я' => 'Ya',
            'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ё' => 'yo', 'ж' => 'zh',
            'з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o',
            'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c',
            'ч' => 'ch', 'ш' => 'sh', 'щ' => 'sh', 'ъ' => '', 'ы' => 'y', 'ь' => '', 'э' => 'e', 'ю' => 'yu',
            'я' => 'ya',
            // Ukrainian
            'Є' => 'Ye', 'І' => 'I', 'Ї' => 'Yi', 'Ґ' => 'G',
            'є' => 'ye', 'і' => 'i', 'ї' => 'yi', 'ґ' => 'g',
            // Czech
            'Č' => 'C', 'Ď' => 'D', 'Ě' => 'E', 'Ň' => 'N', 'Ř' => 'R', 'Š' => 'S', 'Ť' => 'T', 'Ů' => 'U',
            'Ž' => 'Z',
            'č' => 'c', 'ď' => 'd', 'ě' => 'e', 'ň' => 'n', 'ř' => 'r', 'š' => 's', 'ť' => 't', 'ů' => 'u',
            'ž' => 'z',
            // Polish
            'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N', 'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z',
            'Ż' => 'Z',
            'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n', 'ó' => 'o', 'ś' => 's', 'ź' => 'z',
            'ż' => 'z',
            // Latvian
            'Ā' => 'A', 'Č' => 'C', 'Ē' => 'E', 'Ģ' => 'G', 'Ī' => 'I', 'Ķ' => 'K', 'Ļ' => 'L', 'Ņ' => 'N',
            'Š' => 'S', 'Ū' => 'U', 'Ž' => 'Z',
            'ā' => 'a', 'č' => 'c', 'ē' => 'e', 'ģ' => 'g', 'ī' => 'i', 'ķ' => 'k', 'ļ' => 'l', 'ņ' => 'n',
            'š' => 's', 'ū' => 'u', 'ž' => 'z'
        );

        // Make custom replacements
        $str = preg_replace(array_keys($options['replacements']), $options['replacements'], $str);

        // Transliterate characters to ASCII
        if ($options['transliterate']) {
            $str = str_replace(array_keys($char_map), $char_map, $str);
        }

        // Replace non-alphanumeric characters with our delimiter
        $str = preg_replace('/[^\p{L}\p{Nd}]+/u', $options['delimiter'], $str);

        // Remove duplicate delimiters
        $str = preg_replace('/(' . preg_quote($options['delimiter'], '/') . '){2,}/', '$1', $str);

        // Truncate slug to max. characters
        $str = mb_substr($str, 0, ($options['limit'] ? $options['limit'] : mb_strlen($str, 'UTF-8')), 'UTF-8');

        // Remove delimiter from ends
        $str = trim($str, $options['delimiter']);

        return $options['lowercase'] ? mb_strtolower($str, 'UTF-8') : $str;
    }
ThW On

preg_match() supports Unicode script properties. Building the pattern up step by step:

  • Latin script: \p{Latin}
  • At least one char: \p{Latin}+
  • Anchor to string start/end: ^\p{Latin}+$
  • Pattern delimiters: (^\p{Latin}+$)
  • Disallow linefeed at string end: (^\p{Latin}+$)D
  • Unicode (UTF-8) mode: (^\p{Latin}+$)Du
$values = ['English', 'አማርኛ', 'Anarâškielâ', 'अंगिका', 'Аԥсшәа', 'Aragonés', 'অসমীয়া'];

foreach ($values as $value) {
  $matched = preg_match('(^\\p{Latin}+$)Du', $value);
  echo $value, ' ', ($matched ? '✔️' : '❌'), "\n";
}

Output:

English ✔️
አማርኛ ❌
Anarâškielâ ✔️
अंगिका ❌
Аԥсшәа ❌
Aragonés ✔️
অসমীয়া ❌
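
For person names specifically, the same idea can be extended to the separators that the question's earlier patterns allowed (space, apostrophe, hyphen). This is only a sketch under assumptions: is_latin_name is a hypothetical helper name, the set of allowed separators is a guess at the requirements, and \p{M} is added so that names typed with combining accents are accepted too:

```php
<?php
// Hypothetical helper: accept names that start with a Latin-script letter
// and continue with Latin letters, combining marks, spaces, apostrophes
// or hyphens. Returns false for invalid UTF-8 as well.
function is_latin_name(string $name): bool
{
    return preg_match('~^\p{Latin}[\p{Latin}\p{M}\' -]*$~Du', $name) === 1;
}

$names = ['Eliška Koňaříková', "O'Brien", 'Anne-Marie', '書', 'Иван', 'Eliška 書'];

foreach ($names as $name) {
    echo $name, ' ', (is_latin_name($name) ? 'ok' : 'rejected'), "\n";
}
```

Here the first three names are accepted and the last three are rejected, since Han and Cyrillic characters are not part of the Latin script.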

For transliteration, check the Transliterator class. It is part of PHP's standard Unicode extension, ext/intl, and allows extensive transformations of Unicode strings.

$transliterator = \Transliterator::create('Any-Latin'); 
var_dump($transliterator->transliterate('አማርኛ Anarâškielâ अंगिका Аԥсшәа Aragonés অসমীয়া'));
$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII'); 
var_dump($transliterator->transliterate('አማርኛ Anarâškielâ अंगिका Аԥсшәа Aragonés অসমীয়া'));

Output:

string(69) "አማርኛ Anarâškielâ aṅgikā Aԥsšəa Aragonés asamīẏā"
string(57) "አማርኛ Anaraskiela angika Aԥssəa Aragones asamiya"

The first (untransformed) word in the example is Amharic. Even ICU has limits, depending on the version.

More about the ICU Script Transliterations: https://unicode-org.github.io/icu/userguide/transforms/general/#scriptlanguage

Casimir et Hippolyte On

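You can use a PCRE2 script run, here in its atomic form (*asr:...), which requires every character it matches to belong to the same script (characters shared between scripts, such as spaces and combining marks, are allowed). The lookahead pins that script to Latin: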

$pattern = '~
  \A
    (?=\p{Latin}) 
    (*asr: [\pL\pM ]+)
  \z
~xu';

$tests = [
    'Das Wohltemperierte Klavier',
    'Добре темперирано пиано',
    'Le Clavier bien tempére'."\xcc\x81", // <-- combining acute
    'Il clavicembalo ben temperato',
    '平均律クラヴィーア曲集',
    'El clavecí ben temprat',
    'เดอะเวลล์-เทมเพิร์ดคลาเวียร์',
    'Eliška Koňaříková',
    'Eliška Koňaříková 1 2 3',
    
];

print_r(preg_grep($pattern, $tests));
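
Script runs are a relatively recent PCRE2 feature (to my knowledge added in 10.33), so an older PHP build may refuse to compile the pattern. A small sketch for checking this at runtime; pcre_supports_script_runs is a hypothetical helper name, and it relies on preg_match() returning false when a pattern does not compile:

```php
<?php
// Hypothetical helper: report whether the PCRE library that PHP is linked
// against accepts the (*asr:...) syntax.
function pcre_supports_script_runs(): bool
{
    // On a compile error preg_match() returns false and raises a warning,
    // which is silenced here with @.
    return @preg_match('~(*asr:a)~u', 'a') === 1;
}

echo 'PCRE version: ', PCRE_VERSION, "\n";
echo 'Script runs supported: ', (pcre_supports_script_runs() ? 'yes' : 'no'), "\n";
```

If the check reports no, a plain \p{Latin}-based pattern remains an option.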
