I'm trying to transliterate text in PHP: I need to convert all non-Latin characters to Latin while keeping the Italian accented characters (àèìòù).
PHP's Transliterator class lacks documentation and online examples.
I've read the ICU docs and I know that a rule can force Transliterator to convert one character into another of our choosing (à > b).
The code (using the create function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
echo $transliterator->transliterate($str);
converts all non-Latin characters into Latin (including all the accented characters) and gives the result
ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
and the code (using the createFromRules function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::createFromRules("à > b");
echo $transliterator->transliterate($str);
correctly forces the conversion of à into b but, obviously, without the Any-Latin; Latin-ASCII conversion performed by the previous code, giving the result
AŠAbèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像
So my goal is to merge the Any-Latin; Latin-ASCII conversion with the à > à rule (and the same rule for the other Italian accented vowels), telling Transliterator to convert all non-Latin characters to Latin but to map the Italian accented vowels to themselves, with the following result:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Is there a way to put the à > à rule in the create function's parameter, or to add the Any-Latin; Latin-ASCII directive to the createFromRules function's parameter?
Given your example input and output:
applying the transliteration only to the segments that do not contain the characters you want to keep (the Italian accented characters [àèìòù]) should produce the desired result.
One option is to use preg_replace_callback for that. It requires a callback that applies the transliteration.
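A minimal sketch of such a callback, assuming the intl extension is loaded. The variable names ($transliterator, $callback, $result) are illustrative, and the sample string is only there to show the shape of the call:

```php
<?php
// Requires the intl extension, which provides the Transliterator class.
$transliterator = Transliterator::create('Any-Latin; Latin-ASCII');

// preg_replace_callback() hands the callback an array of matches;
// index 0 holds the full matched segment, which is what gets transliterated.
$callback = function (array $matches) use ($transliterator): string {
    return $transliterator->transliterate($matches[0]);
};

// Calling the callback by hand with a fake match array, just to show its shape.
$result = $callback(['München Финиш']);
echo $result, "\n";
```

The closure captures the transliterator once, so it is not rebuilt for every matched segment.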
It also requires a pattern that matches everything except the characters to keep. The pattern must be defined carefully and be Unicode-compatible (the u modifier).
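One way to write such a pattern is a negated character class over the code points of the five vowels. This is a sketch under the assumption that only àèìòù are to be kept and that the source file is saved as UTF-8; the test string is made up for illustration:

```php
<?php
// Negated class: one or more characters that are NOT à, è, ì, ò or ù.
// In a single-quoted string the \xE0 escapes reach PCRE unchanged, and with
// the u modifier PCRE reads them as code points (U+00E0 is à, and so on).
// In a double-quoted string PHP would emit raw bytes instead, which are
// not valid UTF-8 on their own.
$pattern = '/[^\xE0\xE8\xEC\xF2\xF9]+/u';

preg_match_all($pattern, 'più caffè Финиш', $matches);

// prints ["pi"," caff"," Финиш"]
echo json_encode($matches[0], JSON_UNESCAPED_UNICODE), "\n";
```

Each matched segment is what would later be passed to the transliteration callback; the ù and è fall between the segments and are left alone.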
Last but not least, the subject must represent the characters to keep the same way the pattern does. Because Unicode allows several ways to write the same character (for example, à as the single code point U+00E0 or as a followed by the combining grave accent U+0300), the input is normalized so that it matches the PCRE pattern.
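Putting the three pieces together could look like the sketch below. It assumes the intl extension (for both Normalizer and Transliterator) and reuses the input string from the question; the variable names are illustrative:

```php
<?php
// Requires the intl extension (Normalizer and Transliterator).
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";

$transliterator = Transliterator::create('Any-Latin; Latin-ASCII');

// Transliterate one matched segment (index 0 is the full match).
$callback = function (array $matches) use ($transliterator): string {
    return $transliterator->transliterate($matches[0]);
};

// Everything except à, è, ì, ò, ù (code points, hence the u modifier).
$pattern = '/[^\xE0\xE8\xEC\xF2\xF9]+/u';

// Normalizer::normalize() defaults to form C (NFC), which composes
// "a" + U+0300 into the single code point U+00E0 used in the pattern.
$subject = Normalizer::normalize($str);

$result = preg_replace_callback($pattern, $callback, $subject);
echo $result, "\n";
```

Without the normalization step, a decomposed à would be seen by the pattern as a plain a plus a combining mark, and the transliteration would strip the accent.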
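The output (the result requested in the question):
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang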
Example across PHP versions.
Addendum:
\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA : lower-case list of Italian accented characters (can be used with the i-modifier)
\xC0\xC1\xC8\xC9\xCC\xCD\xD2\xD3\xD9\xDA\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA : lower- and upper-case list of Italian accented characters (can be used without the i-modifier)
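As an illustration of the first list, the sketch below plugs it into a negated character class together with the i and u modifiers, so that upper-case variants such as È are kept too. The test string is made up, and the snippet assumes a UTF-8 source file:

```php
<?php
// Lower-case list of grave and acute vowels; the i modifier extends the
// class to the upper-case forms, the u modifier enables Unicode matching.
$pattern = '/[^\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA]+/iu';

preg_match_all($pattern, 'È già qui', $matches);

// The segments that would be transliterated; È and à are not part of them.
echo json_encode($matches[0], JSON_UNESCAPED_UNICODE), "\n";
```

The second, longer list achieves the same effect without the i modifier, which avoids making the rest of the pattern case-insensitive.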