I have some HTML files (I'm working with them as plain text) that use decimal NCRs to encode special characters. Is there a convenient way to convert them to Unicode using R?
NCR codes do not seem to have a one-to-one match with Unicode, which becomes quite confusing, since &#1123; (ѣ) is equal not to \u1123, but to \u0463:
> stri_unescape_unicode("\u1123")
[1] "ᄣ"
and
> stri_unescape_unicode("\u0463")
[1] "ѣ"
1123 is the decimal equivalent of the hexadecimal 0463, and Unicode escapes use hexadecimal. So in order to get a conversion, you need to strip out the non-digit characters, convert the digits to hex characters, stick a "\u" in front of them, then use stri_unescape_unicode. This function will do all that:
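A minimal sketch of such a function, assuming the stringi package is installed. The name `ncr2uni` is hypothetical, and the four-digit `\u` escape assumes code points no larger than 0xFFFF:

```r
library(stringi)

# Hypothetical helper: convert decimal NCRs such as "&#1123;" to characters.
# Assumes each element of x holds exactly one NCR with a code point <= 0xFFFF.
ncr2uni <- function(x) {
  digits <- gsub("[^0-9]", "", x)               # strip "&#" and ";"
  hex <- sprintf("%04x", as.integer(digits))    # decimal -> zero-padded hex
  stri_unescape_unicode(paste0("\\u", hex))     # "\u0463" -> "ѣ"
}
```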
Now you can do
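As one possible usage, assuming the goal is to decode every decimal NCR inside a line of text, the sketch below is self-contained and swaps in base R's `intToUtf8` for the hex-escape route. `decode_ncr` is a hypothetical name:

```r
# Hypothetical helper: replace every decimal NCR in a character vector
# with the character it encodes, leaving the rest of the text untouched.
decode_ncr <- function(text) {
  m <- gregexpr("&#[0-9]+;", text)              # locate all decimal NCRs
  regmatches(text, m) <- lapply(regmatches(text, m), function(refs) {
    codes <- as.integer(gsub("[^0-9]", "", refs))   # keep only the digits
    vapply(codes, intToUtf8, character(1))          # code point -> character
  })
  text
}

decode_ncr("&#1123; and &#1122;")
```

Because `intToUtf8` takes the decimal code point directly, no hexadecimal conversion is needed in this variant.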