R: Transform Cyrillic Unicode to Latin Text

I have some Unicode text in Cyrillic (the language is Serbian), gathered from a website with R and Selenium.

A sample of the Unicode text looks like this:

<U+041A><U+0440><U+0430><U+0433><U+0443><U+0458><U+0435><U+0432><U+0430><U+0446> <U+0410><U+0421>

Further, I have the text gathered as a table, where the above Unicode text would be a single row, while other columns/rows may already be in the Latin alphabet.

I have been at this for hours, and am trying to:

  1. Either transform the Unicode codes to Cyrillic, or
  2. Transform the Unicode codes directly to the Latin alphabet

My latest attempt used the stringi package, but it did not work: stringi::stri_trans_general(Table_save,"latin-ascii")

Any advice would be greatly appreciated! Thank you

There is 1 best solution below

I hope my solution helps. First, as you probably know, U+041A is a Unicode code point written in hexadecimal. I want to emphasize that, because I think it is a bad idea to convert these codes to Cyrillic letters. What I think is best is to work with your text through the hexadecimal code points. In other words, keep the code points of the letters in mind, not the letters themselves, when working with the text.

This way, it is going to be easier to run regex and other transformations on your text. When you want to read your text as Cyrillic, you just need to ask R to interpret your vector of code points as UTF-8 text, through a function like intToUtf8().
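As a small illustration of that idea (a sketch, not part of the original solution), code points can be kept as plain numbers and rendered as text only when needed. The variable names are made up for this example; intToUtf8(), utf8ToInt() and sprintf() are base R functions.

```r
# Code points for the first three letters of the sample, kept as numbers
code_points <- c(0x041A, 0x0440, 0x0430)

# Render them as one UTF-8 string ...
word <- intToUtf8(code_points)

# ... or as one string per code point
single_chars <- intToUtf8(code_points, multiple = TRUE)

# Going the other way: a plain space is code point 32, written U+0020
space_cp <- utf8ToInt(" ")
space_label <- sprintf("U+%04X", space_cp)
```

Regex and other transformations can then run on the numeric or escaped form, with intToUtf8() applied only at the end.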

The first thing you need to do is separate each Cyrillic word. So you detect each white space in your text and then substitute that space with its respective Unicode code (yes, even white spaces have a Unicode code point). After that, you need to separate each letter (before I was separating each word, now I want to separate each letter, or character, that forms your phrase).

Next, I need to eliminate the other metacharacters (> and +) and leave only the hexadecimal code in each element of vector a. After that, I just substitute each letter U with 0x, to isolate the hexadecimal part of the code point. This way is easier because, to read the code U041A as a Unicode escape, I would need to insert a single backslash before the U (resulting in \U041A), and I was struggling to do that. After these steps, each element of vector a is one character (or letter) of your phrase.

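library(stringr)

text <- "<U+041A><U+0440><U+0430><U+0433><U+0443><U+0458><U+0435><U+0432><U+0430><U+0446> <U+0410><U+0421>"

# Replace each white space with its own code point so it survives the split
a <- str_replace_all(text, " ", replacement = "<U+0020>")

# Split on "<" so each element holds one code point, then drop the empty first element
a <- unlist(str_split(a, "[<]"))
a <- a[-1]

# Strip ">" and "+", then turn the leading "U" into "0x" (e.g. "U+041A>" becomes "0x041A")
a <- str_replace_all(a, ">", "")
a <- str_replace_all(a, "\\+", "")
a <- str_replace_all(a, "U", "0x")

# as.integer() accepts hexadecimal strings starting with "0x"
intToUtf8(as.integer(a))

[1] "Крагујевац АС"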
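
Since the question mentions a table where only some cells are escaped, the same steps can be wrapped into a function and applied column by column. The following is a minimal base R sketch under stated assumptions: decode_u_escapes() and the tbl data frame are hypothetical names invented here, and the optional last step assumes the stringi package is installed and that its "cyrillic-latin" transform is suitable for Serbian.

```r
# Hypothetical helper, not from the original answer: decode every "<U+XXXX>"
# escape in a character vector and leave all other text untouched.
decode_u_escapes <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hex <- gsub("^<U\\+|>$", "", tokens)  # "<U+041A>" becomes "041A"
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

# Made-up table mixing escaped Cyrillic and plain Latin text
tbl <- data.frame(
  name = c(
    "<U+041A><U+0440><U+0430><U+0433><U+0443><U+0458><U+0435><U+0432><U+0430><U+0446> <U+0410><U+0421>",
    "Beograd"
  ),
  code = c("KG", "BG"),
  stringsAsFactors = FALSE
)

# Decode every character column; cells without escapes pass through unchanged
is_chr <- vapply(tbl, is.character, logical(1))
tbl[is_chr] <- lapply(tbl[is_chr], decode_u_escapes)

# Optional step (assumption): transliterate to Latin when stringi is available
if (requireNamespace("stringi", quietly = TRUE)) {
  tbl$name_latin <- stringi::stri_trans_general(tbl$name, "cyrillic-latin")
}
```

The "latin-ascii" transform tried in the question only simplifies text that is already Latin, so it leaves Cyrillic letters alone. The exact spelling produced by "cyrillic-latin" depends on the ICU rules bundled with stringi; stringi::stri_trans_list() shows which transform identifiers are available.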