I am having trouble with how R handles characters from different languages. I have a multilanguage data set (PL, HR, EN, FR, GE, IT) and created a keyword string to filter it. However, R does not keep all of my characters as typed; it silently converts some of them to other characters, which breaks the filter.
For example, if I want to look for the word "łapać" in my data, R filters for "lapac" instead and finds nothing, because the data set itself was read correctly and still contains the original word:
catch <- "łapać"
catch
[1] "lapac"
I tried different things, and for some characters/languages it works. For example:
things <- "ćłßöüžỳđčšśęıчуй"
things
[1] "clßöüžỳdcšseiчуй"
As you can see, some characters are displayed as they should be (ö, ü, ž and even the Cyrillic ones like ч or й), while others are converted (ćł to cl).
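One way to check whether the conversion happens when the literal is typed is to write the keyword with `\u` escapes, which R stores as UTF-8 regardless of the session's native code page. This is only a minimal sketch: the vector `words` is a hypothetical stand-in for the real data, and it assumes the data itself is already held as UTF-8.

```r
# Hypothetical stand-in for the real data column.
words <- c("\u0142apa\u0107", "lapac", "catch")

# Keyword written with Unicode escapes instead of typed characters:
# U+0142 is the Polish l with stroke, U+0107 is c with acute.
catch <- "\u0142apa\u0107"

# Inspect the code points to confirm nothing was replaced by plain ASCII.
code_points <- utf8ToInt(catch)
print(code_points)

# Literal (non-regex) match against the data.
hits <- grepl(catch, words, fixed = TRUE)
print(words[hits])
```

If the escaped keyword matches while the typed one does not, the characters are most likely being replaced when the script or console input is converted to the native encoding, not when the data is read.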
I tried reopening the document with a different encoding and changing the encoding:
options(encoding = "utf-8")
Encoding(things) <- "UTF-8"
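A short sketch of why re-marking the string may not help: `Encoding<-` only changes the declared encoding of a string and leaves its bytes untouched. So if "ł" has already been stored as a plain "l", there is nothing left to recover. The variable names here are illustrative only.

```r
# A UTF-8 string built from escapes, so the bytes are known.
x <- "\u0142apa\u0107"
bytes_before <- charToRaw(x)

# Changing the declared encoding does not rewrite the bytes.
Encoding(x) <- "latin1"
bytes_after <- charToRaw(x)
same_bytes <- identical(bytes_before, bytes_after)
print(same_bytes)

# Restore the correct mark.
Encoding(x) <- "UTF-8"

# A string that was already turned into ASCII holds only ASCII bytes,
# so no encoding mark can bring the original characters back.
converted <- "lapac"
print(charToRaw(converted))
```

Running `charToRaw()` on the affected variable shows which case applies: bytes above `7f` mean the original characters are still there, while pure ASCII bytes mean they were lost before any encoding setting could act.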
Also, I tried it with different R versions on two different Windows computers.
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
locale:
[1] LC_COLLATE=German_Germany.1250 LC_CTYPE=German_Germany.1250
[3] LC_MONETARY=German_Germany.1250 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1250
When running
it works! See:
However not perfectly, as seen in
where the Turkish ı is still an i. Since I won't use Turkish, that's fine for me so far.
Thank you!