Why do characters from foreign alphabets not show in my wordcloud on R?

I am trying to create a wordcloud with 'Thank You' in different languages. However, some characters don't show up on the plot.

library(ggplot2)
library(ggwordcloud)  # provides geom_text_wordcloud() and thankyou_words_small

set.seed(42)
data("thankyou_words_small")
ggplot(
  thankyou_words_small,
  aes(label = word, size = speakers, color = speakers)) +
  geom_text_wordcloud(area_corr = TRUE, rm_outside = TRUE) +
  scale_size_area(max_size = 24) +
  theme_minimal() +
  scale_color_gradient(low = "darkblue", high = "lightblue")

[image: the resulting word cloud, with some characters missing]

I tried using ggwordcloud instead of ggplot, which didn't work either. Please keep your answers simple as I am just a beginner using R. Thank you :)

There are 1 best solutions below

Alright, since you're using Mac, I think I can give you the answer you're looking for!

This will require a bit of finesse on your end because I've installed a lot of fonts. So what I see is unlikely to be exactly what you see. (You won't need to install fonts for this to work.)

First, I'm using the library showtext. If you don't have that installed, you'll need it.

This first function call is where you may see something different than what I see.

library(showtext)
library(ggwordcloud)
library(tidyverse)

(where <- font_files()[which(str_detect(font_files()$family, "Arial Unicode MS")), ])
#                                  path              file           family
# 1                      /Library/Fonts Arial Unicode.ttf Arial Unicode MS
# 74 /System/Library/Fonts/Supplemental Arial Unicode.ttf Arial Unicode MS
#       face       version        ps_name
# 1  Regular Version 1.01x ArialUnicodeMS
# 74 Regular Version 1.01x ArialUnicodeMS 

As you can see this returned two rows for me. I'm going to call for only the first row since these two lines are identical.

If your call returns only one row, you can drop the [1, ]. If it returns many rows, just make sure that the row you select has "Regular" in the face column.

# add the font to the workspace
font_add(family = where[1, ]$family, regular = where[1, ]$file)

# if only returned one line use this instead
font_add(family = where$family, regular = where$file)
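If you'd rather not count rows by eye, here is a rough sketch of picking the first "Regular" row in code. The data frame below is only a stand-in that mimics the font_files() output shown above (the "Arial Bold.ttf" row is made up for illustration), and pick_regular() is a hypothetical helper, not something from showtext or sysfonts.

```r
# Stand-in for font_files(); the real output differs on every machine.
# The third row is invented just to show that non-matching rows are skipped.
fonts <- data.frame(
  path   = c("/Library/Fonts", "/System/Library/Fonts/Supplemental", "/Library/Fonts"),
  file   = c("Arial Unicode.ttf", "Arial Unicode.ttf", "Arial Bold.ttf"),
  family = c("Arial Unicode MS", "Arial Unicode MS", "Arial"),
  face   = c("Regular", "Regular", "Bold"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first row of `fonts` whose family matches
# and whose face is "Regular"; stop with a message if there is none.
pick_regular <- function(fonts, family) {
  keep <- !is.na(fonts$family) & fonts$family == family &
    !is.na(fonts$face) & fonts$face == "Regular"
  hits <- fonts[keep, , drop = FALSE]
  if (nrow(hits) == 0) stop("No Regular face found for family: ", family)
  hits[1, , drop = FALSE]
}

picked <- pick_regular(fonts, "Arial Unicode MS")

# Build the full path to the font file from the path and file columns
font_file <- file.path(picked$path, picked$file)
font_file
```

With the real table you would presumably call pick_regular(font_files(), "Arial Unicode MS") and then pass picked$family and the full path to font_add(). Passing the full path should avoid relying on the font folder being searched automatically.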

To use this font, call either showtext_begin() or showtext_auto(). I really have not seen any difference between the two. Then call the plot. When you do, you need to include the family in geom_text_wordcloud(). I used where[1, ]$family, but you can just copy the string, too.

showtext_auto()

data("thankyou_words_small")

ggplot(thankyou_words_small, 
       aes(label = word, size = speakers, color = speakers)) + 
  geom_text_wordcloud(area_corr = TRUE, rm_outside = TRUE,
                      family = where[1, ]$family) +
  scale_size_area(max_size = 24) + 
  scale_color_gradient(low = "darkblue", high = "lightblue") + 
  theme_minimal()

[image: the word cloud rendered with the Arial Unicode MS font]

The showtext documentation says that you're supposed to close it with either showtext_end() or showtext_auto(FALSE). However, I've never had an issue when I forgot or left it out intentionally.

There are some other errors in this data. For example, 'shukran', or thanks in Modern Standard Arabic, is شكرا. However, the plot shows the text backward. In Pashto, 'manana' is usually used for thank you (literally, it means acceptance), which is written مننه. That's DEFINITELY not what's in this dataset. It's probably backward. (As it stands, it's not a word in Pashto.)

I thought the reversal was due to the text direction (these scripts are written right to left), but Hindi is correct, Japanese is correct, Gujarati is correct... Urdu is incorrect. Sigh. I couldn't find a font that did this any better. I found a way to flip the words, but Farsi is still incorrect: the letters are drawn one at a time instead of joined together. For example, مننه if written one letter at a time is م ن ن ه. That's what's happening with Farsi. (Farsi == Persian)
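If you want to check whether a word is stored backward rather than just drawn backward, one option is to look at its code points. This is only a sketch in base R: reverse_codepoints() and show_codepoints() are hypothetical helpers written for illustration, and the example string is 'shukran' typed in reading order (sheen, kaf, reh, alef), not a value taken from the dataset.

```r
# 'shukran' in reading order: sheen, kaf, reh, alef
shukran <- "\u0634\u0643\u0631\u0627"

# Hypothetical helper: reverse a single string code point by code point
reverse_codepoints <- function(x) intToUtf8(rev(utf8ToInt(x)))

# Hypothetical helper: list the stored code points as U+XXXX labels
show_codepoints <- function(x) sprintf("U+%04X", utf8ToInt(x))

# Stored order of the correctly typed word (starts with sheen, U+0634)
show_codepoints(shukran)

# Stored order after reversing (starts with alef, U+0627)
show_codepoints(reverse_codepoints(shukran))
```

If a word in the data starts with the letter that should come last, it is probably stored in reverse. Note that reversing code points like this can misplace combining marks, so treat it as a quick check and not a general fix.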

Here's the manual correction for these words:

flips <- c(4, 14, 16, 30, 31, 34)
tyws <- thankyou_words_small
tyws[flips, ]$word <- stringi::stri_reverse(tyws[flips, ]$word)
set.seed(42)
ggplot(tyws, 
       aes(label = word, size = speakers, color = speakers)) + 
  geom_text_wordcloud(area_corr = TRUE, rm_outside = TRUE,
                      family = where[1, ]$family) +
  scale_size_area(max_size = 24) + 
  scale_color_gradient(low = "darkblue", high = "lightblue") + 
  theme_minimal()

Here's the difference:

[image: the word cloud before and after flipping the words]
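The row numbers in flips are hard-coded, so they may not match if your copy of the data is ordered differently. As a rough sketch, you could list the rows that contain Arabic-script letters and then check those by eye. The data frame here is a small stand-in (not the real thankyou_words_small), and has_arabic_script() is a hypothetical helper. It only tells you which rows to look at, not which ones are actually backward.

```r
# Stand-in data; the real thankyou_words_small has more rows and columns
words <- data.frame(
  word = c("thank you", "\u0634\u0643\u0631\u0627", "merci", "\u0645\u0646\u0646\u0647"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each string that has at least one code point
# in the Unicode Arabic block (U+0600 to U+06FF)
has_arabic_script <- function(x) {
  vapply(x, function(w) {
    cp <- utf8ToInt(w)
    any(cp >= 0x0600 & cp <= 0x06FF)
  }, logical(1), USE.NAMES = FALSE)
}

# Row numbers worth checking by eye before reversing anything
candidates <- which(has_arabic_script(words$word))
candidates
```

On the real data you would run has_arabic_script(thankyou_words_small$word), look at those rows, and only put the ones that are truly reversed into flips.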