Detecting International Phonetic Alphabet (IPA) symbols / character blocks from word strings using R


I am trying to figure out the best way to split strings (words) into individual phones using R, but I haven't been able to come up with a good solution. I am aware that one solution would be to use the gruut-ipa module, but I cannot shake the feeling that there is a simple way to do this in R that I just cannot figure out.

IPA symbols consist of multiple combining and non-combining characters. IPA symbol structure (image from the gruut-ipa GitHub repository).

I am using Panphon data as the basis for the IPA characters. The full list consists of 6,487 entries.

example_sample <- c("ʔpoɣʔe","mtoto","nukapːiaʁaq","boobal","tamaru")
example_ipa <- c("ḁː","b͡dːˤ","k","k͡pˠ","ʁ","o","ʔ","pː","p")

The goal is to recognise and split the words into individual phones, so in these examples "nukapːiaʁaq" should become "n_u_k_a_pː_i_a_ʁ_a_q" instead of "n_u_k_a_p_ː_i_a_ʁ_a_q" (so the split must recognise multi-character phones, not just single characters).

I have been experimenting with purrr, stringr, and stringi but haven't found an approach that yields good results.


There are 1 best solutions below

DPH On

I am not sure whether this solves the task, as unfortunately I am not familiar with the specifics of IPA symbols.

# dplyr is loaded only for the %>% pipe, to chain the function calls conveniently
library(dplyr)
# insert an underscore around every character
gsub(pattern = "*", replacement = "_", example_sample) %>% 
    # remove leading and trailing underscores
    gsub(pattern = "^_|_$", replacement = "") %>% 
    # undo the split before the length mark ː by removing the preceding underscore
    gsub(pattern = "_ː", replacement = "ː")

[1] "ʔ_p_o_ɣ_ʔ_e"          "m_t_o_t_o"            "n_u_k_a_pː_i_a_ʁ_a_q" "b_o_o_b_a_l"          "t_a_m_a_r_u"   

Now, there might be other "special symbols" in the character set you are using (I suspect that is the case since you have two sets), which you possibly want to include in the last step (you need a capture group to refer to in the replacement part):

gsub(pattern = "*", replacement = "_", example_ipa) %>% 
    gsub(pattern = "^_|_$", replacement = "") %>% 
    # with the "or" operator | you can chain symbols; the parentheses form the capture group referenced as \\1
    # I had to insert a space after the second special symbol (the tie bar) so that it displays properly - be sure to remove it if it shows up
    gsub(pattern = "_(ː|͡ )", replacement = "\\1")

[1] "a_̥ː"    "b͡ _dː_ˤ" "k"      "k͡ _p_ˠ"  "ʁ"      "o"      "ʔ"      "pː"     "p" 
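Instead of listing every special symbol by hand, the same idea can be sketched with Unicode character classes. This is only a sketch under some assumptions: it uses the stringi package (ICU regex), treats every combining mark (`\p{M}`) and modifier letter (`\p{Lm}`) as belonging to the preceding base character, and treats the tie bars U+0361 and U+035C as joining two base characters. The function name `split_phones` is made up for illustration. Note that stress marks such as ˈ are also modifier letters and would be attached to the preceding phone, so they would need separate handling.

```r
library(stringi)

# A base character is anything that is not a combining mark or modifier letter
base <- "[^\\p{M}\\p{Lm}]"
# A mark is a combining mark or modifier letter, except the tie bars
mark <- "(?:(?![\\u0361\\u035C])[\\p{M}\\p{Lm}])"
# Tie bars (above and below) join two base characters into one phone
tie  <- "[\\u0361\\u035C]"

# One unit: a base character followed by any number of marks
unit <- paste0(base, mark, "*")
# One phone: a unit, optionally continued by tie bar + unit
phone_pattern <- paste0(unit, "(?:", tie, unit, ")*")

# Hypothetical helper: split each word into phones and join them with "_"
split_phones <- function(words) {
  phones <- stri_extract_all_regex(words, phone_pattern)
  vapply(phones, paste, character(1), collapse = "_")
}

split_phones("nukapːiaʁaq")
# "n_u_k_a_pː_i_a_ʁ_a_q"
```

With this pattern the entries of example_ipa that contain a tie bar should stay in one piece, because the tie bar is never consumed as an ordinary mark.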

There seem to be specific regex symbols for the IPA charset: \\b, though I am not sure if and how this is implemented in R, since \\b is already reserved for word boundaries from what I understand.
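Since the question already has a list of valid phones (the Panphon entries), another option is to match against that list directly, trying the longest entries first. The following is a minimal sketch, assuming the stringi package is available and that the inventory is a plain character vector like example_ipa; the function name `split_by_inventory` is made up for illustration. Both the words and the inventory are normalised to NFD so that precomposed and decomposed spellings of the same symbol compare equal, and a trailing `.` keeps any character missing from the inventory as its own token rather than dropping it.

```r
library(stringi)

# Hypothetical helper: greedy longest-match split against a phone inventory
split_by_inventory <- function(words, inventory) {
  # Normalise both sides so that e.g. a precomposed and a decomposed "ḁ" match
  words <- stri_trans_nfd(words)
  inventory <- unique(stri_trans_nfd(inventory))
  inventory <- inventory[nzchar(inventory)]
  # Longest entries first: regex alternation takes the first alternative
  # that matches, so "pː" must come before "p"
  inventory <- inventory[order(stri_length(inventory), decreasing = TRUE)]
  # \Q...\E quotes each entry literally; "." is the single-character fallback
  pattern <- paste0(paste0("\\Q", inventory, "\\E", collapse = "|"), "|.")
  phones <- stri_extract_all_regex(words, pattern)
  vapply(phones, paste, character(1), collapse = "_")
}

example_sample <- c("ʔpoɣʔe", "mtoto", "nukapːiaʁaq", "boobal", "tamaru")
example_ipa <- c("ḁː", "b͡dːˤ", "k", "k͡pˠ", "ʁ", "o", "ʔ", "pː", "p")
split_by_inventory(example_sample, example_ipa)
```

An alternation with several thousand entries may be slow on large word lists, so this is meant as a starting point rather than a tuned solution.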