I am trying to figure out the best way to split strings (words) into individual phones using R, but I haven't been able to come up with a good solution. I am aware that one solution would be to use the gruut-ipa module, but I cannot shake the feeling that there is a simple way to do this in R which I just cannot figure out.
IPA symbols consist of multiple combining and non-combining characters. IPA symbol structure (image from the gruut-ipa GitHub repository).
I am using the Panphon data as the basis for the IPA characters. The full list consists of 6,487 entries.
example_sample <- c("ʔpoɣʔe","mtoto","nukapːiaʁaq","boobal","tamaru")
example_ipa <- c("ḁː","b͡dːˤ","k","k͡pˠ","ʁ","o","ʔ","pː","p")
The goal is to recognise and split the words into individual phones, so in these examples "nukapːiaʁaq" should become "n_u_k_a_pː_i_a_ʁ_a_q" instead of "n_u_k_a_p_ː_i_a_ʁ_a_q" (that is, a phone may span more than one character).
I have been experimenting with purrr, stringr, and stringi but haven't found an approach that yields good results.
I am not sure whether this solves the task; unfortunately, I am not familiar with the specifics of IPA symbols.
Now, there might be other "special symbols" in the character set you are using (I suspect that is the case, since you have two sets), which you may want to include in the last step (you need a capture group to refer to in the replacement part):
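A minimal sketch of that idea, assuming the "special symbols" are Unicode modifier letters (`\p{Lm}`, which covers "ː", "ˠ" and "ˤ") and combining marks (`\p{M}`), and that tie bars are U+035C or U+0361. The function name `split_by_modifiers` is made up for illustration, and whether these character classes cover everything in the Panphon list is not something I have checked:

```r
library(stringi)

example_sample <- c("ʔpoɣʔe", "mtoto", "nukapːiaʁaq", "boobal", "tamaru")

# Hypothetical helper: split on every code point, then glue modifiers back on
split_by_modifiers <- function(words, sep = "_") {
  # Step 1: add the separator after every code point that has a successor
  out <- stri_replace_all_regex(words, "(.)(?=.)", paste0("$1", sep))
  # Step 2: drop the separator in front of modifier letters and combining
  # marks; the capture group keeps the symbol itself
  out <- stri_replace_all_regex(out, paste0(sep, "([\\p{Lm}\\p{M}])"), "$1")
  # Step 3: drop the separator right after a tie bar so "k͡p" stays together
  stri_replace_all_regex(out, paste0("([\\x{035C}\\x{0361}])", sep), "$1")
}

split_by_modifiers(example_sample)[3]
# "n_u_k_a_pː_i_a_ʁ_a_q"
```

This does not consult the list of valid phones at all, so it will happily keep a base letter and a modifier together even when that combination is not in your inventory.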
There seem to be specific regex symbols for the IPA charset: \\b, though I am not sure if and how this is implemented in R, since \\b is already reserved for word boundaries from what I understand.
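Since you already have a list of valid phones, another option might be a greedy longest-match against that list. The following is only a sketch: `split_phones` is a made-up name, it assumes the words and the inventory use the same Unicode normalisation form, and any character not covered by the inventory simply falls through as a single-character token. I have only tried the idea on the small example vectors from the question, not on the full 6,487 entries:

```r
library(stringi)

example_sample <- c("ʔpoɣʔe", "mtoto", "nukapːiaʁaq", "boobal", "tamaru")
example_ipa <- c("ḁː", "b͡dːˤ", "k", "k͡pˠ", "ʁ", "o", "ʔ", "pː", "p")

# Hypothetical helper: tokenise words using the longest matching inventory entry
split_phones <- function(words, inventory, sep = "_") {
  # Longest entries first, so "pː" is tried before "p"
  inventory <- inventory[order(stri_length(inventory), decreasing = TRUE)]
  # Quote every entry literally with \Q...\E; "." is the fallback for
  # characters that are not in the inventory
  pattern <- paste0(c(paste0("\\Q", inventory, "\\E"), "."), collapse = "|")
  phones <- stri_extract_all_regex(words, pattern)
  vapply(phones, paste, character(1), collapse = sep)
}

split_phones(example_sample, example_ipa)[3]
# "n_u_k_a_pː_i_a_ʁ_a_q"
```

The ordering matters because the regex engine takes the first alternative that matches at a given position, not the longest one.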