I am currently analyzing Instagram posts, which often contain hashtags made up of more than one word (e.g. #pictureoftheday).
However, tokenizing them with the R package tidytext yields a single token per hashtag. Instead, I would like to get one token per word, like "picture" "of" "the" "day". Unfortunately, I have not found a package capable of doing so.
Do you know of an R package that can do this?
Thanks in advance!
As far as I know, you can't split joined words without first knowing which substrings are actual words. If the hashtags contained a delimiter, splitting would be easy; without one, it becomes very complex. You need a language-dependent dictionary.
You will probably have to preprocess the hashtags in a separate step before tokenizing. Writing your own dictionary-based method is often a good solution, but it is very time-intensive.
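To give an idea of what a dictionary-based method could look like, here is a minimal sketch in base R. The function name `segment_hashtag` and the tiny word list `dict` are made up for illustration; a real word list would have to come from a proper lexicon for your language. The sketch uses dynamic programming to pick the split with the fewest dictionary words, and returns the hashtag unsplit when no full split exists.

```r
# Hypothetical helper (not from any package): split a hashtag into dictionary words.
segment_hashtag <- function(tag, dict) {
  s <- tolower(sub("^#", "", tag))  # drop the leading "#" and normalise case
  n <- nchar(s)
  if (n == 0) return(character(0))

  # best[[i + 1]] holds the split of the first i characters that uses the
  # fewest dictionary words, or NULL if no split has been found.
  best <- vector("list", n + 1)
  best[[1]] <- character(0)

  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      prefix <- best[[start]]
      word <- substr(s, start, end)
      if (!is.null(prefix) && word %in% dict) {
        candidate <- c(prefix, word)
        current <- best[[end + 1]]
        if (is.null(current) || length(candidate) < length(current)) {
          best[[end + 1]] <- candidate
        }
      }
    }
  }

  # Fall back to the unsplit hashtag when the dictionary cannot cover it.
  if (is.null(best[[n + 1]])) s else best[[n + 1]]
}

# Toy dictionary, an assumption for this example only.
dict <- c("picture", "pic", "of", "the", "day")

segment_hashtag("#pictureoftheday", dict)
# returns c("picture", "of", "the", "day")
```

The words could then be joined with `paste(words, collapse = " ")` and put back into the text before tokenizing it. Note that the quality of the result depends entirely on the dictionary, and ambiguous hashtags may be split in a way you did not intend.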