How can I change the default settings so that the hashtag symbol and its word are kept together as one token (i.e. #company and not # and company)?
library(udpipe)
library(data.table)
# Load the pre-trained English model once; udpipe_load_model() takes the path to the .udpipe file
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model,
"This is a better #company than i thought @mr_jones!",
tokenizer = "tokenizer",
tagger = "default",
trace = TRUE)
anno_op3 <- as.data.table(as.data.frame(anno_op3))
View(anno_op3)
What I am getting is # and company as two separate tokens. I want #company as a single token, in the same way that @mr_jones already comes out as a single token.
You can combine other tokenisation tools with the udpipe R package, as shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, below a tokeniser specific to Twitter messages is used, and after that parts-of-speech tagging, morphological feature annotation and dependency parsing are done with udpipe.
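As a rough illustration of that approach without any extra tokeniser package, the sketch below uses a plain base R regular expression to keep #hashtags and @mentions whole, and then hands the tokens to udpipe with tokenizer = "vertical" (one token per line). The helper tokenize_keep_hashtags and the model path are assumptions made for this example and are not part of udpipe. The regular expression is deliberately simple, so it will also split contractions such as "don't" into separate pieces.

```r
# Hypothetical helper (not part of udpipe): an optional leading # or @ followed
# by letters, digits or underscores is one token; any other non-space character
# (punctuation and so on) becomes a token of its own.
tokenize_keep_hashtags <- function(text) {
  pattern <- "[#@]?[[:alnum:]_]+|[^[:space:][:alnum:]_]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

tokens <- tokenize_keep_hashtags("This is a better #company than i thought @mr_jones!")

# The "vertical" tokenizer of udpipe expects one token per line
txt <- paste(tokens, collapse = "\n")

# Assumed location of the model file; adjust to wherever your .udpipe file lives
model_file <- "english-ewt-ud-2.3-181115.udpipe"

if (requireNamespace("udpipe", quietly = TRUE) && file.exists(model_file)) {
  ud_model <- udpipe::udpipe_load_model(model_file)
  # Skip udpipe's own tokenisation, but still run tagging and parsing
  anno <- udpipe::udpipe_annotate(ud_model, x = txt,
                                  tokenizer = "vertical",
                                  tagger = "default",
                                  parser = "default")
  anno <- as.data.frame(anno)
  print(anno[, c("token_id", "token", "upos")])
}
```

Because the text is already tokenised, udpipe only does the tagging and parsing here, so the tokens passed in (including #company) should come back unchanged in the token column.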