How to augment udpipe models with a custom dictionary?


Is there a way to add a dictionary of custom, user-defined words to the udpipe models?

For example, with the default English model used below, words such as R, Python, SQL, javascript, Excel and noSQL should have been identified as keywords, but none of them are.

I would like to augment the default English model with my own custom words, so that the textrank_keywords function can better identify relevant keywords.

library(udpipe)
library(textrank)  # provides textrank_keywords()
library(dplyr)
tagger <- udpipe_download_model("english")
tagger <- udpipe_load_model(tagger$file_model)

# read data
rawdata <- c("Automating and R/Python package development.","You have a sound knowledge of another data analysis language (R,Python, SQL, javascript) and you don't care in which relational database, Excel, bigdata or noSQL store your data is located.")

# annotate
rawdata_annotate <- udpipe_annotate(tagger, rawdata) %>% as_tibble()

keyw <- textrank_keywords(rawdata_annotate$lemma,
                          relevant = rawdata_annotate$upos %in% c("PROPN","NOUN", "VERB", "ADJ"))

have <- keyw$terms
have
[1] "package"    "analysis"   "sound"      "relational"

rawdata_annotate %>% dplyr::filter(token %in% c('R', 'Python', 'SQL', 'javascript', 'Excel', 'noSQL')) %>% dplyr::select(token, lemma, upos)

  token      lemma      upos 
  <chr>      <chr>      <chr>
1 R          R          PROPN
2 Python     python     NOUN 
3 R          r          NOUN 
4 Python     python     NOUN 
5 SQL        sql        NOUN 
6 javascript javascript NOUN 
7 Excel      Excel      PROPN
8 noSQL      nosql      AUX  


I think I found the answer. I would need to create a custom CoNLL-U file that contains my own annotations for the custom words, and then train a new model on that file.

https://bnosac.github.io/udpipe/docs/doc3.html
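
A minimal sketch of that workflow, under some assumptions: to_conllu_sentence() and train_custom_model() are hypothetical helpers (they are not part of udpipe), the example sentence and its tags are made up for illustration, and the file names are arbitrary. Only udpipe_train() and its arguments come from the udpipe package. The file has no HEAD/DEPREL columns filled in, so the dependency parser is switched off here. That should be enough for textrank_keywords, which only uses the lemma and upos columns.

```r
# Hypothetical helper (not part of udpipe): format one hand-annotated sentence
# as CoNLL-U lines. Only ID, FORM, LEMMA and UPOS are filled in; the other six
# columns (XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) are left as "_".
to_conllu_sentence <- function(sent_id, text, tokens, lemmas, upos) {
  stopifnot(length(tokens) == length(lemmas), length(tokens) == length(upos))
  rows <- paste(seq_along(tokens), tokens, lemmas, upos,
                "_", "_", "_", "_", "_", "_", sep = "\t")
  # A blank line terminates each sentence in CoNLL-U.
  c(paste0("# sent_id = ", sent_id), paste0("# text = ", text), rows, "")
}

# Made-up example sentence in which the custom words get the tags we want.
conllu_lines <- to_conllu_sentence(
  sent_id = "custom-1",
  text    = "We store data in noSQL and analyse it with R and Python",
  tokens  = c("We", "store", "data", "in", "noSQL", "and",
              "analyse", "it", "with", "R", "and", "Python"),
  lemmas  = c("we", "store", "data", "in", "noSQL", "and",
              "analyse", "it", "with", "R", "and", "Python"),
  upos    = c("PRON", "VERB", "NOUN", "ADP", "PROPN", "CCONJ",
              "VERB", "PRON", "ADP", "PROPN", "CCONJ", "PROPN")
)

conllu_file <- file.path(tempdir(), "custom_words.conllu")
writeLines(conllu_lines, conllu_file)

# Hypothetical wrapper around udpipe::udpipe_train(). The parser is disabled
# because the sketch file above carries no dependency annotation.
train_custom_model <- function(conllu_file,
                               model_file = "custom_english.udpipe") {
  udpipe::udpipe_train(file = model_file,
                       files_conllu_training = conllu_file,
                       annotation_tokenizer = "default",
                       annotation_tagger = "default",
                       annotation_parser = "none")
}
```

Calling train_custom_model(conllu_file) would then write the model file, which can be loaded with udpipe_load_model() like the downloaded one. A single sentence is far too little data for a usable model; in practice the custom sentences would probably need to be appended to a full English treebank in CoNLL-U format before retraining.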