How to extract entity names with spacyr using custom data?

Good afternoon,

I am trying to sort a large corpus of normative texts of different lengths and to tag their parts of speech (POS). For that purpose, given the size of the corpus, I am using the tm and udpipe libraries.

The other task I need to perform is named entity recognition. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model on a few documents from the corpus that I have personally validated.

How can I use spacy_extract_entity() with custom data? Or could this be done with quanteda and spacyr together?

Thanks in advance.

I did the POS task as follows, using a couple of helper functions.

suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))

# load the corpus

tm_corpus <- VCorpus(
  DirSource("working_path", pattern = "\\.pdf$"),  # regex: only files ending in .pdf
  readerControl = list(reader = readPDF, language = "es-419"))

# load udpipe

library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)

# functions to annotate the corpus

f_udpipe_anot <- function(n){
  # text of document n as a plain character vector
  txt <- as.character(tm_corpus[[n]]) %>%
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  as.data.frame(y)  # return the annotation as a data frame
}

pinkillazo <- function(desde, hasta){
  # annotate documents desde..hasta, then stack the results once at the end
  anotaciones <- lapply(desde:hasta, function(item){
    print(item)
    f_udpipe_anot(item)
  })
  do.call(rbind, anotaciones)
}

leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe

To identify the named entities, I have tried this:

library(quanteda)
library(spacyr)

spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)

organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",  # one of "data.frame" or "list"
  type = "all",           # one of "all", "named", "extended"
  multithread = TRUE
)

I am getting wrong organization names as well as other misclassifications. I thought that multithreading would make this task easier, but that is not the case.

There is 1 best solution below

If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which implement Conditional Random Fields and Maximum Entropy Models respectively. Both can be used alongside the udpipe annotation.
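
As a rough illustration of the crfsuite route, here is a minimal sketch. The toy tokens, the BIO labels and the helper add_features() are hypothetical and only there to make the example self-contained; in practice doc_id, token and upos would come from the udpipe data frame (leyes_udpipe_POS in the question) and the labels from the manually validated documents. The feature set is deliberately tiny and is an assumption, not a recommendation.

```r
library(crfsuite)

# Hypothetical hand-labelled tokens in BIO format. In practice doc_id, token
# and upos would come from the udpipe annotation, label from manual review.
train <- data.frame(
  doc_id = rep(c("d1", "d2"), each = 5),
  token  = c("El", "Ministerio", "de", "Salud", "informa",
             "La", "Corte", "Suprema", "resuelve", "hoy"),
  upos   = c("DET", "PROPN", "ADP", "PROPN", "VERB",
             "DET", "PROPN", "PROPN", "VERB", "ADV"),
  label  = c("O", "B-ORG", "I-ORG", "I-ORG", "O",
             "O", "B-ORG", "I-ORG", "O", "O"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn each token into prefixed string attributes
# (lower-cased token, its POS tag, and the POS tag of the previous token).
add_features <- function(d) {
  prev <- ave(d$upos, d$doc_id, FUN = function(v) c("BOS", head(v, -1)))
  data.frame(
    token     = paste0("token=", tolower(d$token)),
    upos      = paste0("upos=", d$upos),
    upos_prev = paste0("upos[-1]=", prev),
    stringsAsFactors = FALSE
  )
}

# Start from the package defaults and shorten training for the toy data.
opts <- crf_options("lbfgs")$default
opts$max_iterations <- 25

model <- crf(
  x = add_features(train), y = train$label, group = train$doc_id,
  method = "lbfgs", options = opts,
  file = tempfile(fileext = ".crfsuite")  # where the fitted model is stored
)

# Tag a new, unlabelled sentence (also hypothetical).
new <- data.frame(
  doc_id = "d3",
  token  = c("El", "Ministerio", "de", "Educación", "responde"),
  upos   = c("DET", "PROPN", "ADP", "PROPN", "VERB"),
  stringsAsFactors = FALSE
)
pred <- predict(model, newdata = add_features(new), group = new$doc_id)
new$entity <- pred$label
print(new)
```

With only ten training tokens the predictions are not meaningful; the point is the shape of the workflow: one row per token, string attributes per row, and a grouping variable that marks the sequences.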

If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).
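
For orientation, a BiLSTM tagger in the torch package could look roughly like the sketch below. The module name bilstm_tagger, the vocabulary size, the number of labels and the layer sizes are all hypothetical, the input is random token indices rather than real sentencepiece or word2vec output, and no training loop is shown. It only illustrates the forward pass from token indices to per-token label scores.

```r
library(torch)

# Hypothetical module: embedding -> bidirectional LSTM -> per-token label scores.
bilstm_tagger <- nn_module(
  "bilstm_tagger",
  initialize = function(vocab_size, n_labels, emb_dim = 50, hidden = 64) {
    self$emb  <- nn_embedding(vocab_size, emb_dim)
    self$lstm <- nn_lstm(emb_dim, hidden, batch_first = TRUE, bidirectional = TRUE)
    self$out  <- nn_linear(2 * hidden, n_labels)  # forward + backward states
  },
  forward = function(x) {
    h <- self$lstm(self$emb(x))[[1]]  # first element is the sequence output
    self$out(h)
  }
)

model <- bilstm_tagger(vocab_size = 100, n_labels = 3)

# Two made-up sentences of 7 token indices each (indices are 1-based in R torch).
x <- torch_tensor(matrix(sample(1:99, 14, replace = TRUE), nrow = 2),
                  dtype = torch_long())
logits <- model(x)
print(dim(logits))  # batch size, sequence length, number of labels
```

The embedding layer is where pretrained word2vec vectors could be loaded instead of learning them from scratch, and the token indices would come from whichever tokeniser is chosen.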