How to extract entity names with spacyr using custom data?

Question

177 Views Asked by Sergio A. Gottret Rios At 07 June 2025 at 18:44

Good afternoon,

I am trying to sort a large corpus of normative texts of different lengths and to tag their parts of speech (POS). Given the size of the corpus, I have been using the tm and udpipe libraries for that purpose.

The other task I need to perform is named entity recognition. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model on a few documents from the corpus that I have personally validated.

How can I use spacy_extract_entity() with custom data? Or could this be done with quanteda and spacyr?

Thanks in advance.

I did the POS tagging as follows, using a couple of helper functions.

suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))

# load the corpus

tm_corpus <- VCorpus(
  DirSource("working_path", pattern = "\\.pdf$"),
  readerControl = list(reader = readPDF, language = "es-419")
)

# load udpipe

library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)

# functions to annotate the corpus

f_udpipe_anot <- function(n){
  # Collapse document n into a plain character vector
  txt <- as.character(tm_corpus[[n]]) %>%
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  # Return the annotation as a data frame (one row per token)
  as.data.frame(y)
}

pinkillazo <- function(desde, hasta){
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item)  # progress indicator
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}

leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe

To identify the named entities, I have tried this:

library(quanteda)
library(spacyr)

spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)

# Pass a single value for output and type ("data.frame" and "all" are the defaults)
organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",
  type = "all",
  multithread = TRUE
)

I am getting wrong organization names as well as other misclassifications. I thought that multithread = TRUE would make this task easier, but that is not the case.

**AudioBubble** · Answer 1

If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which implement Conditional Random Fields and Maximum Entropy models, respectively. Both can be used alongside the udpipe annotation.
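As a rough illustration of the crfsuite route, here is a minimal sketch. It assumes you already have tokens and universal POS tags (for example from udpipe_annotate()) and that you have added your own BIO labels by hand. The toy sentences, the make_features() helper and the feature choices are made up for this example and are not taken from the question's corpus.

```r
library(crfsuite)

# Hypothetical hand-validated tokens in BIO format (B-ORG / I-ORG / O).
# In practice, token and upos would come from udpipe_annotate().
train <- data.frame(
  doc_id = rep(c("d1", "d2"), each = 6),
  token  = c("El", "Ministerio", "de", "Salud", "emite", "normas",
             "La", "ley", "regula", "al", "Banco", "Central"),
  upos   = c("DET", "PROPN", "ADP", "PROPN", "VERB", "NOUN",
             "DET", "NOUN", "VERB", "ADP", "PROPN", "PROPN"),
  label  = c("O", "B-ORG", "I-ORG", "I-ORG", "O", "O",
             "O", "O", "O", "O", "B-ORG", "I-ORG"),
  stringsAsFactors = FALSE
)

# Build prefixed features for the current token and its neighbours' POS tags.
# NA marks a missing neighbour at a document boundary.
make_features <- function(d) {
  prev <- function(v) c(NA, head(v, -1))
  nxt  <- function(v) c(tail(v, -1), NA)
  tag  <- function(prefix, v) ifelse(is.na(v), NA, paste0(prefix, v))
  data.frame(
    token     = tag("token=", d$token),
    upos      = tag("upos=", d$upos),
    upos_prev = tag("upos[-1]=", ave(d$upos, d$doc_id, FUN = prev)),
    upos_next = tag("upos[+1]=", ave(d$upos, d$doc_id, FUN = nxt)),
    stringsAsFactors = FALSE
  )
}

# Train a linear-chain CRF; each doc_id is treated as one sequence.
model <- crf(
  y = train$label,
  x = make_features(train),
  group = train$doc_id,
  method = "lbfgs",
  file = tempfile(fileext = ".crfsuite"),
  options = list(max_iterations = 50, c1 = 0, c2 = 1)
)

# Label a new, unseen sentence with the trained model.
new <- data.frame(
  doc_id = "d3",
  token  = c("El", "Banco", "Central", "emite", "normas"),
  upos   = c("DET", "PROPN", "PROPN", "VERB", "NOUN"),
  stringsAsFactors = FALSE
)
pred <- predict(model, newdata = make_features(new), group = new$doc_id)
new$entity <- pred$label
print(new)
```

With only two training sentences the predictions are not meaningful; the point is the shape of the workflow: annotate with udpipe, attach validated labels, build features, train, predict.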

If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).
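To give an idea of what such a flow could look like, below is a bare-bones BiLSTM tagger skeleton in torch for R. Everything specific in it is an assumption made for illustration: the vocabulary size, the three tag ids, the made-up token ids and the hyperparameters. It leaves out the sentencepiece tokenisation and the word2vec initialisation entirely and simply learns an embedding from scratch.

```r
library(torch)

# Hypothetical sizes: 100 token ids, 3 tags (e.g. O, B-ORG, I-ORG).
vocab_size <- 100
n_tags <- 3

bilstm_tagger <- nn_module(
  "BiLSTMTagger",
  initialize = function(vocab_size, n_tags, emb_dim = 16, hidden = 32) {
    self$emb  <- nn_embedding(vocab_size, emb_dim)
    self$lstm <- nn_lstm(emb_dim, hidden, batch_first = TRUE, bidirectional = TRUE)
    # Forward and backward states are concatenated, hence 2 * hidden.
    self$out  <- nn_linear(2 * hidden, n_tags)
  },
  forward = function(x) {
    states <- self$lstm(self$emb(x))[[1]]  # (batch, seq_len, 2 * hidden)
    self$out(states)                       # (batch, seq_len, n_tags)
  }
)

model <- bilstm_tagger(vocab_size, n_tags)

# One made-up sentence of four token ids with its tag ids (both 1-based).
x <- torch_tensor(matrix(c(5L, 17L, 3L, 42L), nrow = 1))
y <- torch_tensor(c(1L, 2L, 3L, 1L))

# A few gradient steps on the single toy example.
optimizer <- optim_adam(model$parameters, lr = 0.01)
for (epoch in 1:20) {
  optimizer$zero_grad()
  logits <- model(x)
  loss <- nnf_cross_entropy(logits$view(c(-1, n_tags)), y)
  loss$backward()
  optimizer$step()
}

# Most likely tag id for each token
pred <- torch_argmax(model(x), dim = 3)
```

A real model would need padding and batching over many sentences, a held-out validation set, and a mapping between tokens, tags and their integer ids.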
