Extract proper nouns from text in R?


Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?

That is, a function like

proper_nouns <- function(text_input) {
  # ...
}

such that it would extract a list of proper nouns from the text input(s).

Examples

Here is a set of 7 text inputs (some easy, some harder):

text_inputs <- c("a rainy London day",
  "do you know John Smith?",
  "sail the Adriatic",
  
  # tougher examples
  
  "Hey Tom, where's Fred?", # more than one proper noun in the sentence
  "Hi Lisa, I'm Joan.", # more than one proper noun, separated by a capitalized word
  "sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
  "The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
  )

And here's what such a function, set of rules, or AI should return:

proper_nouns(text_inputs)

[[1]]
[1] "London"

[[2]]
[1] "John Smith" 

[[3]]
[1] "Adriatic"

[[4]]
[1] "Tom"    "Fred"

[[5]]
[1] "Lisa"    "Joan"

[[6]]
[1] "Gulf of Carpentaria"

[[7]]
[1] "Joost van der Westhuizen"

Problems: simple regexes are imperfect

Consider some simple regex rules, which have obvious imperfections:

  • Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized). Problem: will miss proper nouns at start of sentence.

  • Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.

    • Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
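For reference, a minimal sketch of these regex rules in base R (the function name `proper_nouns_regex` is just illustrative); it exhibits exactly the imperfections listed above:

```r
# Naive baseline: a run of consecutive capitalized words is treated
# as one proper noun. Known flaws: it picks up sentence-initial words
# and interjections ("Hey Tom"), and it splits names joined by
# lowercase words ("Gulf" / "Carpentaria").
proper_nouns_regex <- function(text_input) {
  regmatches(text_input,
             gregexpr("[A-Z][a-z]+(\\s+[A-Z][a-z]+)*", text_input))
}
```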

Question

The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.


You can start by taking a look at the spacyr library, which wraps spaCy's part-of-speech tagger.

library(spacyr)
spacy_initialize()  # load a spaCy language model (run spacy_install() once beforehand)

result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == "PROPN")
split(proper_nouns$token, proper_nouns$doc_id)

#$text1
#[1] "London"

#$text2
#[1] "John"  "Smith"

#$text3
#[1] "Adriatic"

#$text4
#[1] "Hey" "Tom"

#$text5
#[1] "Lisa" "Joan"

#$text6
#[1] "Gulf"        "Carpentaria"

This treats every token separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if you need multi-word proper nouns.
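If multi-word names matter, one possible refinement (a sketch, not tested against these inputs) is spacyr's named-entity extractor, which returns whole entities rather than single tokens. Note the entity types kept below are an assumption about which spaCy labels correspond to proper nouns:

```r
library(spacyr)
spacy_initialize()

# spacy_extract_entity() returns one row per entity, so multi-token
# entities (person names, geographic features, etc.) stay in one piece.
entities <- spacy_extract_entity(text_inputs)
proper <- subset(entities, ent_type %in% c("PERSON", "GPE", "LOC", "FAC", "ORG"))
split(proper$text, proper$doc_id)
```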