Extract proper nouns from text in R?


Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?

That is, a function like

proper_nouns <- function(text_input) {
  # ...
}

such that it would extract a list of proper nouns from the text input(s).

Examples

Here is a set of 7 text inputs (some easy, some harder):

text_inputs <- c("a rainy London day",
  "do you know John Smith?",
  "sail the Adriatic",
  
  # tougher examples
  
  "Hey Tom, where's Fred?", # more than one proper noun in the sentence
  "Hi Lisa, I'm Joan.", # more than one proper noun, separated by a capitalized word
  "sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
  "The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
  )

And here's what such a function, set of rules, or AI should return:

proper_nouns(text_inputs)

[[1]]
[1] "London"

[[2]]
[1] "John Smith" 

[[3]]
[1] "Adriatic"

[[4]]
[1] "Tom"    "Fred"

[[5]]
[1] "Lisa"    "Joan"

[[6]]
[1] "Gulf of Carpentaria"

[[7]]
[1] "Joost van der Westhuizen"

Problems: simple regexes are imperfect

Consider some simple regex rules, which have obvious imperfections:

  • Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized). Problem: will miss proper nouns at start of sentence.

  • Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.

    • Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
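For reference, a minimal sketch of these regex rules in base R (the function name `proper_nouns_regex` is just illustrative); it exhibits exactly the imperfections listed above:

```r
# Naive baseline: a run of consecutive capitalized words is treated
# as one proper noun. Known flaws: it picks up sentence-initial words
# and interjections ("Hey Tom"), and it splits names joined by
# lowercase words ("Gulf" / "Carpentaria").
proper_nouns_regex <- function(text_input) {
  regmatches(text_input,
             gregexpr("[A-Z][a-z]+(\\s+[A-Z][a-z]+)*", text_input))
}
```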

Question

The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.


You can start by taking a look at the spacyr library, which wraps spaCy's part-of-speech tagger.

library(spacyr)
spacy_initialize()  # load a spaCy language model (run spacy_install() once beforehand)

result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == "PROPN")
split(proper_nouns$token, proper_nouns$doc_id)

#$text1
#[1] "London"

#$text2
#[1] "John"  "Smith"

#$text3
#[1] "Adriatic"

#$text4
#[1] "Hey" "Tom"

#$text5
#[1] "Lisa" "Joan"

#$text6
#[1] "Gulf"        "Carpentaria"

This treats every token separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if you need multi-word proper nouns.
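If multi-word names matter, one possible refinement (a sketch, not tested against these inputs) is spacyr's named-entity extractor, which returns whole entities rather than single tokens. Note the entity types kept below are an assumption about which spaCy labels correspond to proper nouns:

```r
library(spacyr)
spacy_initialize()

# spacy_extract_entity() returns one row per entity, so multi-token
# entities (person names, geographic features, etc.) stay in one piece.
entities <- spacy_extract_entity(text_inputs)
proper <- subset(entities, ent_type %in% c("PERSON", "GPE", "LOC", "FAC", "ORG"))
split(proper$text, proper$doc_id)
```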