Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?
That is, a function like
proper_nouns <- function(text_input) {
# ...
}
such that it would extract a list of proper nouns from the text input(s).
Examples
Here is a set of 7 text inputs (some easy, some harder):
text_inputs <- c("a rainy London day",
"do you know John Smith?",
"sail the Adriatic",
# tougher examples
"Hey Tom, where's Fred?", # more than one proper noun in the sentence
"Hi Lisa, I'm Joan.", # more than one proper noun in the sentence, separated by capitalized word
"sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
"The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
)
And here's what such a function, set of rules, or AI should return:
proper_nouns(text_inputs)
[[1]]
[1] "London"
[[2]]
[1] "John Smith"
[[3]]
[1] "Adriatic"
[[4]]
[1] "Tom" "Fred"
[[5]]
[1] "Lisa" "Joan"
[[6]]
[1] "Gulf of Carpentaria"
[[7]]
[1] "Joost van der Westhuizen"
Problems: simple regexes are imperfect
Consider some simple regex rules, which have obvious imperfections:
- Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized). Problem: will miss proper nouns at the start of a sentence.
- Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.
  - Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
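For reference, the two rules can be sketched in base R roughly as follows. This is only a baseline under stated assumptions: the function name proper_nouns_regex is made up for illustration, and "first word in the sentence" is simplified to "first word of each input string", so multi-sentence inputs are not handled.

```r
# Baseline sketch of the two regex rules (hypothetical helper, base R only).
proper_nouns_regex <- function(text_input) {
  # Rule 1 (simplified): drop the first word of each input string, since it
  # would be capitalized whether or not it is a proper noun.
  rest <- sub("^[[:space:]]*[^[:space:]]+[[:space:]]*", "", text_input)
  # Rule 2: a run of capitalized words separated by single spaces is treated
  # as one proper noun.
  pattern <- "[[:upper:]][[:lower:]]+( [[:upper:]][[:lower:]]+)*"
  # Returns a list with one character vector of matches per input string.
  regmatches(rest, gregexpr(pattern, rest))
}

proper_nouns_regex(c("do you know John Smith?",
                     "sail the Gulf of Carpentaria",
                     "London is rainy"))
```

On these three inputs the sketch keeps "John Smith" together, splits "Gulf of Carpentaria" into "Gulf" and "Carpentaria", and returns nothing for "London is rainy", which are exactly the failure modes listed above.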
Question
The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.
You can start by taking a look at the spacyr library. It treats every word separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if that is what you require.
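One possible post-processing step is to join consecutive tokens tagged as proper nouns within each document. The sketch below assumes a token table with doc_id, token and pos columns, as spacyr::spacy_parse() returns, and a "PROPN" tag for proper nouns. The helper name merge_proper_nouns is made up, and the table is hand-written with hand-assigned tags so the example runs without a spaCy installation; it is not real parser output.

```r
# Hand-built stand-in for a spacy_parse() result (tags assigned by hand).
# With spaCy installed, something like this would produce the real table:
#   parsed <- spacyr::spacy_parse(text_inputs, pos = TRUE)
mock_parsed <- data.frame(
  doc_id = c(rep("text1", 6), rep("text2", 7)),
  token  = c("do", "you", "know", "John", "Smith", "?",
             "Hey", "Tom", ",", "where", "'s", "Fred", "?"),
  pos    = c("AUX", "PRON", "VERB", "PROPN", "PROPN", "PUNCT",
             "INTJ", "PROPN", "PUNCT", "ADV", "AUX", "PROPN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste each run of adjacent PROPN tokens into one string.
merge_proper_nouns <- function(parsed) {
  docs <- unique(parsed$doc_id)
  out <- lapply(docs, function(d) {
    rows <- parsed[parsed$doc_id == d, ]
    is_propn <- rows$pos == "PROPN"
    # Number the runs of identical TRUE/FALSE values, one id per token.
    runs <- rle(is_propn)
    run_id <- rep(seq_along(runs$lengths), runs$lengths)
    # Keep only the PROPN tokens and collapse each run with spaces.
    pieces <- split(rows$token[is_propn], run_id[is_propn])
    unname(vapply(pieces, paste, character(1), collapse = " "))
  })
  names(out) <- docs
  out
}

merge_proper_nouns(mock_parsed)
```

This only merges tokens that sit directly next to each other, so a name like "Gulf of Carpentaria" would still be split unless the tagger labels "of" as part of the proper noun; further rules would be needed for such cases.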