Extract Wikipedia Entities from Text


Is there any way to extract all the Wikipedia entities from a text using Wikipedia2Vec? Or is there another way to do the same?

Example:

Text : "Scarlett Johansson is an American actress."  
Entities : [ 'Scarlett Johansson' , 'American' ]

I want to do this in Python.

Thanks


There are 2 best solutions below

Jindřich

You can use spaCy:

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp('Scarlett Johansson is an American actress.')
print([(X.text, X.label_) for X in doc.ents])

And you get:

[('Scarlett Johansson', 'PERSON'), ('American', 'NORP')]

Find more in the spaCy documentation.
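
If you only want the entity strings (to match the output format in the question), you can take ent.text from each entity. A minimal sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

# Load the small English pipeline (install it first with:
#   python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

doc = nlp('Scarlett Johansson is an American actress.')

# Keep only the surface text of each recognized entity.
entities = [ent.text for ent in doc.ents]
print(entities)  # expected: ['Scarlett Johansson', 'American']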

alvas

Here's an NLTK version (it may not be as accurate as spaCy):

from nltk import Tree
from nltk import ne_chunk, pos_tag, word_tokenize

def get_continuous_chunks(text, chunk_func=ne_chunk):
    # Tokenize, POS-tag, then chunk named entities into subtrees.
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if type(subtree) == Tree:
            # Collect the tokens of this named-entity subtree.
            current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            # A non-entity token ends the current chunk; flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []

    # Flush any chunk left over at the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk


text = 'Scarlett Johansson is an American actress.'
print(get_continuous_chunks(text))
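
If you haven't used NLTK's tagger and chunker before, you may also need to download the required resources once (resource names can vary slightly between NLTK versions):

import nltk

# One-time downloads for tokenization, POS tagging, and NE chunking.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')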