spacy aggressive lemmatization and removing unexpected words

I am trying to clean some text data. First I removed the stop words, then I tried to lemmatize the text. But some words, such as nouns, are removed in the process.

Sample Data

https://drive.google.com/file/d/1p9SKWLSVYeNScOCU_pEu7A08jbP-50oZ/view?usp=sharing

Updated code:

# Libraries  
import spacy
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['covid', 'COVID-19', 'coronavirus'])

article= pd.read_csv("testdata.csv")
data = article.title.values.tolist()
nlp = spacy.load('en_core_web_sm')

def sent_to_words(sentences):
    for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuation

data_words = list(sent_to_words(data))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
print ("*** Text  After removing Stop words:   ")
print(data_words_nostops)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PRON'])
print ("*** Text  After Lemmatization:   ")

print(data_lemmatized)

The output after removing stop words is:

[['qaia', 'flags', 'amman', 'melbourne', 'jetstar', 'flights', 'recovery', 'plan'],
 ['western', 'amman', 'suburb', 'new', 'nsw', 'ground', 'zero', 'children'],
 ['flight', 'returned', 'amman', 'qaia', 'staff', 'contract', 'driving']]

The output after lemmatization is:

[['flight', 'recovery', 'plan'],
 ['suburb', 'ground'],
 ['return', 'contract', 'driving']]

For each record, I do not understand the following:

- 1st record: why are these words removed: 'qaia', 'flags', 'amman', 'melbourne', 'jetstar'?

- 2nd record: essential words are removed, the same as in the first record. Also, I was expecting 'children' to be converted to 'child'.

- 3rd record: 'driving' is not converted to 'drive'.

I was expecting that words such as "Amman" would not be removed. I was also expecting plural nouns to be converted to singular and verbs to be converted to the infinitive.
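For reference, a quick way to inspect which POS tag each token receives (and therefore why it is filtered out) is a small diagnostic like this, reusing the same pipeline and the first record's words:

import spacy

nlp = spacy.load('en_core_web_sm')

# Print each token with the POS tag and lemma spaCy assigns; tokens whose
# pos_ is not in allowed_postags are the ones that disappear.
doc = nlp('qaia flags amman melbourne jetstar flights recovery plan')
for token in doc:
    print(token.text, token.pos_, token.lemma_)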

What am I missing here? Thanks in advance.

Best Answer

I'm guessing that most of your issues are because you're not feeding spaCy full sentences and it's not assigning the correct part-of-speech tags to your words. This can cause the lemmatizer to return the wrong results. However, since you've only provided snippets of code and none of the original text, it's difficult to answer this question. Next time consider boiling down your question to a few lines of code that someone else can run on their machine EXACTLY AS WRITTEN, and providing a sample input that fails. See Minimal Reproducible Example
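To see why sentence context matters, compare the tags assigned to the same words with and without a full sentence around them (a small sketch; the exact tags you get depend on the model and its version):

import spacy

nlp = spacy.load('en_core_web_sm')

# The tagger sees very different evidence in a bag of words vs. a real
# sentence, and the lemma it produces follows the POS tag it picked.
for text in ['children driving', 'The children are driving.']:
    print([(token.text, token.pos_, token.lemma_) for token in nlp(text)])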

Here's an example that works and is close to what you're doing.

import spacy
import nltk; nltk.download('stopwords')  # make sure the stop-word list is available
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
allow_postags = set(['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'])
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut no longer works in spaCy 3.x
text = 'The children in Amman and Melbourne are too young to be driving.'
words = []
for token in nlp(text):
    if token.text not in stop_words and token.pos_ in allow_postags:
        words.append(token.lemma_)
print(' '.join(words))

This returns child Amman Melbourne young drive
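
Note that allow_postags above includes 'PROPN', while the lemmatization() in the question filters on ['NOUN', 'ADJ', 'VERB', 'ADV', 'PRON']. Proper nouns such as 'amman' and 'melbourne' are most likely tagged PROPN, so that filter discards them. Keeping them should just be a matter of extending the list (a sketch against the question's own function):

data_lemmatized = lemmatization(
    data_words_nostops,
    allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PRON', 'PROPN'],  # PROPN keeps proper nouns
)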