Removing stopwords also removes spaces between words during frequency distribution


I am looking to remove stopwords from the text to optimise my frequency distribution results.

My initial frequency distribution code is:

# Determine the frequency distribution
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

tokens = nltk.word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist

This returns

FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})

I want to remove the stopwords with the following code:

# Keep only alphanumeric items (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]

# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))

# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]

# Define an empty string variable.
filtered2_string = ''

for value in filtered:
    # Add each filtered token word to the string.
    filtered2_string = filtered2_string + value + ''
    

Now I run fdist again:

from nltk.tokenize import word_tokenize
trial = nltk.word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1

This returns

FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatisntpolluted': 1})

For reference, review_comments was built by concatenating the comments:

review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]


How do I remove the stopwords without also removing the spaces, so that the words are counted individually?




I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words.

There are 2 answers below

Mankind_2000 (best answer):

Cleaning in NLP tasks is generally performed on tokens rather than on the characters of a string, so you can leverage the built-in functionality/methods. However, you can always do this from scratch with your own logic on characters as well, if you need to. The stopwords in nltk are in the form of tokens, to be used for cleaning up your text corpus. You can add any further tokens you need to eliminate to that list. For example, if you want English stopwords and punctuation removed, do something like:

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tokens = word_tokenize(review_comments)

## Add any additional punctuation/words you want to eliminate here, like below
english_stop_plus_punct = set(stopwords.words('english') + ["call"] + 
                          list(string.punctuation + "“”’"))

filtered2 = [x for x in tokens if x.lower() not in english_stop_plus_punct]

fdist1 = nltk.FreqDist(filtered2)
fdist1

#### FreqDist({'presence': 3, 'meaning': 2, 'might': 2, 'Many': 1, 'psychologists': 1, 'knowing': 1, 'life': 1, 'drive': 1, 'look': 1, ...})

Example text from a write-up on "meaning of life":

review_comments = """ Many psychologists call knowing your life’s meaning “presence,” and the drive to look for it “search.” They are not mutually exclusive: You might or might not search, whether you already have a sense of meaning or not. Some people low in presence don’t bother searching—they are “stuck.” Some are high in presence but keep searching—we can call them “seekers.” """
yashaswi k:

Your code is tokenizing characters rather than words. Here is updated code with sample input data:

import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

nltk.download('stopwords')

review_comments="the quick brown fox jumps over the lazy dog !"

tokens = word_tokenize(review_comments)
print("tokens are",tokens)

word_freq = Counter(tokens)
print("freq",word_freq)

filtered = [word for word in tokens if word.isalnum()]
print("after alnum removal",filtered)

english_stopwords = set(stopwords.words('english'))

filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
print("after stopwords removal",filtered2)

filtered2_string = ' '.join(filtered2)

print(filtered2_string)

tokens = word_tokenize(filtered2_string)
print("tokens are",tokens)

word_freq = Counter(tokens)
print("freq",word_freq)

Output:

tokens are ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']
freq Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, '!': 1})
after alnum removal ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
after stopwords removal ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
quick brown fox jumps lazy dog
tokens are ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
freq Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})
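The join/re-tokenize round trip above only mirrors the structure of the original question; if you just want the counts, you can pass the filtered token list straight to Counter (or nltk.FreqDist). A minimal sketch, reusing filtered2 from the code above:

# Counting the filtered tokens directly, without rebuilding and re-tokenizing a string.
word_freq = Counter(filtered2)
print("freq", word_freq)

This prints the same result: Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1}).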