NLP data cleaning: maintain the frequency of words


I am cleaning a corpus using the following code:

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = set(...)  # union of my stopword, city, country, firstname, lastname and otherword sets
set(token) - to_remove
# {'account', 'follow'}

Because I take a set of the tokens, I lose the frequency of repeated words, which hurts the performance of tf-idf. I want to maintain the frequency of the words in the output. I have a large corpus; removing words manually with a for loop takes about a week, while the code above finishes the job in about 1.5 hours.
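To make the loss concrete, here is a minimal sketch (using collections.Counter) of the counts before and after taking the set:

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

print(Counter(token))       # Counter({'hi': 2, 'account': 2, 'follow': 2, 'is': 1, 'delhi': 1})
print(Counter(set(token)))  # every count collapses to 1, so tf-idf sees no repetition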

The output I want, in the fastest possible way:

['account', 'follow', 'follow', 'account']

There are 4 best solutions below

M Junaid

Try this; hopefully it will help you:

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Stand-in for the union of your stopword, city, country, firstname, lastname and otherword sets
to_remove = {'hi', 'is', 'delhi'}

# A list comprehension keeps every occurrence, so the frequency of repeated words is preserved
filtered_token = [word for word in token if word not in to_remove]

# To get the counts as well, count the filtered list and reconstruct from it
counts = Counter(filtered_token)
output = [word for word in counts for _ in range(counts[word])]

print(output)  # ['account', 'account', 'follow', 'follow']
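Since speed is also part of the question, here is a rough sketch of how one could compare a set-lookup filter against a list-lookup filter with timeit; the corpus below is a made-up synthetic stand-in and the function names are only illustrative:

import random
import timeit

# Synthetic stand-in corpus: 100,000 tokens drawn from a small vocabulary
vocab = ['hi', 'account', 'is', 'follow', 'delhi', 'india', 'santosh', 'vortex']
big_token_list = [random.choice(vocab) for _ in range(100_000)]
to_remove = {'hi', 'is', 'delhi', 'india', 'santosh', 'vortex'}

def set_lookup_filter():
    # O(1) membership test per token; duplicates are kept
    return [w for w in big_token_list if w not in to_remove]

def list_lookup_filter():
    # membership test scans the whole list for every token
    remove_list = list(to_remove)
    return [w for w in big_token_list if w not in remove_list]

print(timeit.timeit(set_lookup_filter, number=10))
print(timeit.timeit(list_lookup_filter, number=10))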
Ugochukwu Obinna
from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Stand-in for the union of your stopword, city, country, firstname, lastname and otherword sets
to_remove = set(['hi', 'is', 'delhi'])

filtered_token = [word for word in token if word not in to_remove]

# Count the filtered tokens so repeated words keep their frequency
word_counts = Counter(filtered_token)

# Rebuild a list with each remaining word repeated according to its count
output_token = []
for word in word_counts:
    output_token.extend([word] * word_counts[word])

print(output_token)  # ['account', 'account', 'follow', 'follow']
J i N

If you're looking to maintain the frequency, you can filter the words first and then reconstruct the list while keeping that frequency: count the occurrences of the words that survive filtering and rebuild the output list from those counts.

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Stand-in for the union of your stopword, city, country, firstname, lastname and otherword sets
to_remove = {'hi', 'is', 'delhi'}

filtered_token = [word for word in token if word not in to_remove]

# Count occurrences of the words that survived filtering
counts = Counter(filtered_token)

# Reconstruct the output list while maintaining frequency
output = []
for word in counts:
    output.extend([word] * counts[word])

print(output)  # ['account', 'account', 'follow', 'follow']
Andj

The OP's requirement to retain the frequency of the repeated words that are not filtered out is somewhat ambiguous. Retaining repeated words during filtering is easy, either through a list comprehension or by making a copy of the list and looping through the removal set to remove all instances from the copy; the second approach is not covered by the existing answers (see the addendum below).

But what I want to look at is the issue of retaining the frequency of tokens. There are two possible interpretations:

  1. the frequency of each retained token relative to the other retained words; the existing answers provide solutions to this.
  2. the frequency of each retained token relative to its frequency in the original list of tokens, i.e. taking into account the words that were filtered out.

In the OP's example, the list of tokens after filtering would be ['account', 'follow', 'follow', 'account'], in which each token has a relative frequency of 0.5. This satisfies the first interpretation, i.e. the two tokens have equal frequencies relative to each other. But the frequency of each of these two tokens in the original list of eight tokens is 0.25; if we want to retain that specific value, we need the second interpretation.

We want to filter out certain tokens, but also retain the notion that a token was removed at that position. One approach is to replace each unwanted token with a marker indicating its presence. In this example I will use the token <UNK>:

from collections import Counter

tokens = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

counts = Counter(tokens)
print(counts)
# Counter({'hi': 2, 'account': 2, 'follow': 2, 'is': 1, 'delhi': 1})

# Sets of tokens to remove
stopwords = {'a', 'an', 'is', 'the', 'am', 'hi'}
city = {'london', 'phnom penh', 'beijing', 'paris', 'delhi'}
country = {'india', 'new zealand', 'bhutan', 'laos'}
firstname = {'santosh', 'sanjay', 'sunil', 'khanh', 'lan'}
lastname = {'chen', 'wu', 'zhao', 'laurent', 'moreau'}
otherword = {'vortex'}

# Create a union of the above sets
to_remove = stopwords.union(city, country, firstname, lastname, otherword)

# Filter tokens
filtered_tokens = tokens[:]
for i in range(len(filtered_tokens)):
    filtered_tokens[i] = "<UNK>" if filtered_tokens[i] in to_remove else filtered_tokens[i]
print(filtered_tokens)
# ['<UNK>', '<UNK>', 'account', '<UNK>', 'follow', 'follow', 'account', '<UNK>']

# Count filtered tokens
filtered_counts = Counter(filtered_tokens)
print(filtered_counts)
# Counter({'<UNK>': 4, 'account': 2, 'follow': 2})

# Normalised frequencies
def normalised_counter(counter):
    total_word_count = sum(counter.values(), 0.0)
    for key in counter:
        counter[key] /= total_word_count
    return counter

print(normalised_counter(counts))
# Counter({'hi': 0.25, 'account': 0.25, 'follow': 0.25, 'is': 0.125, 'delhi': 0.125})

print(normalised_counter(filtered_counts))
# Counter({'<UNK>': 0.5, 'account': 0.25, 'follow': 0.25})

When doing further processing with filtered_tokens, it is possible to remove the <UNK> token or leave it in place, depending on the requirements of the task.
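For instance, a minimal sketch (continuing from the variables above; the _without_unk names are only illustrative) of dropping the marker when it is no longer needed:

# Drop the marker from the token list and recount if it is not needed downstream
tokens_without_unk = [tok for tok in filtered_tokens if tok != "<UNK>"]
counts_without_unk = Counter(tokens_without_unk)

print(tokens_without_unk)   # ['account', 'follow', 'follow', 'account']
print(counts_without_unk)   # Counter({'account': 2, 'follow': 2})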

Addendum:

If you want to create a filtered list to use with the first interpretation, but want to use list.remove() instead of a list comprehension:

# Copy the original list, then strip every instance of each unwanted token
filtered_token = tokens[:]
for item in to_remove:
    # list.remove() deletes one occurrence per call, so repeat until none are left
    while item in filtered_token:
        filtered_token.remove(item)

print(filtered_token)
# ['account', 'follow', 'follow', 'account']
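For completeness, a short sketch (reusing the normalised_counter function defined above; filtered_token_counts is an illustrative name) of the first interpretation's relative frequencies for this filtered list:

filtered_token_counts = Counter(filtered_token)
print(normalised_counter(filtered_token_counts))
# Counter({'account': 0.5, 'follow': 0.5})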