I'm trying to do some preprocessing on my dataset. Specifically, I'm trying to remove paywall language from the text (in bold below) but I keep getting an empty string as my output.
Here is the sample text:
In order to put a stop to the invasive bush honeysuckle or Lonicera Maackii currently taking over forests in Missouri and Kansas, according to Debbie Neff of Excelsior Springs has organized an… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.
and my custom function:
import re
import string
import nltk
from nltk.corpus import stopwords

# function to detect paywall-related text
def detect_paywall(text):
    paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
    for keyword in paywall_keywords:
        if re.search(r'\b{}\b'.format(keyword), text, flags=re.IGNORECASE):
            return True
    return False

# function for text preprocessing
def preprocess_text(text):
    # Check if the text contains paywall-related content
    if detect_paywall(text):
        # Remove paywall-related sentences or language from the text
        sentences = nltk.sent_tokenize(text)
        cleaned_sentences = [sentence for sentence in sentences if not detect_paywall(sentence)]
        cleaned_text = ' '.join(cleaned_sentences)
        return cleaned_text.strip()  # Remove leading/trailing whitespace
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return ' '.join(words)
I've tried modifying the list of words to detect but to no avail. However, I found that removing "subscribers" from the list does remove the second sentence of the paywall language. But that's not really ideal because there still remains the other half.
The function is also inconsistent because it works on this piece of text (as it will remove the paywall language), but not the one above.
Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.
This method avoids for loops by splitting text into phrases (the list of sentences), forming a filter containing all keywords at once, and reforming text without the sentences found to contain at least one of the keywords.

In its current state, this method ignores bold formatting and uses the simple str.split() instead of the regex re.split() or nltk, which is why it fails to split at the '…' character (a three-dot ellipsis stored as a single character).

With input:
Form pattern for filter:
Split text by sentences:
Find hits:
Eliminate those and reform the text:
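The code for the steps above did not survive in this copy of the answer, so the following is only a minimal sketch of the same idea, not the original code. It assumes the keyword list from the question, and the names KEYWORDS, PATTERN, PERIOD_BOUNDARY, ELLIPSIS_BOUNDARY and strip_paywall are all made up for illustration. Unlike the str.split() approach described above, it uses re.split() with a lookbehind so that the sentence-ending punctuation is kept, and it can optionally treat the single-character ellipsis '…' as a sentence boundary.

```python
import re

# Keyword list copied from the question's detect_paywall(); every other
# name in this sketch is hypothetical.
KEYWORDS = ["login", "subscription", "purchase a subscription", "subscribers"]

# Form one alternation pattern that covers all keywords at once.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)

# Sentence boundaries: whitespace preceded by '.', or by '.' or '…' (U+2026).
PERIOD_BOUNDARY = r"(?<=\.)\s+"
ELLIPSIS_BOUNDARY = r"(?<=[.…])\s+"


def strip_paywall(text, boundary=PERIOD_BOUNDARY):
    """Drop every sentence that contains at least one paywall keyword."""
    # Split text into phrases (the list of sentences).
    phrases = re.split(boundary, text)
    # Find hits and keep only the phrases without any keyword.
    kept = [phrase for phrase in phrases if not PATTERN.search(phrase)]
    # Reform the text from the surviving phrases.
    return " ".join(kept)


# Shortened version of the first sample text from the question.
text = (
    "Debbie Neff of Excelsior Springs has organized an… "
    "Premium Content is available to subscribers only. "
    "Please login here to access content or go here to purchase a subscription."
)

print(repr(strip_paywall(text)))
print(repr(strip_paywall(text, ELLIPSIS_BOUNDARY)))
```

With the period-only boundary, the teaser and the first paywall sentence stay glued together at the '…', so the whole text is discarded and the result is an empty string, which matches the symptom in the question. With ELLIPSIS_BOUNDARY the teaser becomes its own phrase and survives.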