Removing paywall language from a piece of text (pandas)


I'm trying to do some preprocessing on my dataset. Specifically, I'm trying to remove paywall language from the text (the final two sentences of the sample below, originally shown in bold), but I keep getting an empty string as my output.

Here is the sample text:

In order to put a stop to the invasive bush honeysuckle or Lonicera Maackii currently taking over forests in Missouri and Kansas, according to Debbie Neff of Excelsior Springs has organized an… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

and my custom function:

import re
import string
import nltk
from nltk.corpus import stopwords

# function to detect paywall-related text
def detect_paywall(text):
    paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]
    for keyword in paywall_keywords:
        if re.search(r'\b{}\b'.format(keyword), text, flags=re.IGNORECASE):
            return True
    return False

# function for text preprocessing
def preprocess_text(text):
    # Check if the text contains paywall-related content
    if detect_paywall(text):
        # Remove paywall-related sentences or language from the text
        sentences = nltk.sent_tokenize(text)
        cleaned_sentences = [sentence for sentence in sentences if not detect_paywall(sentence)]
        cleaned_text = ' '.join(cleaned_sentences)
        return cleaned_text.strip()  # Remove leading/trailing whitespace

    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return ' '.join(words)

I've tried modifying the list of keywords to detect, but to no avail. I did find that removing "subscribers" from the list removes the second sentence of the paywall language, but that's not ideal because the other half still remains.

The function is also inconsistent because it works on this piece of text (as it will remove the paywall language), but not the one above.

Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription.

Answer from OCa:

This method avoids for loops by:

  • first splitting text into phrases (the list of sentences),
  • then applying a regex filter containing all keywords at once,
  • finally reconstituting text without the sentences found to contain at least one of the keywords.

In its current state, this method ignores the bold formatting and uses the simple str.split() rather than the regex re.split() or nltk, which is why it fails to split at the '…' ellipsis, a single character (U+2026) rather than three dots.
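A quick check of that limitation, as a minimal sketch on a shortened version of the sample (the variable names `sample` and `parts` are only illustrative):

```python
# Shortened sample: teaser text, an ellipsis, then the paywall notice.
sample = "joined that group by winning… Premium Content is available to subscribers only."

# The ellipsis is one code point (U+2026), not three periods.
print("…" == "\u2026")   # True
print(len("…"))          # 1

# Splitting on '.' never sees the ellipsis, so the teaser and the
# paywall notice stay glued together in a single piece.
parts = sample.split(".")
print(parts)
```

Because the teaser and the notice end up in the same piece, any keyword filter applied afterwards drops both of them together.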

With input:

import re

text = "Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage. The Richmond junior joined that group by winning… Premium Content is available to subscribers only. Please login here to access content or go here to purchase a subscription."
paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

Form pattern for filter:

patt = re.compile('|'.join(['.*' + k for k in paywall_keywords]))

'.*login|.*subscription|.*purchase a subscription|.*subscribers'

Split text by sentences:

phrases = text.split(sep='.')

['Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title',
 ' {{Elided}} is part of that percentage',
 ' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription',
 '']

Find hits:

found = list(filter(patt.match, phrases))

[' The Richmond junior joined that group by winning… Premium Content is available to subscribers only',
 ' Please login here to access content or go here to purchase a subscription']

Eliminate those and reform the text:

'.'.join([p for p in phrases if p not in found])

'Of the hundreds of thousands of high school wrestlers, only a small percentage know what it’s like to win a state title. {{Elided}} is part of that percentage.'
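The walkthrough above still drops the sentence ending in '…' along with the paywall notice. One possible way around that, sketched here under assumptions rather than taken from the original answer, is to split with re.split() on the ellipsis as well as on ordinary sentence punctuation. The function name `strip_paywall` and the sentence-boundary pattern are hypothetical choices, and the sample is shortened from the question's first text:

```python
import re

paywall_keywords = ["login", "subscription", "purchase a subscription", "subscribers"]

# One alternation for all keywords, with word boundaries as in the question.
# re.escape guards against regex metacharacters in a keyword.
keyword_patt = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in paywall_keywords) + r")\b",
    re.IGNORECASE,
)

# Split on whitespace that follows '.', '!', '?' or the single-character
# ellipsis; the lookbehind keeps the punctuation attached to its sentence.
sentence_end = re.compile(r"(?<=[.!?…])\s+")

def strip_paywall(text):
    """Return text without the sentences that contain a paywall keyword."""
    sentences = sentence_end.split(text)
    kept = [s for s in sentences if not keyword_patt.search(s)]
    return " ".join(kept).strip()

sample = (
    "Debbie Neff of Excelsior Springs has organized an… "
    "Premium Content is available to subscribers only. "
    "Please login here to access content or go here to purchase a subscription."
)
print(strip_paywall(sample))
```

With this split, the teaser ending in '…' becomes its own sentence and survives, which would also explain the empty string in the question: there the teaser and the notice were treated as one sentence and removed together. The boundary pattern is naive (it also splits after abbreviations such as "Dr."), but since the kept pieces are rejoined with spaces, that mostly only normalizes whitespace. For a DataFrame, the function could presumably be applied per row with Series.apply on the text column.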
