How to iterate over a Keras Dataset and edit its content


I am working on the movie review classification problem from this tutorial: https://www.tensorflow.org/tutorials/keras/text_classification

In this example, text files (12500 files with movie reviews) are read and a batched dataset is prepared as shown below:

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)

At the time of standardization:

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # I WANT TO REMOVE STOP WORDS HERE, CAN I DO THAT?
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

Problem: I understand that I now have the training dataset with labels in the variable 'raw_train_ds'. I want to iterate over this dataset, remove the stop words from the movie review text, and store the result back in the same variable. I tried to do this in the function 'custom_standardization', but it gives a type error.

I also tried tf.strings.as_string, but it returns the error: InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: int8, int16, int32, int64

Can someone please help with this, or simply show how to remove stop words from the batched dataset?

1 Answer

It looks like TensorFlow does not currently have built-in support for stop word removal, just basic standardization (lowercasing and punctuation stripping). The TextVectorization layer used in the tutorial accepts a custom standardization callback, but I couldn't find any stop word examples for it.
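The type error in the question usually means plain Python string handling leaked into the callback: inside custom_standardization everything has to stay in graph-friendly tf.strings ops. If you want to try that route anyway, a rough sketch could look like the following (building one regex alternation from NLTK's stop word list is my own assumption here, not something the tutorial does):

import re
import string

import tensorflow as tf
from nltk.corpus import stopwords

# Assumption: one big regex alternation over NLTK's English stop words,
# matched on word boundaries so only whole words are dropped.
stop_pattern = r'\b(%s)\b' % '|'.join(
    re.escape(w) for w in stopwords.words("english"))

def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # Drop stop words before stripping punctuation, so contractions like "don't" still match
  no_stop_words = tf.strings.regex_replace(stripped_html, stop_pattern, ' ')
  return tf.strings.regex_replace(
      no_stop_words, '[%s]' % re.escape(string.punctuation), '')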

Alternatively, since the tutorial downloads the IMDB dataset and reads the text files from disk, you can do the standardization manually with Python before reading them. This modifies the text files themselves, but afterwards you can read the files in as usual with tf.keras.preprocessing.text_dataset_from_directory, and the entries will already have the stop words removed.

#!/usr/bin/env python3

import pathlib
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
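# Note: stopwords.words("english") below requires a one-time nltk.download("stopwords").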

stop_words = set(stopwords.words("english"))


def cleanup_text_files_in_folder(folder_name):
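    """Clean every *.txt file directly under folder_name in place; returns the list of file paths."""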
    text_files = []

    for file_path in pathlib.Path(folder_name).glob('*.txt'):
        text_files.append(str(file_path))

    print(f'Found {len(text_files)} files in {folder_name}')

    # Print a simple progress update every 1000 files
    for i, text_file in enumerate(text_files, start=1):
        replace_file_contents(text_file)

        if i % 1000 == 0:
            print("No of files processed =", i)

    return text_files


def replace_file_contents(input_file):
    """
    This will read in the contents of the text file, process it (clean up, remove stop words)
    and overwrite the new 'processed' output to that same file

    """
    with open(input_file, 'r') as file:
        file_data = file.read()

    file_data = process_text_adv(file_data)

    with open(input_file, 'w') as file:
        file.write(file_data)


def process_text_adv(text):
    # review without HTML tags
    text = BeautifulSoup(text, features="html.parser").get_text()

    # review without punctuation (re.UNICODE must go in the flags argument, not count)
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)

    # lowercase
    text = text.lower()

    # simple split
    text = text.split()

    # remove stop words (reuse the module-level set for fast membership checks)
    text = [w for w in text if w not in stop_words]

    # join the remaining words back into a single string and return it
    return " ".join(text)


if __name__ == "__main__":
    # Download & untar dataset beforehand, then running this would modify the text files
    # in place. Back up the originals if that's a concern.
    cleanup_text_files_in_folder('aclImdb/train/pos/')
    cleanup_text_files_in_folder('aclImdb/train/neg/')
    cleanup_text_files_in_folder('aclImdb/test/pos/')
    cleanup_text_files_in_folder('aclImdb/test/neg/')
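
After the cleanup has run, the files can be read back exactly as in the tutorial. A quick spot check might look like this (a sketch; batch_size=32 and seed=42 are the tutorial's values, so adjust them to match your own setup):

import tensorflow as tf

# Sketch: reload the cleaned files the same way the tutorial does and print one
# review to confirm the stop words are gone. The tutorial also deletes the
# unlabeled 'aclImdb/train/unsup' folder before this step.
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42)

for text_batch, label_batch in raw_train_ds.take(1):
    print(label_batch.numpy()[0], text_batch.numpy()[0][:200])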