I am working on this movie review classification problem: https://www.tensorflow.org/tutorials/keras/text_classification
In this example, the text files (12500 files with movie reviews) are read and a batched dataset is prepared as follows:
    raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
        'aclImdb/train',
        batch_size=batch_size,
        validation_split=0.2,
        subset='training',
        seed=seed)
The standardization step looks like this:
    def custom_standardization(input_data):
        lowercase = tf.strings.lower(input_data)
        stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
        # I WANT TO REMOVE STOP WORDS HERE, CAN I DO
        return tf.strings.regex_replace(stripped_html,
                                        '[%s]' % re.escape(string.punctuation),
                                        '')
Problem: I understand that the variable 'raw_train_ds' holds the training dataset with labels. I want to iterate over this dataset, remove the stop words from the movie review text, and store the result back in the same variable. I tried to do this in the function 'custom_standardization', but it gives a type error.
I also tried to use tf.strings.as_string, but it returns this error:
InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: int8, int16, int32, int64
Can someone please help with this, or simply show how to remove stop words from the batched dataset?
It looks like TensorFlow does not currently have built-in support for stop word removal, only basic standardization (lowercasing and punctuation stripping). The TextVectorization layer used in the tutorial supports a custom standardization callback, but I couldn't find any stop word examples.
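For what it's worth, here is a minimal sketch of what such a callback might look like. Inside the callback, input_data is a string tensor rather than a Python str, which is probably where the type error comes from, so the removal has to go through tf.strings ops. The idea is to build one regex alternation from the stop word list and pass it to tf.strings.regex_replace. STOP_WORDS (a tiny sample list) and build_stop_word_pattern are names made up for this illustration; they are not part of TensorFlow or the tutorial.

```python
import re
import string

# Hypothetical sample list; substitute a fuller stop word list of your own.
STOP_WORDS = ("a", "an", "and", "in", "is", "it", "of", "the", "to")


def build_stop_word_pattern(stop_words):
    """Return a regex that matches any stop word as a whole word,
    together with the whitespace that follows it."""
    alternatives = "|".join(re.escape(word) for word in stop_words)
    return r"\b(?:" + alternatives + r")\b\s*"


STOP_WORD_PATTERN = build_stop_word_pattern(STOP_WORDS)


def custom_standardization(input_data):
    # Imported here only so the pattern helper above can be tried out
    # without TensorFlow installed; a module-level import works just as well.
    import tensorflow as tf

    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # Drop the stop words with a tensor op instead of Python string methods.
    no_stop_words = tf.strings.regex_replace(stripped_html, STOP_WORD_PATTERN, "")
    return tf.strings.regex_replace(
        no_stop_words, "[%s]" % re.escape(string.punctuation), ""
    )
```

The function can then be passed as the standardize argument of TextVectorization, the same way the tutorial passes its own custom_standardization. Stop words are removed before punctuation is stripped, so a word followed directly by a period or comma is still matched at its word boundary.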
Since the tutorial downloads the IMDB dataset and reads the text files from disk, you can do the stop word removal manually in Python before reading them. This modifies the text files themselves, but you can then read them in normally using tf.keras.preprocessing.text_dataset_from_directory, and the entries will already have the stop words removed.
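A rough sketch of that preprocessing step, using only the standard library, could look like the following. STOP_WORDS is again a tiny made-up sample list, and remove_stop_words and clean_directory are hypothetical helper names. The class_dirs default of ("pos", "neg") assumes the usual aclImdb layout with one subdirectory per label, so adjust it if your directory looks different.

```python
import re
from pathlib import Path

# Hypothetical sample list; substitute a fuller stop word list of your own.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "it", "of", "the", "to"})


def compile_stop_word_regex(stop_words=STOP_WORDS):
    """Compile a case-insensitive regex matching any stop word as a whole word."""
    if not stop_words:
        raise ValueError("stop_words must not be empty")
    alternatives = "|".join(re.escape(word) for word in sorted(stop_words))
    return re.compile(r"\b(?:" + alternatives + r")\b\s*", flags=re.IGNORECASE)


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with every stop word (and its trailing whitespace) removed."""
    return compile_stop_word_regex(stop_words).sub("", text)


def clean_directory(root, class_dirs=("pos", "neg"), stop_words=STOP_WORDS):
    """Rewrite every .txt file under root/<class_dir> in place, without stop words.

    Returns the number of files rewritten.
    """
    regex = compile_stop_word_regex(stop_words)
    cleaned = 0
    for class_dir in class_dirs:
        for path in sorted((Path(root) / class_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            path.write_text(regex.sub("", text), encoding="utf-8")
            cleaned += 1
    return cleaned
```

Calling clean_directory('aclImdb/train') once, before text_dataset_from_directory, should then be enough. Because the files are overwritten, it is safer to run this on a copy of the dataset. Matching on word boundaries leaves markup such as '<br />' untouched, so the tutorial's own standardization still works afterwards.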