I’m trying to preprocess a data frame with two columns, "title" and "body"; each cell contains a string.
Based on this article I tried to reproduce the preprocessing. However, there is clearly something I am not getting right: the order in which to apply each step, and the type that each function expects. I keep getting errors like AttributeError: 'list' object has no attribute ..., or AttributeError: 'str' object has no attribute ..., and so on.
Here is what I have done:
import re
import string

import contractions
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob
from unidecode import unidecode

def lemmatize_pos_tagged_text(text, lemmatizer, pos_tag_dict):
    sentences = nltk.sent_tokenize(text)
    new_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        new_sentence_words = []
        pos_tuples = nltk.pos_tag(nltk.word_tokenize(sentence))
        for word, nltk_word_pos in pos_tuples:
            # Map the Penn Treebank tag (e.g. "NN") to a WordNet POS via its first letter
            wordnet_word_pos = pos_tag_dict.get(nltk_word_pos[0].upper(), None)
            if wordnet_word_pos is not None:
                new_word = lemmatizer.lemmatize(word, wordnet_word_pos)
            else:
                new_word = lemmatizer.lemmatize(word)
            new_sentence_words.append(new_word)
        new_sentence = " ".join(new_sentence_words)
        new_sentences.append(new_sentence)
    return " ".join(new_sentences)
def processing_steps(df):
    lemmatizer = WordNetLemmatizer()
    pos_tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    local_stopwords = set(stopwords.words('english'))
    additional_stopwords = ["http", "u", "get", "like", "let", "nan"]
    words_to_keep = ["i", "me", "my", "we", "our", "us"]
    local_stopwords.update(additional_stopwords)
    for word in words_to_keep:
        local_stopwords.discard(word)

    for column in df.columns:
        # Tokenization
        df[column] = df[column].apply(lambda x: word_tokenize(x))
        # Lowercasing each word within the list
        df[column] = df[column].apply(lambda x: [word.lower() for word in x])
        # Removing stopwords
        df[column] = df[column].apply(lambda tokens: [word for word in tokens if word.isalpha() and word not in local_stopwords])
        # Replace diacritics
        df[column] = df[column].apply(lambda x: unidecode(x, errors="preserve"))
        # Expand contractions
        df[column] = df[column].apply(lambda x: " ".join([contractions.fix(expanded_word) for expanded_word in x.split()]))
        # Remove numbers
        df[column] = df[column].apply(lambda x: re.sub(r'\d+', '', x))
        # Typos correction
        df[column] = df[column].apply(lambda x: str(TextBlob(x).correct()))
        # Remove punctuation except period
        df[column] = df[column].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation.replace('.', '')), '', x))
        # Remove double space
        df[column] = df[column].apply(lambda x: re.sub(' +', ' ', x))
        # Lemmatization
        df[column] = df[column].apply(lambda x: lemmatize_pos_tagged_text(x, lemmatizer, pos_tag_dict))
    return df
As an example, here is the error message I get with the current state of the function. But keep in mind that whenever I try to change things, like commenting out the tokenization step, I just get another TypeError or AttributeError somewhere else. So the question really is: what is the proper order? And how do I handle the fact that different functions need different types when processing the same element?
     49
     50     # Expand contractions
---> 51     df[column] = df[column].apply(lambda x: " ".join([contractions.fix(expanded_word) for expanded_word in x.split()]))
     52
     53     # Remove numbers

AttributeError: 'list' object has no attribute 'split'
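The mismatch itself is easy to reproduce on toy data (a hypothetical example, not my real frame):

import pandas as pd

s = pd.Series(["a short text"])
s = s.apply(lambda x: x.split())   # each cell is now a list of tokens
s.apply(lambda x: x.split())       # AttributeError: 'list' object has no attribute 'split'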
Any conceptual explanation is very welcome!
I got it: the issue was in the first few lines of processing_steps(). I am tokenizing the elements, which turns each cell into a list of words, and then passing that list to functions that expect a string, not a list.
So I just had to make the later steps iterate through the list in each cell with a list comprehension, i.e. by adding … for word in x. Here is the completed function, with some other adjustments as well (a sketch along those lines; the key changes are the per-word comprehensions and joining back into a string before lemmatization):
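def processing_steps(df):
    lemmatizer = WordNetLemmatizer()
    pos_tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    local_stopwords = set(stopwords.words('english'))
    local_stopwords.update(["http", "u", "get", "like", "let", "nan"])
    for word in ["i", "me", "my", "we", "our", "us"]:
        local_stopwords.discard(word)

    for column in df.columns:
        # Tokenization: from here on, each cell is a list of words
        df[column] = df[column].apply(word_tokenize)
        # Lowercasing each word within the list
        df[column] = df[column].apply(lambda x: [word.lower() for word in x])
        # Removing stopwords
        df[column] = df[column].apply(lambda x: [word for word in x if word.isalpha() and word not in local_stopwords])
        # Replace diacritics -- per word now, via "for word in x"
        df[column] = df[column].apply(lambda x: [unidecode(word, errors="preserve") for word in x])
        # Expand contractions, per word
        df[column] = df[column].apply(lambda x: [contractions.fix(word) for word in x])
        # Remove numbers, per word
        df[column] = df[column].apply(lambda x: [re.sub(r'\d+', '', word) for word in x])
        # Typos correction, per word
        df[column] = df[column].apply(lambda x: [str(TextBlob(word).correct()) for word in x])
        # Remove punctuation except period, per word
        df[column] = df[column].apply(lambda x: [re.sub('[%s]' % re.escape(string.punctuation.replace('.', '')), '', word) for word in x])
        # Join back into a single string, because the lemmatizer expects raw
        # text; the re.sub squashes the double spaces left by emptied tokens
        df[column] = df[column].apply(lambda x: re.sub(' +', ' ', " ".join(x)))
        # Lemmatization
        df[column] = df[column].apply(lambda x: lemmatize_pos_tagged_text(x, lemmatizer, pos_tag_dict))
    return df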