Trouble performing stemming with CountVectorizer


I am trying to apply stemming and a CountVectorizer to the disaster tweets dataset from Kaggle (https://www.kaggle.com/datasets/vstepanenko/disaster-tweets/data). I dropped the keyword, location, and target columns. When I run the code below, I get this error: FileNotFoundError: [Errno 2] No such file or directory: 'id'. How do I fix this?

import re

import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

STEMMER = PorterStemmer()


# Use NLTK's PorterStemmer in a tokenizer function
def MY_STEMMER(str_input):
    # Replace everything except letters and hyphens with spaces, then split
    words = re.sub(r"[^A-Za-z\-]", " ", str_input).lower().split()
    words = [STEMMER.stem(word) for word in words]
    return words

## Create a CountVectorizer object that uses the stemmer as its tokenizer
MyCV1 = CountVectorizer(input="filename",
                        #stop_words='english',
                        tokenizer=MY_STEMMER,
                        lowercase=True)

## Call MyCV1 on the data
DTM1 = MyCV1.fit_transform(tweet)
## Get the column names
ColNames = MyCV1.get_feature_names_out()
print(ColNames)

## Convert the DTM to a DataFrame
MyDF1 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF1)