I am quite new to deep learning, but I am working on a little binary text classification experiment. I want to investigate what impact the training data size has on the metrics of the model (does a bigger dataset automatically lead to better accuracy etc.). So I wrote this code using bag of words and logistic regression. I know it is probably not the best, but I am actually happy with how it turned out, that it even works, and that it shows different metric scores depending on the data size :) However, while looking through my code again (because I have to write a short paper about it), I am now confused about how the model pipeline and the vectorizer get any training data to build the bag-of-words vocabulary from. It is my first time using a pipeline; before, I always passed the training data (train_text) explicitly to fit_transform on the vectorizer. I think I got it to how it is now by looking through lots of different examples and tutorials. So my question is: where does the model pipeline get the data for the bag-of-words features from?
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Store the file paths in a list so we can go over the list once
# instead of reading in each file on its own while repeating the same code.
# The number indicates the number of instances that are used for training;
# "all" includes 3244 instances.
data_files = ["data/ToxicTrainData_100.csv",
              "data/ToxicTrainData_250.csv",
              "data/ToxicTrainData_500.csv",
              "data/ToxicTrainData_1000.csv",
              "data/ToxicTrainData_1500.csv",
              "data/ToxicTrainData_2000.csv",
              "data/ToxicTrainData_2500.csv",
              "data/ToxicTrainData_3000.csv",
              "data/ToxicTrainData_all.csv"]
# Read in the training data from the list and concatenate them into a single DataFrame
dfs = []
for file_path in data_files:
    df = pd.read_csv(file_path, encoding="latin1")  # using latin1 encoding for German characters
    df["comment_text"] = df["comment_text"].str.lower()  # convert text to lowercase
    dfs.append(df)
train_df = pd.concat(dfs, ignore_index=True)
# Read in the test data
test_df = pd.read_csv("data/test.csv", encoding="latin1") # using latin1 encoding for German characters
test_df["comment_text"] = test_df["comment_text"].str.lower() # convert text to lowercase
# Extract the data and labels from the test files
test_text = test_df["comment_text"].values
test_labels = test_df["Sub1_Toxic"].values
# Use a pipeline for data preprocessing and model training
# Used here to chain multiple machine learning steps (bag of words and the
# logistic regression model) into a single object that can be
# used for training and prediction, and to
# simplify the code and make it more readable
# Use the CountVectorizer for the Bag of Words feature and logistic
# regression as the classifier model
model_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(max_features=5000, lowercase=True)),  # vocabulary size of 5000
    ("classifier", LogisticRegression(random_state=0, solver="liblinear", C=0.1))  # small C value for more regularization to prevent overfitting
])
for file_path in data_files:
    # Read in the data
    train_df = pd.read_csv(file_path, encoding="latin1")
    # Extract the data and labels from the train file
    train_text = train_df["comment_text"].values
    train_labels = train_df["Sub1_Toxic"].values
    # Evaluate the model using 5-fold cross-validation
    kfold = KFold(n_splits=5, random_state=0, shuffle=True)
    cv_scores = cross_val_score(model_pipeline, train_text, train_labels, cv=kfold)
    print("\nCross-validation scores:", cv_scores)
    print("Mean cross-validation score:", cv_scores.mean())
    # Fit the model
    model_pipeline.fit(train_text, train_labels)
    # Evaluate the model on the testing data
    test_predictions = model_pipeline.predict(test_text)
    test_accuracy = accuracy_score(test_labels, test_predictions)
    test_precision = precision_score(test_labels, test_predictions)
    test_recall = recall_score(test_labels, test_predictions)
    test_f1 = f1_score(test_labels, test_predictions)
    # Print the evaluation metrics for each file
    print("\nEvaluation for file:", file_path)
    print("Testing accuracy:", test_accuracy)
    print("Testing precision:", test_precision)
    print("Testing recall:", test_recall)
    print("Testing F1 score:", test_f1)
I used the coef_ attribute of the logistic regression model to get the feature importances, which correspond to the weights that the model assigns to each feature when making predictions, and I got a list of words back (example: gez 1.328956411398795, dumm 1.4224345261402154, framing 1.427425145503989). So I know it gets the words from somewhere, but I do not understand from where.
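For reference, this is roughly how I read out the words and their weights (a simplified sketch, the exact code I used may have looked slightly different; "vectorizer" and "classifier" are the step names from the pipeline above):

# assumes model_pipeline has already been fitted
vectorizer = model_pipeline.named_steps["vectorizer"]
classifier = model_pipeline.named_steps["classifier"]
feature_names = vectorizer.get_feature_names_out()  # the bag-of-words vocabulary (up to 5000 words)
weights = classifier.coef_[0]                        # one weight per vocabulary word
word_weights = sorted(zip(feature_names, weights), key=lambda pair: pair[1])
print(word_weights[-10:])  # words with the largest positive weights (strongest toxic indicators)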
At a high level, the way your pipeline works is that the CountVectorizer transforms the words/sentences into vectors. The logistic regression then takes those vectors as input and produces a prediction/classification.
Your pipeline gets the data for the bag of words from the train_text variable. The features for the bag of words are generated by the CountVectorizer when you call fit on the pipeline. You can read more about the CountVectorizer fit method in the scikit-learn documentation.
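To make that concrete, calling fit and predict on the pipeline is roughly equivalent to the manual steps you described doing before (a simplified sketch of what scikit-learn chains together internally, using the same parameters as your pipeline):

# Roughly what model_pipeline.fit(train_text, train_labels) does:
vectorizer = CountVectorizer(max_features=5000, lowercase=True)
X_train = vectorizer.fit_transform(train_text)   # the vocabulary is built from train_text here
classifier = LogisticRegression(random_state=0, solver="liblinear", C=0.1)
classifier.fit(X_train, train_labels)

# Roughly what model_pipeline.predict(test_text) does:
X_test = vectorizer.transform(test_text)         # reuses the vocabulary learned from train_text
test_predictions = classifier.predict(X_test)

The same thing happens inside cross_val_score: the pipeline is refitted for each fold, so the vocabulary is rebuilt from that fold's training portion of train_text.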