I am quite new to deep learning, but I am working on a little binary text classification experiment. I want to investigate what impact the training data size has on the metrics of the model (does a bigger dataset automatically lead to better accuracy etc.). So I wrote this code using bag of words and logistic regression. I know it is probably not the best, but I am actually happy with how it turned out, that it even works, and that it shows different metric scores depending on the data size :) However, while looking through my code again (because I have to write a short paper about it), I am now confused about how the model pipeline and the vectorizer get any training data to build the bag-of-words vocabulary from. It is my first time using a pipeline; before, I always passed the training data (train_text) explicitly to fit_transform on the vectorizer. I think I got it to how it is now by looking through lots of different examples and tutorials. So my question is: where does the model pipeline get the data for the bag-of-words features from?
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Store the file paths in a list so we can go over the list once
# instead of reading in each file on its own while repeating the same code.
# The number indicates the number of instances that are used for training;
# "all" includes 3244 instances.
data_files = ["data/ToxicTrainData_100.csv",
              "data/ToxicTrainData_250.csv",
              "data/ToxicTrainData_500.csv",
              "data/ToxicTrainData_1000.csv",
              "data/ToxicTrainData_1500.csv",
              "data/ToxicTrainData_2000.csv",
              "data/ToxicTrainData_2500.csv",
              "data/ToxicTrainData_3000.csv",
              "data/ToxicTrainData_all.csv"]
# Read in the training data from the list and concatenate them into a single DataFrame
dfs = []
for file_path in data_files:
    df = pd.read_csv(file_path, encoding="latin1")  # using latin1 encoding for German characters
    df["comment_text"] = df["comment_text"].str.lower()  # convert text to lowercase
    dfs.append(df)
train_df = pd.concat(dfs, ignore_index=True)
# Read in the test data
test_df = pd.read_csv("data/test.csv", encoding="latin1") # using latin1 encoding for German characters
test_df["comment_text"] = test_df["comment_text"].str.lower() # convert text to lowercase
# Extract the data and labels from the test files
test_text = test_df["comment_text"].values
test_labels = test_df["Sub1_Toxic"].values
# Use a pipeline for data preprocessing and model training
# Used here to chain multiple machine learning steps (bag of words and the
# logistic regression model) into a single object that can be
# used for training and prediction, and to
# simplify the code and make it more readable
# Use the CountVectorizer for the Bag of Words feature and logistic
# regression as the classifier model
model_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(max_features=5000, lowercase=True)),  # vocabulary size of 5000
    ("classifier", LogisticRegression(random_state=0, solver="liblinear", C=0.1))  # small C value for more regularization to prevent overfitting
])
for file_path in data_files:
    # Read in the data
    train_df = pd.read_csv(file_path, encoding="latin1")
    # Extract the data and labels from the train file
    train_text = train_df["comment_text"].values
    train_labels = train_df["Sub1_Toxic"].values
    # Evaluate the model using 5-fold cross-validation
    kfold = KFold(n_splits=5, random_state=0, shuffle=True)
    cv_scores = cross_val_score(model_pipeline, train_text, train_labels, cv=kfold)
    print("\nCross-validation scores:", cv_scores)
    print("Mean cross-validation score:", cv_scores.mean())
    # Fit the model
    model_pipeline.fit(train_text, train_labels)
    # Evaluate the model on the testing data
    test_predictions = model_pipeline.predict(test_text)
    test_accuracy = accuracy_score(test_labels, test_predictions)
    test_precision = precision_score(test_labels, test_predictions)
    test_recall = recall_score(test_labels, test_predictions)
    test_f1 = f1_score(test_labels, test_predictions)
    # Print the evaluation metrics for each file
    print("\nEvaluation for file:", file_path)
    print("Testing accuracy:", test_accuracy)
    print("Testing precision:", test_precision)
    print("Testing recall:", test_recall)
    print("Testing F1 score:", test_f1)
I used the coef_ attribute of the logistic regression model to get the feature importances, which correspond to the weights that the model assigns to each feature when making predictions, and I got a list of words back (example: gez 1.328956411398795, dumm 1.4224345261402154, framing 1.427425145503989). So I know it gets the words from somewhere, but I do not understand from where.
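For reference, this is roughly how I read out the words and their weights (a simplified sketch, the exact code I used may have looked slightly different; "vectorizer" and "classifier" are the step names from the pipeline above):

# assumes model_pipeline has already been fitted
vectorizer = model_pipeline.named_steps["vectorizer"]
classifier = model_pipeline.named_steps["classifier"]
feature_names = vectorizer.get_feature_names_out()  # the bag-of-words vocabulary (up to 5000 words)
weights = classifier.coef_[0]                        # one weight per vocabulary word
word_weights = sorted(zip(feature_names, weights), key=lambda pair: pair[1])
print(word_weights[-10:])  # words with the largest positive weights (strongest toxic indicators)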
At a high level, the way your pipeline works is that the CountVectorizer transforms the words/sentences into vectors. The logistic regression then takes those vectors as input and produces a prediction/classification.
Your pipeline gets the data for the bag of words from the train_text variable. The features for the bag of words are generated by the CountVectorizer when you call fit on the pipeline. You can read more about the CountVectorizer fit method in the scikit-learn documentation.
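To make that concrete, calling fit and predict on the pipeline is roughly equivalent to the manual steps you described doing before (a simplified sketch of what scikit-learn chains together internally, using the same parameters as your pipeline):

# Roughly what model_pipeline.fit(train_text, train_labels) does:
vectorizer = CountVectorizer(max_features=5000, lowercase=True)
X_train = vectorizer.fit_transform(train_text)   # the vocabulary is built from train_text here
classifier = LogisticRegression(random_state=0, solver="liblinear", C=0.1)
classifier.fit(X_train, train_labels)

# Roughly what model_pipeline.predict(test_text) does:
X_test = vectorizer.transform(test_text)         # reuses the vocabulary learned from train_text
test_predictions = classifier.predict(X_test)

The same thing happens inside cross_val_score: the pipeline is refitted for each fold, so the vocabulary is rebuilt from that fold's training portion of train_text.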