I have one set of features that I use to build labelling functions (set A) and another set of features that I use to train a sklearn classifier (set B).
The generative model will output a set of probabilistic labels, which I can use to train my classifier.
Do I need to add the features I used for the labelling functions (set A) to my classifier features (set B), or should I just train my classifier on set B with the generated labels?
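To make the two options concrete, here is a minimal sketch of the feature matrices being compared. All names (`X_a`, `X_b`, `probs`) and the random data are hypothetical stand-ins, not part of Snorkel or the tutorial; `probs` only imitates the shape of the label model output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 6

# Hypothetical set A: features consulted by the labelling functions.
X_a = rng.integers(0, 2, size=(n_samples, 3)).astype(float)
# Hypothetical set B: features intended for the end classifier.
X_b = rng.normal(size=(n_samples, 4))
# Stand-in for the probabilistic labels from the generative model
# (one row per sample, one column per class, rows sum to 1).
probs = rng.dirichlet([1.0, 1.0], size=n_samples)

# Option 1: train on set B only, using the generated labels as targets.
X_option_1 = X_b
# Option 2: concatenate set A and set B column-wise before training.
X_option_2 = np.hstack([X_a, X_b])

print(X_option_1.shape, X_option_2.shape)
```

In both options the training target is the same (`probs`); only the input matrix differs.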
I was referencing the Snorkel spam tutorial, and I did not see them use the features from the labelling functions to train the new classifier.
As seen in cell 47, featurization is done entirely using a CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())
And then it goes straight to fitting a Keras model:
# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])
keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
)
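Since my end model is a sklearn classifier rather than Keras, the equivalent step might look like the sketch below. It assumes the probabilistic labels are first rounded to hard labels with `argmax`, because sklearn's `LogisticRegression.fit` expects class labels rather than probability vectors. The data is synthetic and the names `X_train`, `probs_train`, and `preds_train` are placeholders, not the tutorial's variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the classifier features (set B).
X_train = rng.normal(size=(200, 5))

# Synthetic stand-in for label model output: P(class 1) grows with feature 0.
p1 = 1.0 / (1.0 + np.exp(-3.0 * X_train[:, 0]))
probs_train = np.column_stack([1.0 - p1, p1])

# Round probabilistic labels to hard labels (assumption: no abstains remain).
preds_train = probs_train.argmax(axis=1)

# Fit a plain logistic regression on the hard labels.
clf = LogisticRegression()
clf.fit(X_train, preds_train)

train_acc = clf.score(X_train, preds_train)
print(f"training accuracy against rounded labels: {train_acc:.3f}")
```

Rounding discards the confidence information in the probabilistic labels, which is one trade-off of using a classifier that only accepts hard labels.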
I asked the same question on the Snorkel GitHub page, and this is the response:
https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705