I have one set of features for building labelling functions (set A) and another set of features for training a sklearn classifier (set B).

The generative model will output a set of probabilistic labels which I can use to train my classifier.
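
For context, here is roughly the pipeline I mean. This is a minimal sketch rather than the tutorial's code: the toy dataframe and the lf_* functions stand in for my real data and labelling functions (which are built on set A features).

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

# Toy stand-in for my real training data
df_train = pd.DataFrame({"text": [
    "click http://spam.example to win now",
    "hi, see you at the meeting tomorrow",
    "free money http://spam.example",
    "thanks for the notes yesterday",
]})

@labeling_function()
def lf_contains_link(x):
    # Example LF built on a set A feature
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_free(x):
    return SPAM if "free" in x.text.lower() else ABSTAIN

# Apply the LFs to get the label matrix (one column per LF)
applier = PandasLFApplier(lfs=[lf_contains_link, lf_mentions_free])
L_train = applier.apply(df=df_train)

# The generative model combines the noisy LF votes...
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)

# ...into probabilistic labels used as training targets for the end classifier
probs_train = label_model.predict_proba(L=L_train)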

Do I need to add the features (set A) that I used for the labelling functions into my classifier features (set B), or should I just use the generated labels to train my classifier?

I was referencing the Snorkel spam tutorial, and I did not see them use the labelling-function features to train the new classifier.

As seen in cell 47, featurization is done entirely with a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Fit the vectorizer on the training split only, then reuse it for the other splits
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

And then it goes straight to fitting a Keras model:

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,  # probabilistic labels from the label model
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
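
For completeness, get_keras_logreg and get_keras_early_stopping come from the tutorial's utils module. A rough reconstruction of what they do (my sketch, not the verbatim tutorial code; the monitor key may differ by TensorFlow version):

import tensorflow as tf

def get_keras_logreg(input_dim, output_dim=2):
    # "Logistic regression" as a single dense softmax layer
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(output_dim, input_shape=(input_dim,), activation="softmax"))
    model.compile(optimizer="Adagrad", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def get_keras_early_stopping(patience=10):
    # Stop training once validation accuracy stops improving
    return tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=patience, restore_best_weights=True, verbose=1
    )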

2 Answers

Accepted answer:

I asked the same question on the Snorkel GitHub page, and this is the response:

"you do not need to add in the features (set A) that you used for LFs into the classifier features. In order to prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and end model (set A and set B) are as different as possible"

https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705
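
Following that advice, the end model sees only set B features plus labels derived from the label model. A minimal sklearn sketch with random stand-ins for the real set B feature matrix and label-model output (snorkel.utils.probs_to_preds collapses the probabilistic labels to hard ones, since sklearn's LogisticRegression does not accept soft targets):

import numpy as np
from sklearn.linear_model import LogisticRegression
from snorkel.utils import probs_to_preds

# Stand-ins: X_train_B is the set B feature matrix, probs_train the label model output
X_train_B = np.random.rand(100, 5)
probs_train = np.random.dirichlet(np.ones(2), size=100)

# Collapse soft labels to hard ones for sklearn
preds_train = probs_to_preds(probs=probs_train)

clf = LogisticRegression(C=1e3, solver="liblinear")
clf.fit(X_train_B, preds_train)  # note: no set A features in X_train_B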

Second answer:

From your linked Snorkel tutorial, the labelling functions (which map inputs to the labels "HAM", "SPAM", or "ABSTAIN") are used to provide labels, not features.

IIUC, the idea is to generate labels when you do not have good-quality human labels. Though these auto-generated labels are quite noisy, they can serve as the starting point for a labelled dataset. The learning process takes this dataset and learns a model that encodes the knowledge embedded in the labelling functions. Hopefully the resulting model is more general and can be applied to unseen data.

If some of these labelling functions (which can be viewed as fixed rules) are very stable (in terms of prediction accuracy) under certain conditions, then given enough training data your model should be able to learn that. However, in a production system, one easy way to guard against model instability is to override machine predictions with human labels on already-seen data. The same idea applies if you believe the labelling functions are reliable for specific input patterns: use them directly to produce labels that override the machine predictions. This can be implemented as a pre-check that runs before your machine-learned model, as sketched below.
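
A minimal sketch of such a pre-check, assuming the trusted labelling functions follow Snorkel's convention of returning -1 when they abstain (predict_with_override, trusted_lfs, and featurize are hypothetical names):

ABSTAIN = -1  # Snorkel's abstain convention

def predict_with_override(x, trusted_lfs, model, featurize):
    # Trusted rules run first and win whenever they fire
    for lf in trusted_lfs:
        label = lf(x)
        if label != ABSTAIN:
            return label
    # Otherwise defer to the machine-learned model
    return model.predict(featurize(x))[0]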