This is my code:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.metrics import precision_recall_fscore_support
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix, hstack
import os
import pandas as pd
# Linear classifier trained with SGD; loss='log' gives logistic regression
# and enables predict_proba
sgd_classifier = SGDClassifier(loss='log', penalty='elasticnet', max_iter=30, n_jobs=60, alpha=1e-6, l1_ratio=0.7, class_weight='balanced', random_state=0)

# Character 4-grams within word boundaries as TF-IDF features
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4,4), min_df=10)
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
print('TF-IDF number of features:', len(vectorizer.get_feature_names()))

# Scale each feature by its maximum absolute value (preserves sparsity)
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Inputs shape:', X_train.shape)

sgd_classifier.fit(X_train, y_train)
y_predicted = sgd_classifier.predict(X_test)
y_predicted_prob = sgd_classifier.predict_proba(X_test)

results_report = classification_report(y_test, y_predicted, labels=classes_trained, digits=2, output_dict=True)
df_results_report = pd.DataFrame.from_dict(results_report)
pd.set_option('display.max_rows', 300)
print(df_results_report.transpose())
X_text_train & X_text_test have shapes (2M, 2) and (100k, 2), respectively.
The first column contains descriptions of financial transactions; generally speaking, each description consists of 5-15 words. The second column is a categorical variable that simply holds the name of the bank related to the transaction.
I merge these two columns into one description, so X_text_train & X_text_test now have shapes (2M,) and (100k,), respectively.
Then I apply TF-IDF, so X_train & X_test have shapes (2M, 50k) and (100k, 50k), respectively.
What I observe is that when there is an unseen value in the second column (i.e. a new bank name in the merged description), the SGDClassifier returns very different and quite random predictions compared to what it would return if I had dropped the second column with the bank names entirely.
The same occurs if I apply TF-IDF only to the descriptions and keep the bank names separately as a categorical variable.
Why does this happen with SGDClassifier?
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
The interesting thing is that with TF-IDF the vocabulary is predetermined, so unseen values in the test set are basically not taken into account in the features at all (i.e. all the respective features simply have a value of 0), but the SGD still breaks.
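As a sanity check on that last point, here is a toy snippet (synthetic data, not my actual pipeline) showing that a document made only of unseen tokens maps to an all-zero TF-IDF row:

from sklearn.feature_extraction.text import TfidfVectorizer

# The vocabulary is frozen at fit time
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4))
vec.fit(["alpha bank payment", "beta bank transfer"])

# A document consisting only of unseen tokens has no active features
row = vec.transform(["zzzz qqqq"])
print(row.nnz)  # 0 -> the unseen bank name contributes nothing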
(I also posted this on scikit-learn's GitHub: https://github.com/scikit-learn/scikit-learn/issues/21906)
This I do not understand: "in scikit-learn, text vectorizers are not expected to accept 2D inputs. They expect an iterable of str objects (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit), so it's not possible for X_text_train to have a shape other than (n_documents,)."
This does not make any sense to me: "np.array([["a", "b"], ["c", "d"]], dtype=object).ravel() will return array(['a', 'b', 'c', 'd'], dtype=object). So this would generate 2 rows per original row in X_text_train. Maybe you wanted to do something like the following?"
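The code block that followed that comment is not reproduced above; presumably the suggestion was to join the two columns per row instead of ravelling them, along these lines (a hypothetical reconstruction, not the maintainer's exact code):

import numpy as np

# Illustrative 2-column array: [description, bank_name]
X_text = np.array([["coffee shop purchase", "alpha bank"],
                   ["atm withdrawal", "beta bank"]], dtype=object)

# ravel() flattens cell by cell: 2 documents per original row
print(X_text.ravel().shape)  # (4,)

# Joining per row keeps one document per transaction
merged = np.array([" ".join(row) for row in X_text], dtype=object)
print(merged.shape)  # (2,)
print(merged[0])     # coffee shop purchase alpha bank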
It's not really possible to answer your question precisely without having access to a minimal reproducible example with either minimal synthetic data or publicly available data.
You can answer the question by yourself by replacing SGDClassifier with LogisticRegression, which uses the non-stochastic LBFGS solver.
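A minimal sketch of that swap, carrying over the parameters from the question where they apply (exact settings are up to you):

from sklearn.linear_model import LogisticRegression

# LBFGS is a batch (non-stochastic) solver: if the unstable predictions
# persist with it, SGD's stochastic updates are not the cause
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000,
                             class_weight='balanced')
log_reg.fit(X_train, y_train)
y_predicted_lr = log_reg.predict(X_test)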