I am running 5-fold cross-validation on a dataset using TPOT from a Jupyter notebook:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
from tpot import TPOTClassifier

scores = []
preds = []
actual_labels = []
# Initialise the 5-fold cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X, y):
    # Generate the training and test partitions of X and y for each CV iteration
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # TPOT is an AutoML system that automatically searches for the best pipeline,
    # so it does its own tuning rather than using grid search
    estimator = TPOTClassifier(generations=5, population_size=50, cv=5,
                               random_state=42, verbosity=2, n_jobs=10)
    estimator.fit(X_train, y_train)
    # Predict the test data with the optimised pipeline
    predictions = estimator.predict(X_test)
    score = metrics.f1_score(y_test, predictions)
    scores.append(score)
    # Extract the probabilities of the 2nd class, which we will use to generate the PR curve
    probs = estimator.predict_proba(X_test)[:, 1]
    preds.extend(probs)
    actual_labels.extend(y_test)
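After the loop, the pooled scores and labels feed into the PR curve. A minimal sketch of that step, using hypothetical stand-in values for `preds` and `actual_labels` (in the real run these come from the CV loop above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical stand-ins for the pooled cross-validation outputs.
actual_labels = np.array([0, 0, 1, 1, 0, 1, 1, 0])
preds = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

# Build the precision-recall curve from the pooled scores and
# summarise it with the area under the curve.
precision, recall, thresholds = precision_recall_curve(actual_labels, preds)
pr_auc = auc(recall, precision)
```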
In one of the 5 runs the best pipeline is:
Best pipeline: SGDClassifier(ZeroCount(input_matrix), alpha=0.001, eta0=0.01, fit_intercept=False, l1_ratio=0.0, learning_rate=invscaling, loss=squared_hinge, penalty=elasticnet, power_t=1.0)
Because the loss is 'squared_hinge', the fitted pipeline has no predict_proba() method and the whole process falls over. If I were building the classifier by hand, I understand I'd need to change the loss to e.g. 'modified_huber', but how can I stop TPOT from falling over because of this?
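The failure itself can be reproduced with plain scikit-learn, and one workaround I've considered is guarding the call and falling back to decision_function scores (uncalibrated margins, not probabilities, but still a valid ranking for a PR curve). This is only a sketch, and I'd prefer TPOT not to pick such pipelines in the first place:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=42)

# With loss='squared_hinge', SGDClassifier exposes no predict_proba,
# which mirrors the TPOT best pipeline that breaks my loop.
clf = SGDClassifier(loss="squared_hinge", random_state=42).fit(X, y)

if hasattr(clf, "predict_proba"):
    probs = clf.predict_proba(X)[:, 1]
else:
    # decision_function returns signed margins; they rank examples
    # the same way probabilities would, so the PR curve still works.
    probs = clf.decision_function(X)
```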
