Ordinal Target Variable Prediction in Python

I am trying to put together an ML pipeline in Python (using Sklearn, open to alternative package suggestions) where I have 5 categorical feature variables, 2 continuous feature variables, and an ordinal target variable with the following value counts:

0.0    35063
1.0     1073
2.0      496
3.0       52
4.0       13
5.0        4
6.0        2

As you might have already caught, the trick here is that roughly 95% of the target values carry the 0.0 label. I have put together a pipeline that one-hot encodes the categorical feature variables and standard-scales the continuous feature variables.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
continuous_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', continuous_transformer, continuous_features)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

And later applying the following split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Using sklearn.metrics' accuracy_score, it appears that I achieve 94% overall accuracy, which looks great at first glance. However, I am worried that, given the skew in the target variable, this model becomes prone to fitting problems. I would really appreciate some insight here.

Thanks all!

Accepted answer:

Consider the following points:

  • Class imbalance: A naïve classifier that always predicts the majority class would be correct about 95.5% of the time (35063 of 36703 samples), so your 94% accuracy is actually below that baseline. Explore methods to manage the target class imbalance, such as undersampling, oversampling, or class weighting.

  • Classifier for an ordinal target: The RandomForestClassifier does not account for the ordinal nature of the target variable. For algorithms better suited to ordinal targets, refer to this discussion: Multi-class, multi-label, ordinal classification with sklearn

  • Metric: As indicated, accuracy_score may not be the optimal metric for your scenario. A high accuracy_score does not guarantee a useful classifier. Furthermore, it disregards the ordinal nature of your target variable. For example, an accuracy_score treats predicting 0.0 instead of 6.0 the same as predicting 5.0 instead of 6.0. Investigate metrics that more accurately reflect the cost of misclassifying an ordinal target: Measures of ordinal classification error for ordinal regression

  • Splitting: Note that, contrary to what is sometimes assumed, train_test_split does not stratify by default; pass stratify=y so that each split contains roughly the same proportion of each class. With classes as rare as yours (2–4 examples), an unstratified split can leave a class out of the test set entirely. When implementing cross-validation, make sure to use a stratified approach as well, for example with StratifiedKFold.
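On the imbalance point, one low-effort option that stays inside scikit-learn (a sketch on fabricated data; packages such as imbalanced-learn offer explicit over/undersamplers) is to reweight classes inside the forest itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data mimicking the ~95% majority-class skew from the question
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority class
X[y == 1] += 2.0                           # shift so the minority is learnable

# class_weight='balanced' scales sample weights inversely to class
# frequency, so errors on rare classes cost more during training
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X, y)
```

This avoids discarding data (undersampling) or duplicating rare rows (oversampling), at the cost of less direct control over the effective class ratio.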
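For the ordinal point, one widely used sklearn-compatible trick is a Frank & Hall-style threshold decomposition: fit one binary model per cut-point estimating P(y > k), then recover per-class probabilities by differencing. The OrdinalClassifier class below is my own sketch, not a library API:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class OrdinalClassifier:
    """Frank & Hall decomposition: one binary model per cut-point P(y > k)."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        # One binary problem per threshold between consecutive classes
        self.models_ = [clone(self.base_estimator).fit(X, (y > k).astype(int))
                        for k in self.classes_[:-1]]
        return self

    def predict_proba(self, X):
        # cum[:, i] estimates P(y > classes_[i])
        cum = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        first = 1.0 - cum[:, [0]]
        middle = (cum[:, :-1] - cum[:, 1:]) if cum.shape[1] > 1 else np.empty((len(X), 0))
        last = cum[:, [-1]]
        return np.hstack([first, middle, last])  # rows telescope to sum to 1

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Toy ordinal data with labels 0..3
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.clip(np.round(X[:, 0] + 1.5), 0, 3).astype(int)

oc = OrdinalClassifier(LogisticRegression()).fit(X, y)
proba = oc.predict_proba(X)
```

Any base estimator with predict_proba works; the base model could just as well be the RandomForestClassifier from the question.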
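On the metric point, sklearn already ships ordinal-friendlier options: mean_absolute_error on the integer labels penalises distant mistakes more heavily, and cohen_kappa_score with weights='quadratic' is a common ordinal agreement measure. A small illustration on made-up labels:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error

y_true       = [0, 0, 0, 1, 2, 6]
y_pred_close = [0, 0, 0, 1, 2, 5]  # misses the 6 by one step
y_pred_far   = [0, 0, 0, 1, 2, 0]  # misses the 6 by six steps

# Accuracy cannot tell the two mistakes apart...
assert accuracy_score(y_true, y_pred_close) == accuracy_score(y_true, y_pred_far)

# ...but MAE on the labels penalises the distant miss more
mae_close = mean_absolute_error(y_true, y_pred_close)  # 1/6
mae_far = mean_absolute_error(y_true, y_pred_far)      # 6/6 = 1.0
assert mae_close < mae_far

# Quadratic-weighted kappa is another common ordinal choice
kappa = cohen_kappa_score(y_true, y_pred_close, weights='quadratic')
```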
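And on splitting, a sketch of both stratified tools (make_classification here just fabricates a skewed toy dataset for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Fabricated imbalanced data: ~95% of samples in class 0
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)

# Without stratify=y the split is random, and a class with only a
# handful of examples can vanish from the test set entirely
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# For cross-validation, StratifiedKFold keeps proportions per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

After a stratified split, y_train.mean() and y_test.mean() are near-identical, which is exactly the property an unstratified split cannot guarantee for rare classes.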

With a little bit of research on these points I am certain you will find a good solution to your problem.