Train-test split with test dataset that contains no values in the target variable

I have a dataset like this in Python (just a small part of it; there are 20 features in total):

    State         representatives  employees  Score
    Alabama       4                3          5
    Rhode Island  7                4          2
    Maryland      6                8          3
    Texas         7                5          5
    Florida       6                5          2

The Score value is categorical; it can only take the values 1, 2, 3, 4, or 5.

I preprocessed the data and used LabelEncoder to encode the categorical features (such as State).
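For reference, that encoding step looks roughly like this (a minimal sketch; the DataFrame df below is just a stand-in for my real data):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Stand-in DataFrame with the columns shown above
    df = pd.DataFrame({
        "State": ["Alabama", "Rhode Island", "Maryland", "Texas", "Florida"],
        "representatives": [4, 7, 6, 7, 6],
        "employees": [3, 4, 8, 5, 5],
        "Score": [5, 2, 3, 5, 2],
    })

    # Encode the categorical State column as integers
    le = LabelEncoder()
    df["State"] = le.fit_transform(df["State"])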

Now I want to do a train-test split as follows: all rows with a Score value should go into the training set, and all rows with an "NA" in the Score column should go into the test set.
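That split looks roughly like this (a sketch, assuming the data is in a pandas DataFrame df whose Score column contains NAs for the unknown rows):

    # Rows with a known Score form the training set,
    # rows with NA in the Score column form the "test" set
    train_df = df[df["Score"].notna()]
    test_df = df[df["Score"].isna()]

    X_train = train_df.drop(columns="Score")
    y_train = train_df["Score"]
    X_test = test_df.drop(columns="Score")
    # Note: the test rows have no Score values,
    # so y_test contains nothing but NAs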

I used RandomForestClassifier to find the n most important features that I will use afterwards.
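That feature selection step looks roughly like this (a sketch; n_features_to_select is the value I vary from 1 to 20, and X_train/X_test come from the split above):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Fit a forest and rank features by impurity-based importance
    selector = RandomForestClassifier(random_state=42)
    selector.fit(X_train, y_train)

    # Keep the indices of the n most important features
    n_features_to_select = 5  # example value
    top_idx = np.argsort(selector.feature_importances_)[::-1][:n_features_to_select]
    X_train_subset = X_train.iloc[:, top_idx]
    X_test_subset = X_test.iloc[:, top_idx]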

Then I trained KNeighborsClassifier and RandomForestClassifier models on the selected features.

But I get pretty low scores (about 0.5) for these models when I do cross-validation. Here is my code:

    # Imports used in this snippet
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import KFold, cross_val_score

    ### 5) Check models' performances
    clf = RandomForestClassifier(max_depth=best_max_depth, random_state=42)
    clf.fit(X_train_subset, y_train)

    knn = KNeighborsClassifier(n_neighbors=best_k_value)
    knn.fit(X_train_subset, y_train)
    
    knn_test_score = knn.score(X_test_subset, y_test)
    clf_test_score = clf.score(X_test_subset, y_test)
    
    knn_train_score = knn.score(X_train_subset, y_train)
    clf_train_score = clf.score(X_train_subset, y_train)

    print(f"TEST data - kNN Score for {n_features_to_select} selected features: {knn_test_score:.3f}")
    print(f"TEST data - RF Score for {n_features_to_select} selected features: {clf_test_score:.3f}")

    print(f"TRAINING data - kNN Score for {n_features_to_select} selected features: {knn_train_score:.3f}")
    print(f"TRAINING data - RF Score for {n_features_to_select} selected features: {clf_train_score:.3f}")
    print("*" * 70)
     
    
    # Perform Cross Validation to avoid overfitting
    # Source: https://scikit-learn.org/stable/modules/cross_validation.html
    
    # Define the number of folds for cross-validation
    # Smaller values for n_folds mean larger validation sets ("test" sets out of the training data) and smaller training sets for each iteration -> more variability in the assessment
    # Higher values for n_folds mean smaller validation sets ("test" sets out of the training data) and larger training sets for each iteration -> lower variability in the assessment
    # Values between 5 and 10 are recommended
    # Source: https://machinelearningmastery.com/k-fold-cross-validation/
    n_folds = 10

    # Create a k-fold cross-validation iterator
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Initialize models
    clf = RandomForestClassifier(max_depth=best_max_depth, random_state=42)
    knn = KNeighborsClassifier(n_neighbors=best_k_value)

    # Perform k-fold cross-validation for each model
    clf_train_scores = cross_val_score(clf, X_train_subset, y_train, cv=kf)
    knn_train_scores = cross_val_score(knn, X_train_subset, y_train, cv=kf)

    clf_test_scores = cross_val_score(clf, X_test_subset, y_test, cv=kf)
    knn_test_scores = cross_val_score(knn, X_test_subset, y_test, cv=kf)

    # Print the mean and standard deviation of the cross-validation scores
    print(f"Random Forest Classifier (RF) Cross-Validation Scores for {n_features_to_select} selected features:")
    print(f"Mean RF Score: {round(clf_train_scores.mean(), 3)}")
    print(f"Standard Deviation RF Score: {round(clf_train_scores.std(), 3)}")
    print("*" * 70)

    print(f"k-Nearest Neighbors (kNN) Cross-Validation Scores for {n_features_to_select} selected features:")
    print(f"Mean kNN Score: {round(knn_train_scores.mean(), 3)}")
    print(f"Standard Deviation kNN Score: {round(knn_train_scores.std(), 3)}")
    
    print("/" * 100)

This yields the following values for each number of selected features (the results are similar for every number of selected features, which ranges from 1 to 20):

[Results screenshot]

I don't understand why the values for "RF Score (train)" are so high, but the cross-validation scores are not.

Could you please help me figure out what I'm doing wrong here?

Are the predictions that bad because there is no "true" y, given that y_test consists entirely of NAs?
