ada.staged_predict does not run for my specified number of trees

I am trying to test the performance of AdaBoost as I vary tree depth.

I have it looping as I change the depth of the trees. It is then supposed to go through 300 rounds of boosting. I know this might be overkill, but I did the same analysis for random forests with 500 trees and for XGBoost with 1000 trees. AdaBoost was just so much slower that I could not even run 500 trees. Now I get a problem even with 300 trees (boosting rounds): it stops boosting at a seemingly random round. It's not as if performance reaches some crazy high number (F1 or R2 sometimes fluctuates around 94). I also do not have any early stopping specified (I didn't think I could even specify it with this model).
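
To rule out a counting bug in my own loop, I checked how many estimators actually get fitted. A quick sanity check on throwaway synthetic data (make_classification is just stand-in data here, not my real set; estimators_ is scikit-learn's list of fitted stages):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data just to reproduce the setup
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

ada_demo = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=10),
                              n_estimators=300)
ada_demo.fit(X_demo, y_demo)

# estimators_ only holds the stages that were actually fitted, so its
# length can be smaller than n_estimators
print(len(ada_demo.estimators_), 'of', ada_demo.n_estimators, 'stages fitted')

On my real data this count comes back well under 300, which matches staged_predict running out early rather than my loop breaking.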

Here is my code


import time

import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def growClassifier(NUMTREES: int, DEPTH: int, X: pd.DataFrame, y: np.ndarray):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, shuffle=True)

    print(f'\nBuilding classification forest with {NUMTREES} trees each {DEPTH} deep\n')

    # Initialize the weak classifier
    base_estimator = DecisionTreeClassifier(max_depth=DEPTH)

    start_time = time.time()

    ada = AdaBoostClassifier(estimator=base_estimator, n_estimators=NUMTREES)
    ada.fit(X_train, y_train)

    elapsed_time = time.time() - start_time

    # staged_predict returns a generator that yields test-set predictions
    # after each boosting stage; it can yield fewer than NUMTREES stages
    # if fit() terminated boosting early
    staged_test_predictions = ada.staged_predict(X_test)
    

    f1_test, accuracy_test, precision_test, recall_test, buildtime_test = [], [], [], [], []
    # Iterate over staged predictions and evaluate performance at each stage
    for i, y_pred in enumerate(staged_test_predictions, start=1):
        # print(f'i:{i}\ny_pred:{y_pred}')
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_test.append(accuracy)
        
        precision = precision_score(y_test, y_pred, average="weighted", zero_division=0)
        precision_test.append(precision)

        recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
        recall_test.append(recall)

        f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
        f1_test.append(f1)

        # print(f'accuracy:{accuracy}')

    adaClsResults = pd.DataFrame()

    numTrees, treeDepth = [], [] 
    for x in range(1, NUMTREES+1, 1):
        # print(i, x)
        numTrees.append(x) 
        treeDepth.append(DEPTH)
        buildtime_test.append(elapsed_time)

    # If boosting made it past 40 rounds, zero-fill the metrics for the
    # rounds that never ran so every column still has NUMTREES entries
    if i > 40:
        while i < NUMTREES:
            accuracy_test.append(0)
            precision_test.append(0)
            recall_test.append(0)
            f1_test.append(0)
            buildtime_test[i] = 0
            i += 1


        adaClsResults['numTrees'] = numTrees
        adaClsResults['treeDepth'] = treeDepth
        adaClsResults['f1'] = f1_test
        adaClsResults['accuracy'] = accuracy_test
        adaClsResults['precision'] = precision_test
        adaClsResults['recall'] = recall_test
        adaClsResults['buildTime'] = buildtime_test
        return adaClsResults

    else:
        print(f'\n\nFailed: only boosted {i} times, not {NUMTREES} stages. Running again.\n\n')
        # Propagate the retry's result back to the caller
        return growClassifier(NUMTREES, DEPTH, X, y)

This was supposed to just keep running until it either hits my desired number of boosting rounds (300), or it gets past 40 rounds, at which point I autofill 0 for the metrics at every round after the last one, up to 300.
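
For clarity, that fallback boils down to this for the metric lists (an equivalent sketch of the while loop above; swapping in np.nan would mark the missing rounds instead of zeroing them):

import numpy as np

n_stages = len(accuracy_test)   # boosting rounds that actually ran
pad = NUMTREES - n_stages       # rounds that never happened
for metrics in (accuracy_test, precision_test, recall_test, f1_test):
    metrics += [0] * pad        # or [np.nan] * pad to flag missing rounds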

I've also tried changing algorithm to "SAMME" and "SAMME.R" but get the same thing. Actually, with SAMME, it stops even earlier.
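
From what I can tell, scikit-learn's AdaBoost stops adding stages on its own once a weak learner either fits the reweighted training sample perfectly (weighted error of 0) or does no better than chance, and I suspect that is why the stopping round varies from run to run (my train_test_split reshuffles every call). To check which case I'm hitting, I print the per-stage training errors right after ada.fit(...) inside the function; estimators_, estimator_errors_, and estimator_weights_ are documented attributes of the fitted classifier:

# Assumes `ada` is the fitted AdaBoostClassifier from growClassifier above
n_fitted = len(ada.estimators_)
print(f'fitted {n_fitted} of {ada.n_estimators} stages')

# Entries past the last fitted stage keep their initial values,
# so only the first n_fitted entries are meaningful
print('last training errors:', ada.estimator_errors_[:n_fitted][-5:])
print('last stage weights:  ', ada.estimator_weights_[:n_fitted][-5:])

# A final training error of zero means the last tree fit the reweighted
# sample perfectly, which ends boosting early; deep trees make that
# much more likely
if n_fitted and ada.estimator_errors_[n_fitted - 1] == 0:
    print('stopped early: a stage reached zero training error')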

I don't want to fill in 0's. I want it to boost 300 times.
