Why does LightGBM NOT overfit on the train set when there are more parameters than samples?


I have a dataset for binary classification; to keep things simple, I only use one feature.

  • I made sure the feature is unique per sample (no two samples share the same value for this feature, so it is possible to build a tree that classifies every training sample correctly)
  • I set regularization to (nearly) zero: 'lambda_l1': 0, 'lambda_l2': 0.1, 'min_gain_to_split': 0
  • I have 267 samples (number of positive: 98, number of negative: 169)
  • I increased the tree-size limits: 'num_leaves': 8, 'max_depth': 20, 'max_bin': 500
  • I allow plenty of boosting rounds: num_boost_round=5000

But I still only get AUC 0.869 on the train set. What am I missing?

I tried to play with all the available parameters, and I can't get the train-set AUC to 1.0. My full params dictionary is:

params = {
    'boosting': 'gbdt',
    'objective': 'binary',
    'metric': 'AUC',
    'num_leaves': 8,
    'max_depth': 20,
    'learning_rate': 0.001,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'verbose': 1,
    'is_unbalance': 'true',
    'max_bin': 500,
    'min_data_in_leaf': 1,
    'lambda_l1': 0,
    'lambda_l2': 0.1,
    'min_gain_to_split': 0
}
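For reference, a minimal sketch of this setup (using the params dictionary above, with synthetic stand-in values for the single feature, since the real data isn't shown):

import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: one feature with a unique value per sample
rng = np.random.default_rng(0)
X = rng.normal(size=(267, 1))
y = np.array([1] * 98 + [0] * 169)

train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=5000)

# Check the train-set AUC
print('train AUC:', roc_auc_score(y, booster.predict(X)))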

EDIT: with 3 features I am able to overfit the train set. However, that doesn't change the original question: the trees should have been able to overfit on a single feature easily.

1 Answer

Golden Lion:

I was able to achieve 77% accuracy on Yelp data, predicting a popular / non-popular binary target. Use GridSearchCV to find the best parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100, 'num_leaves': 500}

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

# Load the Yelp reviews and drop columns that aren't used below
#df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv')
df = pd.read_csv('yelp.csv', parse_dates=['date']).dropna()
df.drop(['business_id', 'type', 'review_id', 'user_id'], axis=1, inplace=True)
df.set_index('date', inplace=True)

df["popular"] = df.apply(lambda x: 1 if (x["stars"] >=3 and (x["funny"] >2 or x["cool"] >1 or x["useful"] > 1)) else 0, axis=1)
#print(df.head())

data=df["text"]
tokenizer=Tokenizer()
tokenizer.fit_on_texts(data)
tokenizer.fit_on_texts(data)

X=tokenizer.texts_to_sequences(data)
max_length = df["text"].str.len().max()

#print(max_length)

X=pad_sequences(X,maxlen=max_length)
target=df["popular"]
y=np.array(target)

# Plot the class balance (0 = non popular, 1 = popular)
plt.hist(y, bins=2)
plt.xticks([0.25, 0.75], ['non popular', 'popular'])
plt.show()


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

param_grid = {
    'max_depth': [7, 10, 15, 16],
    'num_leaves': [8, 16, 32, 500],
    'learning_rate': [.1, .2, .3, .4],
    'n_estimators': [100, 200, 300, 500],
    'max_bin':[100,200,400,500],
}

# Base params from the question (not used by the grid search below)
params = {
    'boosting': 'gbdt',
    'objective': 'binary',
    'metric': 'AUC',
    'num_leaves': 8,
    'max_depth': 20,
    'learning_rate': 0.001,
}

lgb_clf = LGBMClassifier(
    objective='binary',
    boosting_type='gbdt',
    metric='binary_logloss',
    n_jobs=1,
)

grid_search = GridSearchCV(estimator=lgb_clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

y_pred = grid_search.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap=plt.cm.Blues)
plt.show()
    
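Since the question is measured in AUC, the tuned model's test-set AUC can also be checked from its predicted probabilities, for example:

# AUC on the held-out set, using the predicted probability of the positive class
proba = grid_search.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, proba))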