Why does LightGBM NOT overfit on the train set when there are more parameters than samples?


I have a dataset for binary classification; to keep things simple, I only use one feature.

  • I made sure the feature is unique per sample (no two samples share the same value for this feature, so it is possible to build a tree that classifies every training sample correctly)
  • I set regularization to (nearly) zero: 'lambda_l1': 0, 'lambda_l2': 0.1, 'min_gain_to_split': 0
  • I have 267 samples (number of positive: 98, number of negative: 169)
  • I increased the tree-size limits: 'num_leaves': 8, 'max_depth': 20, 'max_bin': 500
  • I allow plenty of boosting rounds: num_boost_round=5000

But I still only get AUC 0.869 on the train set. What am I missing?

I tried to play with all the available parameters, and I can't get the train-set AUC to 1.0. My full params dictionary is:

params = {
    'boosting': 'gbdt',
    'objective': 'binary',
    'metric': 'AUC',
    'num_leaves': 8,
    'max_depth': 20,
    'learning_rate': 0.001,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'verbose': 1,
    'is_unbalance': 'true',
    'max_bin': 500,
    'min_data_in_leaf': 1,
    'lambda_l1': 0,
    'lambda_l2': 0.1,
    'min_gain_to_split': 0
}
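For reference, a minimal sketch of this setup (using the params dictionary above, with synthetic stand-in values for the single feature, since the real data isn't shown):

import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: one feature with a unique value per sample
rng = np.random.default_rng(0)
X = rng.normal(size=(267, 1))
y = np.array([1] * 98 + [0] * 169)

train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=5000)

# Check the train-set AUC
print('train AUC:', roc_auc_score(y, booster.predict(X)))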

EDIT: with 3 features I am able to overfit the train set. However, that doesn't change the original question: the trees should have been able to overfit on a single feature easily.

1 Answer

Golden Lion:

I was able to achieve 77% accuracy on Yelp data, predicting a popular / non-popular binary target. Use GridSearchCV to find the best parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100, 'num_leaves': 500}

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

# Load the Yelp reviews and drop columns that aren't used below
#df=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv')
df = pd.read_csv('yelp.csv', parse_dates=['date']).dropna()
df.drop(['business_id', 'type', 'review_id', 'user_id'], axis=1, inplace=True)
df.set_index('date', inplace=True)

df["popular"] = df.apply(lambda x: 1 if (x["stars"] >=3 and (x["funny"] >2 or x["cool"] >1 or x["useful"] > 1)) else 0, axis=1)
#print(df.head())

data=df["text"]
tokenizer=Tokenizer()
tokenizer.fit_on_texts(data)
tokenizer.fit_on_texts(data)

X=tokenizer.texts_to_sequences(data)
max_length = df["text"].str.len().max()

#print(max_length)

X=pad_sequences(X,maxlen=max_length)
target=df["popular"]
y=np.array(target)

# Plot the class balance (0 = non popular, 1 = popular)
plt.hist(y, bins=2)
plt.xticks([0.25, 0.75], ['non popular', 'popular'])
plt.show()


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

param_grid = {
    'max_depth': [7, 10, 15, 16],
    'num_leaves': [8, 16, 32, 500],
    'learning_rate': [.1, .2, .3, .4],
    'n_estimators': [100, 200, 300, 500],
    'max_bin':[100,200,400,500],
}

# Base params from the question (not used by the grid search below)
params = {
    'boosting': 'gbdt',
    'objective': 'binary',
    'metric': 'AUC',
    'num_leaves': 8,
    'max_depth': 20,
    'learning_rate': 0.001,
}

lgb_clf = LGBMClassifier(
    objective='binary',
    boosting_type='gbdt',
    metric='binary_logloss',
    n_jobs=1,
)

grid_search = GridSearchCV(estimator=lgb_clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

y_pred = grid_search.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap=plt.cm.Blues)
plt.show()
    
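Since the question is measured in AUC, the tuned model's test-set AUC can also be checked from its predicted probabilities, for example:

# AUC on the held-out set, using the predicted probability of the positive class
proba = grid_search.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, proba))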