Understanding and using incremental regression with CatBoost

I saw the classification example "Catboost training model for huge data(~22GB) with multiple chunks" and tried to adapt it for incremental multi-target regression, but I keep spinning my wheels: it generates a series of different errors that are not clear to me.

from catboost import CatBoostRegressor
from catboost import Pool
import pandas as pd
from sklearn.model_selection import train_test_split

clf = CatBoostRegressor(task_type="CPU",
                        iterations=1000,
                        loss_function='MultiRMSE',
                        learning_rate=0.5,
                        max_depth=7)
chunk = pd.read_csv('./train2/DataTableTrain.tsv', sep='\t', chunksize=10000000)
for i, ds in enumerate(chunk):
    X = Pool(ds, column_description='./train-7-7.cd', delimiter='\t')
    if i == 0:
        clf.fit(X, column_description='./train-7-7.cd')
    else:
        clf.fit(X, init_model='model.bin',
                column_description='./train-7-7.cd')
    clf.save_model('model.bin')  # save the model so it is loaded in the next step
    del X

The column description file looks as follows:

0       Label
1       Label
2       Label
3       Label
4       Label
5       Label
6       Label
7       Num
8       Num
9       Num
10      Num
11      Num
12      Num
13      Num

Each line of the data file contains 14 tab-separated numbers, stored as text.

Currently it complains: 'data should be the string or pathlib.Path type if column_description parameter is specified.' Can someone please tell me what CatBoost expects here, or provide a sample Python application that incrementally processes a file for multiple regression?

Also, can you explain what "incremental" means in the context of training? In the examples I have run successfully, each chunk seems to just tack a new model onto the end of the previous one.
