Categorical features with CatBoost Spark

I am using catboost_spark version 1.0.3.

When I try to run a regression I get this error message:

Internal CatBoost Error (contact developers for assistance): attempt to interpret non-binary feature as binary

My code looks like this:

First I define the metadata that I'll integrate into my model:

meta = {'ml_attr': {'attrs': {'nominal': [{'idx': 0, 'name': 'categorical_feature', 'ord': False, 'vals': ['4', '6']}]}, 'num_attrs': 1}}
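For reference, the same dictionary can be generated from a list of categories instead of being typed by hand. This is only a minimal sketch in plain Python; `nominal_meta` is a hypothetical helper introduced here for illustration and is not part of Spark or CatBoost.

```python
def nominal_meta(specs):
    """Build Spark ML-style column metadata for nominal features.

    specs: list of (name, values) tuples, in feature-vector order.
    Hypothetical helper, not a Spark or CatBoost API.
    """
    nominal = [
        {"idx": i, "name": name, "ord": False, "vals": [str(v) for v in values]}
        for i, (name, values) in enumerate(specs)
    ]
    return {"ml_attr": {"attrs": {"nominal": nominal}, "num_attrs": len(nominal)}}


# Reproduces the metadata used in this question.
meta = nominal_meta([("categorical_feature", ["4", "6"])])
print(meta)
```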

Then I create my Spark DataFrame as follows:

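from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
import pyspark.sql.functions as F
from catboost_spark import Pool

# `spark` is an existing SparkSession.
# Each row holds a one-element feature vector and a label.
my_array = (
    Row(Vectors.dense('6'), "1000.0"),
    Row(Vectors.dense('4'), "10004.0"),
    Row(Vectors.dense('6'), "10005.0"),
)

# Rename the default columns and attach the nominal metadata to "features".
my_df = spark.createDataFrame(spark.sparkContext.parallelize(my_array))\
    .withColumnRenamed("_1", "features")\
    .withColumnRenamed("_2", "label")\
    .withColumn("features", F.col("features").alias("features", metadata=meta))

trainPool = Pool(my_df)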

Once the dataframe has been created, I define the parameters of the CatBoostRegressor and fit it on the trainPool:

params = {
    'borderCount': 150,
    'depth': 3,
    'l2LeafReg': 2.8,
    'learningRate': 0.08,
    'rsm': 0.8,
    'lossFunction': 'Tweedie:variance_power=1.95',
    'evalMetric': 'Tweedie:variance_power=1.95',
    'scoreFunction': EScoreFunction(6), # 'L2',
    'iterations': 5000,
    'loggingLevel': ELoggingLevel(0), # 'Silent',
    'odType': EOverfittingDetectorType(3), # 'Iter',
    'odWait': 40,
    'threadCount': 7,
    'useBestModel': True,
    'allowWritingFiles': False,
}

model_test_test = CatBoostRegressor(**params)
model_test_test = model_test_test.fit(trainPool)

The error is raised at the fit call. It seems to occur when the CatBoostRegressor is fitted on data where a categorical feature contains values not included in the metadata. I have already run some tests, and the problem does not appear to be caused by the type of the data.
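One way to test the hypothesis about values missing from the metadata is a quick check in plain Python. In Spark ML, a nominal feature stores the 0-based index into `vals` rather than the raw category; whether catboost_spark relies on that convention when it decides a two-valued feature is binary is an assumption that has not been confirmed here. `out_of_range` is a hypothetical helper written for this sketch.

```python
def out_of_range(values, vals):
    """Return the feature values that are not valid 0-based indices into vals.

    Hypothetical helper; assumes the Spark ML index convention for nominal attributes.
    """
    valid = {float(i) for i in range(len(vals))}
    return [v for v in values if float(v) not in valid]


vals = ["4", "6"]          # categories declared in the metadata
raw = [6.0, 4.0, 6.0]      # values stored in the feature vectors above

# The raw categories are not indices into vals, so all of them are flagged.
print(out_of_range(raw, vals))      # [6.0, 4.0, 6.0]

# Encoding each category as its position in vals gives in-range values.
encoded = [float(vals.index(str(int(v)))) for v in raw]
print(encoded)                      # [1.0, 0.0, 1.0]
print(out_of_range(encoded, vals))  # []
```

If the check flags the raw values, trying the fit again with the index-encoded vectors would show whether the encoding is related to the error.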
