I am using catboost_spark version 1.0.3. When I try to fit a regression model, I get this error message:
Internal CatBoost Error (contact developers for assistance): attempt to interpret non-binary feature as binary
My code looks like this. First, I define the metadata that I attach to the features column:
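# Imports used by the snippets in this post
from pyspark.sql import Row, functions as F
from pyspark.ml.linalg import Vectors
from catboost_spark import (CatBoostRegressor, Pool, EScoreFunction,
                            ELoggingLevel, EOverfittingDetectorType)

meta = {'ml_attr': {'attrs': {'nominal': [{'idx': 0, 'name': 'categorical_feature', 'ord': False, 'vals': ['4', '6']}]}, 'num_attrs': 1}}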
Then I create my Spark DataFrame as follows:
my_array = (
    Row(Vectors.dense('6'), "1000.0"),
    Row(Vectors.dense('4'), "10004.0"),
    Row(Vectors.dense('6'), "10005.0")
)
my_df = spark.createDataFrame(spark.sparkContext.parallelize(my_array))\
    .withColumnRenamed("_1", "features")\
    .withColumnRenamed("_2", "label")\
    .withColumn("features", F.col("features").alias("features", metadata=meta))
trainPool = Pool(my_df)
Once the DataFrame has been created, I define the parameters of the CatBoostRegressor and fit it on trainPool:
params = {
    'borderCount': 150,
    'depth': 3,
    'l2LeafReg': 2.8,
    'learningRate': 0.08,
    'rsm': 0.8,
    'lossFunction': 'Tweedie:variance_power=1.95',
    'evalMetric': 'Tweedie:variance_power=1.95',
    'scoreFunction': EScoreFunction(6),  # 'L2'
    'iterations': 5000,
    'loggingLevel': ELoggingLevel(0),  # 'Silent'
    'odType': EOverfittingDetectorType(3),  # 'Iter'
    'odWait': 40,
    'threadCount': 7,
    'useBestModel': True,
    'allowWritingFiles': False,
}
model_test_test = CatBoostRegressor(**params)
model_test_test = model_test_test.fit(trainPool)
The error is raised at the fit call. It seems to occur when the CatBoostRegressor is fitted on data where a categorical feature contains values that are not listed in the metadata. I have already run some tests, and the problem does not appear to be caused by the data types.
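To make that hypothesis concrete, here is a minimal sketch in plain Python (no Spark needed) of the kind of check I have in mind. It rests on an assumption I have not confirmed for catboost_spark: that, as in Spark ML's nominal attribute convention, the feature vector is expected to hold the index of each category within `vals` (0.0 for '4', 1.0 for '6') rather than the raw value. `encode_nominal` is a hypothetical helper written for illustration; it is not part of catboost_spark or pyspark.

```python
# Same metadata as in the post above.
meta = {'ml_attr': {'attrs': {'nominal': [{'idx': 0, 'name': 'categorical_feature',
                                           'ord': False, 'vals': ['4', '6']}]},
                    'num_attrs': 1}}


def encode_nominal(raw_value, attr):
    """Hypothetical helper: return the position of raw_value in attr['vals'] as a float.

    Raises ValueError when the value is not declared in the metadata.
    """
    vals = attr['vals']
    if raw_value not in vals:
        raise ValueError(
            "value %r of feature %r is not declared in the metadata vals %r"
            % (raw_value, attr['name'], vals)
        )
    return float(vals.index(raw_value))


attr = meta['ml_attr']['attrs']['nominal'][0]

# Raw categories from the three rows in the post, mapped to indices into 'vals'.
encoded = [encode_nominal(v, attr) for v in ['6', '4', '6']]
print(encoded)
```

If this assumption holds, the values placed in the vectors would be these indices instead of 6.0 and 4.0, and a raw value such as '5' would be rejected up front by the helper instead of reaching the fit call.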