I would like to make predictions for some categorical variables using random forests in TensorFlow / Keras. I would expect that the output should be a vector of probabilities, and it is the case if there are at least three possible output values. Surprisingly, the answer seems to be a single number if the set of possible values consists of just two elements.
My code would be simpler if I did not have to treat such special cases separately, so here is my question: why does TensorFlow treat this case specially, and is there a way to handle both cases uniformly?
Below is a minimal example.
import tensorflow as tf
import keras
import tensorflow_decision_forests as tfdf
import pandas as pd

def train_and_predict(multi, values):
    # Training set: every value repeated `multi` times; the feature equals the label.
    train_pd = pd.DataFrame([{"key": v, "value": v} for _ in range(multi) for v in values])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")
    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)
    # Predict once for each distinct value.
    to_guess = pd.DataFrame([{"key": v} for v in values])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)

print(train_and_predict(500, ["a", "b"]))
print(train_and_predict(500, ["a", "b", "c"]))
For the input
print(train_and_predict(500, [3, 5]))
we get, as expected, a vector of probabilities:
[[0. 0. 0. 0.99999917 0. 0. ]
[0. 0. 0. 0. 0. 0.99999917]]
Unfortunately, for categorical variables
print(train_and_predict(500, ["a", "b"]))
we get a single number as the answer
[[0. ]
[0.99999917]]
while if there are at least three possible values:
print(train_and_predict(500, ["a", "b", "c"]))
we get a nice list of probabilities:
[[0.99999917 0. 0. ]
[0. 0.99999917 0. ]
[0. 0. 0.99999917]]
I use TensorFlow and Keras version '2.11.0', on Kaggle.
Short answer: If your classification problem (with string labels) has only two values in the label (i.e. binary classification), TF-DF outputs only the probability p of the positive label, i.e. the one that comes later in lexicographic order. The probability of the other label is 1-p.
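To handle both cases uniformly, one option is to expand the single-column output back into two columns after predicting. The following is only a sketch: as_probability_matrix is a hypothetical helper name (not part of TF-DF), and it assumes that a prediction with exactly one column always comes from a binary problem and that the columns should follow the same sorted label order as in the multi-class output.

```python
import numpy as np

def as_probability_matrix(pred):
    """Return one probability column per class, also for binary problems.

    Assumption: a (n, 1) array holds p(positive class) of a binary problem.
    """
    pred = np.asarray(pred, dtype=float)
    if pred.ndim == 2 and pred.shape[1] == 1:
        # Column 0: negative class (1 - p), column 1: positive class (p).
        return np.hstack([1.0 - pred, pred])
    # Three or more classes: already one column per class.
    return pred

binary = np.array([[0.25], [0.75]])      # shape (2, 1), as in the two-label case
multi = np.eye(3)                        # shape (3, 3), as in the three-label case
print(as_probability_matrix(binary).shape)
print(as_probability_matrix(multi).shape)
```

In the question's code this would be applied as as_probability_matrix(rf.predict(guess_tf)).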
Details:
String labels: Keras does not support string labels natively; for Keras, labels have to be non-negative integers. Since TF-DF is used through the Keras API, the function tfdf.keras.pd_dataframe_to_tf_dataset() converts the strings in the label column to integers by sorting them and assigning the labels 0, 1, ..., n-1, where n is the number of unique values in the label column. For n=2, the problem is recognized as a binary classification problem and TF-DF outputs only the probability of the positive class (the one mapped to 1), that is, the second string according to Python's sorting of the string values.
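The mapping described above can be mimicked in plain Python to see which string ends up as the positive class. This is an illustration of the description only, not the library's actual code, and label_to_index is a made-up name.

```python
def label_to_index(values):
    """Sort the unique label strings and number them 0..n-1 (illustrative only)."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

mapping = label_to_index(["b", "a", "b", "a"])
print(mapping)  # {'a': 0, 'b': 1}

# The label mapped to 1 is the "positive" class in the binary case.
positive = max(mapping, key=mapping.get)
print(positive)  # b
```

The same sorted order gives the column order of the output when there are three or more labels.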
Integer labels: If your labels are already integers, tfdf.keras.pd_dataframe_to_tf_dataset() does not modify them. TF-DF also recognizes integer labels as "already integerized" and applies no mapping, which saves space and complexity. Instead, it assumes that the possible labels are 0, 1, ..., max_label, where max_label is the value of the largest label in the label column. The output vector therefore has max_label+1 dimensions. If your labels are [0, 1], you will also see that only the probability of the label 1 is returned.
Full disclosure: I'm one of the authors of TensorFlow Decision Forests.
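For the integer-label case, the number of output columns follows directly from the rule described above. A small sketch, with the hypothetical helper expected_output_width restating that rule rather than calling TF-DF:

```python
def expected_output_width(int_labels):
    """Number of prediction columns per the rule above (illustrative only)."""
    # Labels are assumed to range over 0..max_label, used or not.
    num_classes = max(int_labels) + 1
    # With exactly two classes, only p(label == 1) is returned.
    return 1 if num_classes == 2 else num_classes

print(expected_output_width([3, 5]))  # 6
print(expected_output_width([0, 1]))  # 1
```

This matches the six columns seen for the labels [3, 5] in the question, where only columns 3 and 5 are ever non-zero.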