I would like to predict categorical variables using random forests in TensorFlow / Keras. I expected the output to be a vector of probabilities, and this is the case when there are at least three possible output values. Surprisingly, the output is a single number when there are only two possible values.

My code would be simpler if I did not have to treat such special cases separately, so here is my question: why does TensorFlow treat this case differently, and is there a way to handle both cases uniformly?

Below you can find the minimal example.

import tensorflow as tf
import keras
import tensorflow_decision_forests as tfdf
import pandas as pd


def train_and_predict(multi, values):
    # Build a training set in which the feature "key" equals the label "value",
    # repeating every value `multi` times.
    train_pd = pd.DataFrame([{"key": v, "value": v} for _ in range(multi) for v in values])
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")

    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)

    # Predict once for each distinct value.
    to_guess = pd.DataFrame([{"key": v} for v in values])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)


print(train_and_predict(500, ["a", "b"]))
print(train_and_predict(500, ["a", "b", "c"]))

For integer input

print(train_and_predict(500, [3, 5]))

we get, as expected, a vector of probabilities:

[[0.         0.         0.         0.99999917 0.         0.        ]
 [0.         0.         0.         0.         0.         0.99999917]]

Unfortunately, for string-valued categorical variables

print(train_and_predict(500, ["a", "b"]))

we get a single number per row as the answer:

[[0.        ]
 [0.99999917]]

whereas if there are at least three possible values:

print(train_and_predict(500, ["a", "b", "c"]))

we get a nice list of probabilities:

[[0.99999917 0.         0.        ]
 [0.         0.99999917 0.        ]
 [0.         0.         0.99999917]]

I use TensorFlow and Keras version '2.11.0', on Kaggle.


Short answer: If your classification problem (with string labels) has only two distinct label values (i.e. binary classification), TF-DF outputs only the probability p of the positive label, i.e. the label that comes later in lexicographical order. The probability of the other label is 1 - p.
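To handle both cases uniformly, one option is to post-process the array returned by rf.predict. The following is a minimal sketch, assuming a single-column result holds the probability p of the positive class as described above. The helper name to_probability_matrix is hypothetical and not part of TF-DF.

```python
import numpy as np


def to_probability_matrix(predictions):
    """Return an (n_samples, n_classes) array of class probabilities.

    Hypothetical helper, not part of TF-DF. It assumes that a single-column
    input holds the probability p of the positive class (label 1).
    """
    predictions = np.asarray(predictions, dtype=float)
    if predictions.ndim == 1:
        # Accept a flat vector of probabilities as well.
        predictions = predictions.reshape(-1, 1)
    if predictions.shape[1] == 1:
        # Binary case: column 0 becomes 1 - p, column 1 stays p.
        return np.hstack([1.0 - predictions, predictions])
    # Multi-class case: there is already one column per class.
    return predictions


# Binary output from the question, expanded to two columns.
binary = np.array([[0.0], [0.99999917]])
print(to_probability_matrix(binary))
```

With this wrapper, calling to_probability_matrix(rf.predict(guess_tf)) would give one column per class in both the two-class and the multi-class case.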

Details:

String labels: Keras does not support string labels natively; for Keras, labels have to be non-negative integers. Since TF-DF is used through the Keras API, the function tfdf.keras.pd_dataframe_to_tf_dataset() converts the strings in the label column to integers by sorting them and assigning the labels 0, 1, ..., n-1, where n is the number of unique values in the label column.

For n=2, the problem is recognized as a binary classification problem and TF-DF outputs only the probability of the positive class (the one mapped to 1), that is, the second string according to Python's sorting of the string values.
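The mapping described above can be illustrated in plain Python. This is only a sketch of the sort-and-enumerate idea, not TF-DF's actual implementation, and build_label_mapping is a hypothetical name.

```python
def build_label_mapping(labels):
    """Map each distinct string label to an integer 0, 1, ..., n-1.

    Hypothetical illustration: labels are sorted with Python's default
    string ordering and numbered in that order.
    """
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


mapping = build_label_mapping(["b", "a", "b", "a"])
print(mapping)  # {'a': 0, 'b': 1}

# In the binary case, the label mapped to 1 is the positive class.
positive_label = max(mapping, key=mapping.get)
print(positive_label)  # b
```

Under this mapping, column i of a multi-class prediction corresponds to the i-th label in sorted order, and the single column of a binary prediction corresponds to the label mapped to 1.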

Integer labels: If your labels are already integers, tfdf.keras.pd_dataframe_to_tf_dataset() does not modify them. TF-DF also recognizes integer labels as "already integerized" and, to save space and complexity, does not apply any mapping. Instead, it assumes that the possible labels are 0, 1, ..., max_label, where max_label is the value of the largest label in the label column. The output vector therefore has max_label+1 dimensions. For the labels [3, 5] in the question, max_label is 5, which is why the output has 6 columns. If your labels are [0, 1], you will also see that only the probability of the label 1 is returned.
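If the unused columns for sparse integer labels are unwanted, one possible workaround is to remap the labels to 0, 1, ..., n-1 before training and to keep the sorted list of original values for decoding. The sketch below uses a hypothetical helper, densify_labels, that is not part of TF-DF.

```python
def densify_labels(int_labels):
    """Remap arbitrary integer labels to 0, 1, ..., n-1.

    Hypothetical helper. Returns the remapped labels and the sorted list of
    original values, so that classes[i] is the original label of index i.
    """
    classes = sorted(set(int_labels))
    to_dense = {label: index for index, label in enumerate(classes)}
    dense = [to_dense[label] for label in int_labels]
    return dense, classes


dense, classes = densify_labels([3, 5, 3, 5])
print(dense)    # [0, 1, 0, 1]
print(classes)  # [3, 5]
```

Note that two distinct labels become 0 and 1 after remapping, so the binary behaviour described above would then apply and only the probability of label 1 would be returned.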

Full disclosure: I'm one of the authors of TensorFlow Decision Forests.