Why do Keras / TensorFlow not see input features in random forest models if the dataset is very small?


I am trying to use Keras and TensorFlow to predict a variable via random forests. I encountered an unexpected behavior and managed to trace it back to the following issue: if my training dataset is too small, I get the warning "The model does not have any input features i.e. the model is constant and will always return the same prediction." even though there is a feature in the dataset. Is this a bug, or maybe a deeply undocumented feature?

Below is a minimal non-working example. The training dataset simply says that the key 1 is always associated with the value 1 and the key 2 is always associated with the value 2. This information is repeated "multiplicity" times.

The correct behaviour should be the following: whenever we get the key "1" as the input, the probability that "0" is the correct answer is 0.0, the probability that "1" is the correct answer is 1.0, and the probability that "2" is the correct answer is 0.0. In other words, my desired answer is the vector of probabilities (0.0, 1.0, 0.0). If "2" is the key, the desired answer should be (0.0, 0.0, 1.0).

The actual output of the program is as follows: if the multiplicity is at most 4, TensorFlow does not see any input features; if the multiplicity is 5 or more, TensorFlow sees 1 feature. This change of behaviour may indicate a bug.

Also, the output of the prediction seems very strange. For example, for multiplicity=5 we get really crazy probabilities: for "1" we get [0., 0.61333287, 0.3866664] and for "2" we get [0., 0.35999975, 0.6399995].

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd


def train_and_predict(multiplicity):
    # "multiplicity" copies of (key=1, value=1) and of (key=2, value=2).
    train_pd = pd.DataFrame(
        multiplicity * [{"key": 1, "value": 1}]
        + multiplicity * [{"key": 2, "value": 2}]
    )
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")

    rf = tfdf.keras.RandomForestModel()
    rf.fit(x=train_tf)

    # Predict for the two keys seen during training.
    to_guess = pd.DataFrame([{"key": 1}, {"key": 2}])
    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(to_guess)
    return rf.predict(guess_tf)


print(train_and_predict(4))
print(train_and_predict(5))

The interesting parts of the output:

[WARNING] The model does not have any input features i.e. the model is constant and will always return the same prediction.
[INFO] Model loaded with 300 root(s), 300 node(s), and 0 input feature(s).
[INFO] Engine "RandomForestGeneric" built
[INFO] Use fast generic engine

[[0.         0.6033329  0.39666638]
 [0.         0.6033329  0.39666638]]


[INFO] Model loaded with 300 root(s), 452 node(s), and 1 input feature(s).
[INFO] Use fast generic engine.

[[0.         0.61333287 0.3866664 ]
 [0.         0.35999975 0.6399995 ]]

I use TensorFlow version 2.11.0 on Kaggle. Can you help me figure out whether the problem lies in a software bug or rather in me not understanding something?

Best answer:

There are 3 questions in this, let me answer them one by one.

  1. By default, TensorFlow Decision Forests will not create a node with fewer than 5 examples in it. If you only have 8 examples (multiplicity = 4), you cannot split them into 2 nodes with at least 5 examples each, so no split is applied and the model is constant. You can control this through the min_examples hyperparameter, e.g. rf = tfdf.keras.RandomForestModel(min_examples=1) (see the sketch after this list).

  2. The probability vector is 3-dimensional because the labels are the integers 1 and 2, so TF-DF implicitly assumes that all integers in the range [0, 2] are valid labels. This is done mostly to avoid having to compute a mapping from the raw labels to internal label indices - you can control this process in detail through the dataspec of the underlying C++ library.

  3. The reason you're getting "weird" probabilities lies in the definition of random forests. Random Forests (RF) are based on bagging: each tree is trained on a different dataset, sampled randomly (with replacement) from the original one. For you, this means that Tree 1 is (always) trained on the full dataset, which gives the perfect 0/1 probabilities you probably expected. All other trees are trained on a bootstrap sample of the dataset that may not allow a good split (see Part 1 of this answer), in which case the tree just predicts the majority class of its sample. When averaging over all trees, you end up with the probabilities you're seeing.
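
Putting Part 1 into practice, here is a minimal sketch of mine (a variation of the question's code, with the helper renamed; not part of the original example):

import pandas as pd
import tensorflow_decision_forests as tfdf

def train_and_predict_min1(multiplicity):
    train_pd = pd.DataFrame(
        multiplicity * [{"key": 1, "value": 1}]
        + multiplicity * [{"key": 2, "value": 2}]
    )
    train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(train_pd, label="value")

    # min_examples=1 allows leaves with a single example, so a split can be
    # found even with multiplicity=4 (8 examples in total).
    rf = tfdf.keras.RandomForestModel(min_examples=1)
    rf.fit(x=train_tf)

    guess_tf = tfdf.keras.pd_dataframe_to_tf_dataset(pd.DataFrame([{"key": 1}, {"key": 2}]))
    # The predictions are still 3-dimensional (classes 0, 1 and 2), see Part 2.
    return rf.predict(guess_tf)

print(train_and_predict_min1(4))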

It can be interesting to plot the individual trees to get a feel for this:

# Text representation of all trees
print(rf.make_inspector().extract_all_trees())
# In Colab / IPython, we have interactive plots for individual trees
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=77)

TF-DF allows you to disable bagging by setting rf = tfdf.keras.RandomForestModel(bootstrap_training_dataset=False), but doing so defeats one of the main ideas of Random Forests. You can also just create a single tree with rf = tfdf.keras.RandomForestModel(num_trees=1), as the first tree does not use bagging (see the sketch below).
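
For instance, a small sketch of mine (reusing train_tf and guess_tf from the question's code; not part of the original answer): a single tree with min_examples=1 should recover the hard probabilities the question expected.

# One tree, trained on the full dataset (the first tree is not bagged),
# allowed to split down to single-example leaves.
rf = tfdf.keras.RandomForestModel(num_trees=1, min_examples=1)
rf.fit(x=train_tf)
print(rf.predict(guess_tf))  # expected to be close to [[0., 1., 0.], [0., 0., 1.]]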

Note: Generally, RFs also use feature bagging, i.e. sampling a random subset of attributes at each node, but it plays no role here since the dataset only has one feature.
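
If you want to experiment with attribute sampling anyway, the relevant hyperparameter is num_candidate_attributes; a tiny sketch (a no-op with a single feature, shown only for completeness):

# Consider (at most) 1 candidate attribute per split.
rf = tfdf.keras.RandomForestModel(num_candidate_attributes=1)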

Full Disclosure: I'm one of the authors of TensorFlow Decision Forests.