From my understanding, all four of these methods:
predict, predict_on_batch, predict_step, and a direct forward pass through the model (e.g. model(x, training=False) or __call__())
should give the same results; some are just more efficient than others in how they handle batches of data versus single samples.
But I am actually getting different results on an image super-resolution (upscaling) task I'm working on:
for lowres, _ in val.take(1):
    # Get a randomly cropped region of the lowres image for upscaling
    lowres = tf.image.random_crop(lowres, (150, 150, 3))  # uint8

    # Need to add a dummy batch dimension for the predict step
    model_inputs = tf.expand_dims(lowres, axis=0)  # (1, 150, 150, 3), uint8

    # And convert the uint8 image values to float32 for input to the model
    model_inputs = tf.cast(model_inputs, tf.float32)  # float32

    preds = model.predict_on_batch(model_inputs)
    min_val = tf.reduce_min(preds).numpy()
    max_val = tf.reduce_max(preds).numpy()
    print("Min value: ", min_val)
    print("Max value: ", max_val)

    preds = model.predict(model_inputs)
    min_val = tf.reduce_min(preds).numpy()
    max_val = tf.reduce_max(preds).numpy()
    print("Min value: ", min_val)
    print("Max value: ", max_val)

    preds = model.predict_step(model_inputs)
    min_val = tf.reduce_min(preds).numpy()
    max_val = tf.reduce_max(preds).numpy()
    print("Min value: ", min_val)
    print("Max value: ", max_val)

    preds = model(model_inputs, training=False)  # __call__()
    min_val = tf.reduce_min(preds).numpy()
    max_val = tf.reduce_max(preds).numpy()
    print("Min value: ", min_val)
    print("Max value: ", max_val)
Prints:
Min value: -6003.622
Max value: 5802.6826
Min value: -6003.622
Max value: 5802.6826
Min value: -53.7696
Max value: 315.1499
Min value: -53.7696
Max value: 315.1499
Both predict_step and __call__() give the "correct" answers, in the sense that the resulting upscaled images look correct.
I'm happy to share more details on the model if that's helpful, but for now I thought I'd just leave it at this to not overcomplicate the question. At first I wondered if these methods had different results based on training/inference modes, but my model doesn't use any BatchNorm or Dropout layers, so that shouldn't make a difference here. It's completely composed of: Conv2D, Add, tf.nn.depth_to_space (pixel shuffle), and Rescaling layers. That's it. It also doesn't use any subclassing or override any methods, just uses keras.Model(inputs, outputs).
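For reference, the model is built along these lines (a rough sketch only; the real filter counts and number of residual blocks are different, so treat the numbers below as placeholders):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model(num_filters=64, num_blocks=4, upscale_factor=3):
    inputs = keras.Input(shape=(None, None, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)  # Rescaling layer
    x = skip = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)

    # Residual blocks built from Conv2D + Add
    for _ in range(num_blocks):
        y = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(num_filters, 3, padding="same")(y)
        x = layers.Add()([x, y])

    x = layers.Conv2D(num_filters, 3, padding="same")(x)
    x = layers.Add()([x, skip])

    # Pixel shuffle upsampling via tf.nn.depth_to_space
    x = layers.Conv2D(3 * upscale_factor ** 2, 3, padding="same")(x)
    x = tf.nn.depth_to_space(x, upscale_factor)
    outputs = layers.Rescaling(255.0)(x)  # Rescaling layer

    return keras.Model(inputs, outputs)   # no subclassing, no overridden methods

model = build_model()
```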
Any ideas why these prediction methods would give different answers?
EDIT: I've been able to create a minimal reproducible example where you can see the issue. Please see: https://www.kaggle.com/code/quackaddict7/really-minimum-reproducible-example
I initially couldn't reproduce the problem in a minimal example. I gradually added back a dataset, batching, data augmentation, training, and model saving/restoring, and eventually discovered that the issue is GPU vs. CPU! So I took all of that back out for the minimal example. If you run the attached notebook you'll see that on CPU, all four methods give the same answer with randomly initialized weights. But if you switch to a P100 GPU, predict/predict_on_batch differ from predict_step and the forward pass (__call__).
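One quick way to force everything onto the CPU for comparison, without switching the Kaggle accelerator, is to hide the GPU from TensorFlow (this has to run before any GPU ops or tensors are created, ideally right after importing TensorFlow):

```python
import tensorflow as tf

# Hide the GPU so all ops run on the CPU; must be called before the
# runtime initializes any GPU devices.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # no GPU should be listed now
```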
So I guess at this point, my question is, why are CPU vs. GPU results different here?
I have tested the given sample code with tf.keras==2.12.0 and found a possible bug in the API; it fails only on GPU. In your sample code, the mismatch occurs because of the relu activation. If we set anything else, e.g. selu, elu, or even leaky_relu, the methods work as expected. In order to keep using relu, the following fix can be adopted for the moment. Here is the full code for reference.
Model
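A minimal sketch of the changed piece, assuming the EDSR-style architecture from the question; the idea sketched here is to apply relu as its own layer rather than through the fused activation="relu" argument on Conv2D (filter sizes are illustrative):

```python
from tensorflow.keras import layers

def conv_block(x, num_filters=64):
    # Workaround sketch: decouple the activation from Conv2D so relu
    # runs as a standalone layer instead of a fused activation.
    y = layers.Conv2D(num_filters, 3, padding="same")(x)
    y = layers.Activation("relu")(y)  # or layers.ReLU()
    y = layers.Conv2D(num_filters, 3, padding="same")(y)
    return layers.Add()([x, y])
```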
Inference
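Inference can then be run with all four methods, assuming `tf`, `model`, and a float32 input batch `model_inputs` prepared as in the question:

```python
preds_predict = model.predict(model_inputs)
preds_on_batch = model.predict_on_batch(model_inputs)
preds_step = model.predict_step(model_inputs)
preds_call = model(model_inputs, training=False)  # __call__()

# Print the min/max of each output, as in the question
for name, preds in [("predict", preds_predict),
                    ("predict_on_batch", preds_on_batch),
                    ("predict_step", preds_step),
                    ("__call__", preds_call)]:
    print(name, tf.reduce_min(preds).numpy(), tf.reduce_max(preds).numpy())
```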
Logits Checking
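And the outputs can be checked against each other numerically, e.g. with np.testing.assert_allclose (using the variables from the inference step above):

```python
import numpy as np

# With the activation decoupled, all four outputs should agree to within
# float32 tolerance on both CPU and GPU.
np.testing.assert_allclose(preds_predict, preds_on_batch, rtol=1e-4, atol=1e-4)
np.testing.assert_allclose(preds_predict, np.asarray(preds_step), rtol=1e-4, atol=1e-4)
np.testing.assert_allclose(preds_predict, np.asarray(preds_call), rtol=1e-4, atol=1e-4)
print("all outputs match")
```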