I have been trying to implement a CTC loss function in Keras for several days now.
Unfortunately, I have yet to find a simple way to do this that fits well with Keras. I found TensorFlow's tf.keras.backend.ctc_batch_cost function, but there is not much documentation on it. I am confused about a few things. First, what are the input_length and label_length parameters? I am trying to make a handwriting recognition model; my images are 32x128, my RNN has 32 time steps, and my character list has a length of 80. I tried using 32 for both parameters, and this gives me the error below.
Shouldn't the function already know the input_length and label_length from the shape of the first two parameters (y_true and y_pred)?
Secondly, do I need to encode my training data? Is this all done automatically?
I know TensorFlow also has a function called tf.keras.backend.ctc_decode. Is this only used when making predictions?
import tensorflow as tf
from tensorflow.keras import layers

def ctc_cost(y_true, y_pred):
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, 32, 32)

model = tf.keras.Sequential([
    layers.Conv2D(32, 5, padding="SAME", input_shape=(32, 128, 1)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D(2, 2),
    layers.Conv2D(64, 5, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D(2, 2),
    layers.Conv2D(128, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Conv2D(128, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Conv2D(256, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Reshape((32, 256)),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Reshape((-1, 32, 512)),
    layers.Conv2D(80, 1, padding="SAME"),
    layers.Softmax(-1)
])

print(model.summary())
model.compile(tf.optimizers.RMSprop(0.001), ctc_cost)
Error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: squeeze_dims[0] not in [0,0). for 'loss/softmax_loss/Squeeze' (op: 'Squeeze') with input shapes: []
Model:
Layer (type)                                 Output Shape           Param #
===========================================================================
conv2d (Conv2D)                              (None, 32, 128, 32)    832
batch_normalization (BatchNormalization)     (None, 32, 128, 32)    128
activation (Activation)                      (None, 32, 128, 32)    0
max_pooling2d (MaxPooling2D)                 (None, 16, 64, 32)     0
conv2d_1 (Conv2D)                            (None, 16, 64, 64)     51264
batch_normalization_1 (BatchNormalization)   (None, 16, 64, 64)     256
activation_1 (Activation)                    (None, 16, 64, 64)     0
max_pooling2d_1 (MaxPooling2D)               (None, 8, 32, 64)      0
conv2d_2 (Conv2D)                            (None, 8, 32, 128)     73856
batch_normalization_2 (BatchNormalization)   (None, 8, 32, 128)     512
activation_2 (Activation)                    (None, 8, 32, 128)     0
max_pooling2d_2 (MaxPooling2D)               (None, 8, 16, 128)     0
conv2d_3 (Conv2D)                            (None, 8, 16, 128)     147584
batch_normalization_3 (BatchNormalization)   (None, 8, 16, 128)     512
activation_3 (Activation)                    (None, 8, 16, 128)     0
max_pooling2d_3 (MaxPooling2D)               (None, 8, 8, 128)      0
conv2d_4 (Conv2D)                            (None, 8, 8, 256)      295168
batch_normalization_4 (BatchNormalization)   (None, 8, 8, 256)      1024
activation_4 (Activation)                    (None, 8, 8, 256)      0
max_pooling2d_4 (MaxPooling2D)               (None, 8, 4, 256)      0
reshape (Reshape)                            (None, 32, 256)        0
bidirectional (Bidirectional)                (None, 32, 512)        1050624
bidirectional_1 (Bidirectional)              (None, 32, 512)        1574912
reshape_1 (Reshape)                          (None, None, 32, 512)  0
conv2d_5 (Conv2D)                            (None, None, 32, 80)   41040
softmax (Softmax)                            (None, None, 32, 80)   0
Here is the TensorFlow documentation I was referencing:
https://www.tensorflow.org/api_docs/python/tf/keras/backend/ctc_batch_cost
input_length is the length of the input sequence in time steps. label_length is the length of the text label. For example, if you are trying to recognize an image of the handwritten text "John Hancock" and you are doing it in 32 time steps, then your input_length is 32 and your label_length is 12 (len("John Hancock")).

You usually process input data in batches, which have to be padded to the largest element in the batch, so this information is lost. In your case the input_length is always the same, but the label_length varies. When dealing with speech recognition, for example, input_length can vary as well.
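Note that the documentation describes both parameters as tensors of shape (samples, 1), not plain Python ints. As a rough sketch of how you could build them inside the loss (assuming a fixed 32 time steps, as in your model, and assuming you pad your labels with -1; the padding convention is my choice, not something ctc_batch_cost requires):

import tensorflow as tf

def ctc_cost(y_true, y_pred):
    # y_true: (batch, max_label_len) integer character indices padded
    # with -1; y_pred: (batch, 32, 80) per-time-step softmax output.
    batch_size = tf.shape(y_true)[0]
    # Every image is processed in exactly 32 time steps, so
    # input_length is a (batch, 1) tensor filled with 32.
    input_length = tf.fill([batch_size, 1], 32)
    # label_length is the unpadded length of each label: count the
    # entries that are not the -1 padding value.
    label_length = tf.reduce_sum(
        tf.cast(tf.not_equal(y_true, -1), tf.int32),
        axis=1, keepdims=True)
    return tf.keras.backend.ctc_batch_cost(
        y_true, y_pred, input_length, label_length)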
Not sure I understand what you are asking, but here is a good example written in Keras:
https://keras.io/examples/image_ocr/
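If by "encode" you mean turning the label strings into integer indices: that is not done automatically; the example above does it by mapping each character to its position in a character list. Roughly like this (charset and the -1 padding are placeholders for whatever your data uses):

# Hypothetical helper: encode a label string as integer indices into
# your 80-character list, padded to a fixed length with -1.
charset = "...your 80 characters..."

def encode_label(text, max_len=32):
    label = [charset.index(c) for c in text]
    return label + [-1] * (max_len - len(label))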
In general, yes. You can also try to use it to make your breakfast in the morning, but it's not very good at that ;)
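To illustrate, a sketch of decoding at prediction time (this assumes the model output has been fixed to the (batch, time_steps, characters) shape that ctc_decode expects, and images is a placeholder batch):

import tensorflow as tf

y_pred = model.predict(images)  # assumed shape: (batch, 32, 80)
# One sequence length per sample; here every sample uses all time steps.
input_length = tf.fill([y_pred.shape[0]], y_pred.shape[1])
decoded, log_probs = tf.keras.backend.ctc_decode(
    y_pred, input_length=input_length, greedy=True)
# decoded[0] is a (batch, max_decoded_len) tensor of character indices,
# padded with -1; map them back through your character list.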