How do you use TensorFlow's ctc_batch_cost function with Keras?


I have been trying to implement a CTC loss function in keras for several days now.

Unfortunately, I have yet to find a simple way to do this that fits well with keras. I found tensorflow's tf.keras.backend.ctc_batch_cost function but there is not much documentation on it. I am confused about a few things. First, what are the input_length and label_length parameters? I am trying to make a handwriting recognition model and my images are 32x128, my RNN has 32 time steps, and my character list has a length of 80. I have tried to use 32 for both parameters and this gives me the error below.

Shouldn't the function already know the input_length and label_length from the shape of the first two parameters (y_true and y_pred)?

Secondly, do I need to encode my training data? Is this all done automatically?

I know tensorflow also has a function called tf.keras.backend.ctc_decode. Is this only used when making predictions?

import tensorflow as tf
from tensorflow.keras import layers


def ctc_cost(y_true, y_pred):
    return tf.keras.backend.ctc_batch_cost(
        y_true, y_pred, 32, 32)


model = tf.keras.Sequential([
    layers.Conv2D(32, 5, padding="SAME", input_shape=(32, 128, 1)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D(2, 2),
    layers.Conv2D(64, 5, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D(2, 2),
    layers.Conv2D(128, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Conv2D(128, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Conv2D(256, 3, padding="SAME"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPool2D((1, 2), (1, 2)),
    layers.Reshape((32, 256)),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Reshape((-1, 32, 512)),
    layers.Conv2D(80, 1, padding="SAME"),
    layers.Softmax(-1)
])

print(model.summary())

model.compile(tf.optimizers.RMSprop(0.001), ctc_cost)

Error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: squeeze_dims[0] not in [0,0). for 'loss/softmax_loss/Squeeze' (op: 'Squeeze') with input shapes: []

Model:

Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 32, 128, 32)       832
batch_normalization (BatchNo (None, 32, 128, 32)       128
activation (Activation)      (None, 32, 128, 32)       0
max_pooling2d (MaxPooling2D) (None, 16, 64, 32)        0
conv2d_1 (Conv2D)            (None, 16, 64, 64)        51264
batch_normalization_1 (Batch (None, 16, 64, 64)        256
activation_1 (Activation)    (None, 16, 64, 64)        0
max_pooling2d_1 (MaxPooling2 (None, 8, 32, 64)         0
conv2d_2 (Conv2D)            (None, 8, 32, 128)        73856
batch_normalization_2 (Batch (None, 8, 32, 128)        512
activation_2 (Activation)    (None, 8, 32, 128)        0
max_pooling2d_2 (MaxPooling2 (None, 8, 16, 128)        0
conv2d_3 (Conv2D)            (None, 8, 16, 128)        147584
batch_normalization_3 (Batch (None, 8, 16, 128)        512
activation_3 (Activation)    (None, 8, 16, 128)        0
max_pooling2d_3 (MaxPooling2 (None, 8, 8, 128)         0
conv2d_4 (Conv2D)            (None, 8, 8, 256)         295168
batch_normalization_4 (Batch (None, 8, 8, 256)         1024
activation_4 (Activation)    (None, 8, 8, 256)         0
max_pooling2d_4 (MaxPooling2 (None, 8, 4, 256)         0
reshape (Reshape)            (None, 32, 256)           0
bidirectional (Bidirectional (None, 32, 512)           1050624
bidirectional_1 (Bidirection (None, 32, 512)           1574912
reshape_1 (Reshape)          (None, None, 32, 512)     0
conv2d_5 (Conv2D)            (None, None, 32, 80)      41040     
softmax (Softmax)            (None, None, 32, 80)      0

Here is the tensorflow documentation I was referencing:

https://www.tensorflow.org/api_docs/python/tf/keras/backend/ctc_batch_cost


There is 1 best solution below


First, what are the input_length and label_length parameters?

input_length is the length of the input sequence in time steps. label_length is the length of the text label. Both are given per sample, as tensors of shape (batch_size, 1), not as single integers.

For example, if you are trying to recognize:

John Hancock

and you are doing it in 32 time steps, then your input_length is 32 and your label_length is 12 (len("John Hancock")).
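
As a rough sketch of how the two length arguments could be passed (assuming TensorFlow 2.x in eager mode; the batch size, label values, and random predictions below are made up for illustration), both lengths go in as tensors of shape (batch_size, 1) and y_pred is 3-D. The Squeeze error in the question looks consistent with passing the plain integer 32, which has no axis to squeeze. Note also that the model summary in the question shows a 4-D output, (None, None, 32, 80), where a (batch, time_steps, num_classes) tensor is expected.

```python
import numpy as np
import tensorflow as tf

batch_size, time_steps, num_classes = 2, 32, 80

# y_pred: softmax output with shape (batch, time_steps, num_classes).
# Random values stand in for a real model's predictions.
logits = np.random.rand(batch_size, time_steps, num_classes).astype("float32")
y_pred = tf.nn.softmax(logits, axis=-1)

# y_true: integer-encoded labels, padded to the longest label in the batch.
# The class indices here are arbitrary illustration values.
y_true = tf.constant([[5, 12, 7, 0], [3, 9, 0, 0]], dtype=tf.int32)

# One length per sample, each tensor with shape (batch, 1).
input_length = tf.fill([batch_size, 1], time_steps)
label_length = tf.constant([[3], [2]], dtype=tf.int32)

loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)  # one loss value per sample
```

Inside a Keras loss function the same tensors would have to be built from the batch at hand, since a loss only receives y_true and y_pred.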

Shouldn't the function already know the input_length and label_length from the shape of the first two parameters (y_true and y_pred)?

You usually process input data in batches, and every element has to be padded to the length of the longest element in the batch, so the true lengths can no longer be read from the tensor shapes. In your case the input_length is always the same, but the label_length varies.

When dealing with speech recognition, for example, input_length can vary as well.
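
To illustrate why the lengths have to be tracked separately, here is a minimal sketch of padding a batch of labels. The helper name pad_labels and the pad value 0 are assumptions for this example, not part of any library.

```python
def pad_labels(encoded_labels, pad_value=0):
    """Pad integer-encoded labels to a common length and record the true lengths."""
    max_len = max(len(label) for label in encoded_labels)
    padded = [label + [pad_value] * (max_len - len(label)) for label in encoded_labels]
    # Shape (batch, 1), which is the layout ctc_batch_cost expects for label_length.
    label_length = [[len(label)] for label in encoded_labels]
    return padded, label_length


padded, label_length = pad_labels([[5, 12, 7], [3, 9]])
print(padded)
print(label_length)
```

After padding, both rows have the same width, so only label_length still says that the second label has two characters.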

Secondly, do I need to encode my training data? Is this all done automatically?

Not sure I understand what you are asking, but here is a good example written in Keras:

https://keras.io/examples/image_ocr/
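
On the encoding question: ctc_batch_cost works on integer class indices, so text labels need to be mapped to integers before training. A minimal sketch, assuming a hypothetical character set (letters plus space) and hypothetical helper names:

```python
import string

# Hypothetical character set; replace with the characters in your own data.
charset = string.ascii_letters + " "
char_to_index = {ch: i for i, ch in enumerate(charset)}
index_to_char = {i: ch for ch, i in char_to_index.items()}


def encode_label(text):
    """Map each character of a label to its integer class index."""
    return [char_to_index[ch] for ch in text]


def decode_label(indices):
    """Map integer class indices back to text."""
    return "".join(index_to_char[i] for i in indices)


encoded = encode_label("John Hancock")
print(len(encoded))  # this is the label_length for the sample
```

As far as I know, TensorFlow's CTC loss reserves the last class index for the blank symbol, so a softmax with 80 outputs would leave room for 79 real characters.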

I know tensorflow also has a function called tf.keras.backend.ctc_decode. Is this only used when making predictions?

In general, yes. You can also try to use it to make you breakfast in the morning, but it's not very good at it ;)
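
For prediction, a small sketch of greedy decoding with tf.keras.backend.ctc_decode (assuming TensorFlow 2.x; the hand-built y_pred below is a toy stand-in for real model output, with the last class acting as the CTC blank):

```python
import numpy as np
import tensorflow as tf

time_steps, num_classes = 6, 4  # class 3 (the last index) is the CTC blank

# Most likely class per time step: 0, 0, blank, 1, 1, blank.
frame_classes = [0, 0, 3, 1, 1, 3]
y_pred = np.full((1, time_steps, num_classes), 0.01, dtype="float32")
for t, c in enumerate(frame_classes):
    y_pred[0, t, c] = 0.97

# For ctc_decode, input_length has shape (batch,), not (batch, 1).
input_length = np.array([time_steps])

decoded, log_prob = tf.keras.backend.ctc_decode(y_pred, input_length, greedy=True)
best_path = decoded[0].numpy()  # dense array, padded with -1

# Drop the -1 padding to get the class indices for the first sample.
sequence = [int(i) for i in best_path[0] if i != -1]
print(sequence)
```

Greedy decoding merges repeated classes and removes blanks, so the six frames collapse to the two classes 0 and 1. The indices can then be mapped back to characters with whatever encoding was used for training.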