What activation function is used in the TensorFlow text classification example?


I am trying to understand the TensorFlow text classification example at https://www.tensorflow.org/tutorials/keras/text_classification. They define the model as follows:

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1)])

To the best of my knowledge, deep learning models use activation functions, and I wonder which activation function the above classification model uses internally. Can anyone help me understand that?

There are 2 answers below.

BEST ANSWER

As you noted, the model is defined like this:

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1)])

The dataset used in that tutorial is for binary classification (labels zero and one). By not setting any activation on the last layer, the original author gets the raw logits rather than probabilities. That is why the loss function is configured as

model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              ... 

Now, if we set the last layer's activation to sigmoid (the usual pick for binary classification), then we must set from_logits=False. So, here are the two options to choose from:

With logits: from_logits=True

We take the logits from the last layer, which is why we set from_logits=True.

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1, activation=None)])

model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(
    train_ds, verbose=2,
    validation_data=val_ds,
    epochs=epochs)
Epoch 1/3
7ms/step - loss: 0.6828 - accuracy: 0.5054 - val_loss: 0.6148 - val_accuracy: 0.5452
Epoch 2/3
7ms/step - loss: 0.5797 - accuracy: 0.6153 - val_loss: 0.4976 - val_accuracy: 0.7406
Epoch 3/3
7ms/step - loss: 0.4664 - accuracy: 0.7734 - val_loss: 0.4197 - val_accuracy: 0.8096
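Side note (a minimal sketch, not part of the original code): since this first model outputs raw logits, you can still get probabilities at inference time by stacking a sigmoid on top of the trained model, which is essentially what the linked tutorial ends up doing when it exports the model. Here, model refers to the logits model trained above.

import tensorflow as tf
from tensorflow.keras import layers

# Wrap the trained logits model so predictions come out as probabilities.
probability_model = tf.keras.Sequential([
    model,                         # the logits model trained above
    layers.Activation('sigmoid')   # maps logits to values in (0, 1)
])

# probability_model.predict(batch) now returns values between 0 and 1.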

Without logits: from_logits=False

Here we take the probability from the last layer, which is why we set from_logits=False.

model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1, activation='sigmoid')])

model.compile(loss=losses.BinaryCrossentropy(from_logits=False),
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(
    train_ds, verbose=2,
    validation_data=val_ds,
    epochs=epochs)
Epoch 1/3
8ms/step - loss: 0.6818 - accuracy: 0.6163 - val_loss: 0.6135 - val_accuracy: 0.7736
Epoch 2/3
7ms/step - loss: 0.5787 - accuracy: 0.7871 - val_loss: 0.4973 - val_accuracy: 0.8226
Epoch 3/3
8ms/step - loss: 0.4650 - accuracy: 0.8365 - val_loss: 0.4195 - val_accuracy: 0.8472

Now, you may wonder why this tutorial uses logits (i.e. no activation on the last layer). The short answer is that it generally doesn't matter; we can choose either option. The caveat is that there is a chance of numerical instability when using from_logits=False, because the loss is then computed from already-squashed sigmoid probabilities. Check this answer for more details.
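To see that the two options compute the same loss (only more or less stably), here is a small sketch with made-up numbers, not from the original answer:

import tensorflow as tf

y_true = tf.constant([[0.0], [1.0]])
logits = tf.constant([[2.0], [-1.5]])

# Same loss computed two ways: directly from logits (numerically stable) ...
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# ... and from probabilities after applying the sigmoid ourselves.
bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

print(float(bce_logits(y_true, logits)))             # ~1.914
print(float(bce_probs(y_true, tf.sigmoid(logits))))  # ~1.914, the same value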

SECOND ANSWER

This model effectively uses a single activation function at the output (a sigmoid), which produces the predictions for the binary classification task.

The task at hand often guides the choice of both the loss and the activation function. In this case, the binary cross-entropy loss is used together with the sigmoid activation function (also called the logistic function), which outputs a value between 0 and 1 for any real-valued input. This is quite well explained in this post.
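As a quick illustration (my addition, not from the original answer), the logistic/sigmoid function is simply 1 / (1 + exp(-x)):

import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]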

In contrast, you can also have multiple activation functions in a neural network, depending on its architecture; it is very common for instance in convolutional neural networks to have an activation function for each convolutional layer, as shown in this tutorial.
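For example, a small convolutional network might look like the following sketch (an assumed architecture for illustration, not the one from the linked tutorial), with a ReLU activation on each convolutional layer and a softmax on the output:

import tensorflow as tf
from tensorflow.keras import layers

cnn = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),  # ReLU after first conv layer
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),                           # ReLU after second conv layer
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')                             # softmax output over 10 classes
])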