Converting from PyTorch to TensorFlow for a Self-Attention Pooling Layer


I have found a PyTorch implementation of the said layer from the paper "Self-Attention Encoding and Pooling for Speaker Recognition", available here. However, due to CUDA compatibility issues, I can't use that code. Also, all my code so far has been written in TensorFlow. So, I want to do a one-to-one translation/conversion of the layer from PyTorch to TensorFlow.

First of all, this is the code in PyTorch:

import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    def __init__(self, input_dim):
        super(SelfAttentionPooling, self).__init__()
        self.W = nn.Linear(input_dim, 1)
    
    def forward(self, batch_rep):
        """
        input:
            batch_rep : size (N, T, H), N: batch size, T: sequence length, H: Hidden dimension
      
        attention_weight:
            att_w : size (N, T, 1)
    
        return:
            utter_rep: size (N, H)
        """
        softmax = nn.functional.softmax
        att_w = softmax(self.W(batch_rep).squeeze(-1)).unsqueeze(-1)
        utter_rep = torch.sum(batch_rep * att_w, dim=1)

        return utter_rep
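
For reference, a quick shape check of the PyTorch layer (the input_dim of 128 and the batch and sequence sizes below are made-up values, just for illustration):

# (N, T, H) -> (N, H): 4 utterances, 9 frames, 128-dimensional hidden states
pooling = SelfAttentionPooling(input_dim=128)
batch_rep = torch.randn(4, 9, 128)
utter_rep = pooling(batch_rep)
assert utter_rep.shape == (4, 128)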

And this is my translation of that snippet to TensorFlow:

import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense, Softmax

class Self_Attention_Pooling(keras.layers.Layer):
    def __init__(self, input_dim):
        super(Self_Attention_Pooling, self).__init__()

        self.W = Dense(input_dim)

    def forward(self, batch_rep):
        softmax = Softmax()
        att_w = self.W(batch_rep)
        att_w = softmax(att_w)
        
        # Not so sure about these two lines though.
        #x = np.expand(batch_rep)
        #att_w = softmax(self.W(x))

        utter_rep = np.sum(batch_rep * att_w, axis=1)

        return utter_rep

Is my implementation/translation/conversion from PyTorch to TensorFlow correct? If not, please edit and help me.

Thank you very much.

1 Answer

Answered by M. Perier--Dulhoste:

Two remarks regarding your implementation:

  • For custom layers in TF, you should implement the call method instead of the forward method (see the TensorFlow guide on implementing custom layers).
  • For the operations, you should replace the NumPy functions with TensorFlow functions to enable GPU support (see the quick sketch after these remarks).
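
As a quick sketch of the second remark, these are the TensorFlow counterparts of the NumPy calls from your snippet (the shapes in the comments are just illustrative):

import tensorflow as tf

x = tf.random.normal((4, 9, 1))

# np.squeeze(x, -1)     -> tf.squeeze(x, axis=-1)
squeezed = tf.squeeze(x, axis=-1)             # (4, 9)
# np.expand_dims(x, -1) -> tf.expand_dims(x, axis=-1)
expanded = tf.expand_dims(squeezed, axis=-1)  # (4, 9, 1)
# np.sum(x, axis=1)     -> tf.reduce_sum(x, axis=1)
summed = tf.reduce_sum(x, axis=1)             # (4, 1)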

Here is the code I am using in TF for the SelfAttentionPooling:

import tensorflow as tf

class SelfAttentionPooling(tf.keras.layers.Layer):
    
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units=1, use_bias=False)
    
    def call(self, x: tf.Tensor) -> tf.Tensor:
        """Apply the self attention pooling on input tensor.
        
        Args:
            x: input tensor (?, seq_len, emb_dim)
        
        Returns:
            (?, emb_dim)
        """
        # (?, seq_len): squeeze only the last axis so a batch of size 1 keeps its batch dimension
        attention_weights = tf.nn.softmax(tf.squeeze(self.dense(x), axis=-1))
        
        # (?, emb_dim)
        pooled = tf.reduce_sum(tf.expand_dims(attention_weights, axis=-1) * x, axis=1)

        return pooled

You can quickly check it gives the expected output:

self_attn_pooling = SelfAttentionPooling()
# (?, seq_len, emb_dim)
input_shape = 4, 9, 128
x = tf.random.normal(input_shape)

pooled = self_attn_pooling(x)

# (?, emb_dim)
assert pooled.shape == (4, 128)
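
If it helps, here is a sketch of how the layer could sit inside a larger Keras model (the input shape and the 10-class Dense head are made-up values for illustration):

# Illustrative only: frame-level embeddings (seq_len=9, emb_dim=128) are pooled
# down to one utterance-level vector, then fed to a made-up 10-class classifier.
inputs = tf.keras.Input(shape=(9, 128))
pooled = SelfAttentionPooling()(inputs)
outputs = tf.keras.layers.Dense(10, activation="softmax")(pooled)
model = tf.keras.Model(inputs, outputs)
model.summary()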