I have a couple of questions:
- In a seq-to-seq model with varying input lengths, if you don't use the attention mask, won't the RNN end up computing hidden state values for the padded elements? Does that mean the attention mask is mandatory, and my output will be wrong without it?
- How do I deal with varying-length labels? Say I have padded them so I can pass them as a batch; I don't want the padded elements to affect my loss, so how do I ignore them?
You can use a dynamic RNN for that. Read about it here: What is a dynamic RNN in TensorFlow?
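Roughly, something like this TF 1.x-style sketch (the shapes, layer sizes, and vocabulary size are made up for illustration) shows both pieces: passing `sequence_length` to `tf.nn.dynamic_rnn` stops the recurrence at the true end of each sequence, and a mask built with `tf.sequence_mask` keeps the padded label positions out of the loss.

```python
import tensorflow as tf  # TF 1.x API (tf.compat.v1 in TF 2)

# Hypothetical shapes: batch of 2 sequences, padded to length 5, feature size 8.
batch_size, max_len, feat, num_classes = 2, 5, 8, 10
inputs = tf.placeholder(tf.float32, [batch_size, max_len, feat])
lengths = tf.placeholder(tf.int32, [batch_size])       # true (unpadded) lengths
labels = tf.placeholder(tf.int32, [batch_size, max_len])

cell = tf.nn.rnn_cell.GRUCell(num_units=16)

# sequence_length tells dynamic_rnn to stop updating the state once a sequence
# ends; outputs past that point are zeros and the last valid state is carried
# through, so padding never corrupts the final hidden state.
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths, dtype=tf.float32)

logits = tf.layers.dense(outputs, units=num_classes)   # per-step class scores

# Mask the loss so padded label positions contribute nothing.
per_step_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)                       # [batch, max_len]
mask = tf.sequence_mask(lengths, maxlen=max_len, dtype=tf.float32)
loss = tf.reduce_sum(per_step_loss * mask) / tf.reduce_sum(mask)
```

Note the loss is averaged over the number of real (unmasked) timesteps rather than over `batch_size * max_len`, so sequences with more padding are not implicitly down-weighted.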