Is there a way to tokenize sentences with Longformer?


I have forked the Multimodal Transformers package and created a version with Longformer support here: https://github.com/jtfields/Multimodal-Toolkit-Longformer/tree/master. Georgian.io maintains the Multimodal Transformers package; here are their comments on the error message I'm receiving:

"Hey @jtfields, I haven't had a chance to look at your code but judging by the error, it sounds like longformers might have an additional step required. Specifically, your forward() method returns a different than expected shape.

Looking at this in particular: RuntimeError: stack expects each tensor to be equal size, but got [32, 2, 768] at entry 0 and [32, 43] at entry 1

It looks like your outputs are of the shape (batch_size, sequence_length, embedding_dim). This corresponds to having an embedding for every word in the output, i.e., word embeddings. However, what we want is a sentence embedding, where we have one embedding for every sentence (or paragraph). So instead, the shape we want is (batch_size, embedding_dim).

Unfortunately there's no ready answer I have on how to get that. Different models have different best practices. BERT-based models use the embedding of the [CLS] token to get sentence embeddings, while others such as XLM use an additional layer to do this task (see the sequence_summary bits in multimodal_transformers/model/tabular_transformers.py). I'm not familiar with longformers so I can't tell you exactly what to do, but I'm sure that there's a standard method people use for it."
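
For context, this is roughly how I read that suggestion, outside the toolkit: pool Longformer's per-token output down to one vector per input by taking the hidden state of the first <s> token (Longformer's CLS-equivalent). The model name and the pooling choice here are my assumptions, not something the package does today.

# Minimal sketch: collapse (batch_size, sequence_length, embedding_dim)
# into (batch_size, embedding_dim) using the first-token hidden state.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer(["An example sentence."], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state      # (batch_size, seq_len, 768)
sentence_embedding = token_embeddings[:, 0, :]    # (batch_size, 768)
print(sentence_embedding.shape)                   # torch.Size([1, 768])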

Does anyone have any suggestions for how to change the tokenization to work with Longformer in this package?

I'm not sure whether the best approach is to change the code below in the notebook or to change the code in tabular_transformers.py:

from transformers import AutoTokenizer

tokenizer_path_or_name = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
print('Specified tokenizer: ', tokenizer_path_or_name)
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name,
    cache_dir=model_args.cache_dir,
)
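
If the notebook side is the right place to change things, this is roughly what I'm imagining; the model name, max_length, and the global_attention_mask handling are my assumptions about standard Longformer usage, not something the toolkit currently does:

# Hypothetical sketch of Longformer-specific tokenization for this notebook.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

encoded = tokenizer(
    ["Some long document text ..."],
    padding="max_length",
    truncation=True,
    max_length=4096,          # Longformer accepts inputs up to 4096 tokens
    return_tensors="pt",
)

# Longformer also takes a global_attention_mask; putting global attention on the
# first <s> token is the usual choice for classification-style tasks.
global_attention_mask = torch.zeros_like(encoded["input_ids"])
global_attention_mask[:, 0] = 1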