Is there a way to tokenize sentences with Longformer?


I have forked the Multimodal Transformers package and created a version with Longformer support here: https://github.com/jtfields/Multimodal-Toolkit-Longformer/tree/master. Georgian.io maintains the Multimodal Transformers package; here are their comments on the error message I'm receiving:

"Hey @jtfields, I haven't had a chance to look at your code but judging by the error, it sounds like longformers might have an additional step required. Specifically, your forward() method returns a different than expected shape.

Looking at this in particular: RuntimeError: stack expects each tensor to be equal size, but got [32, 2, 768] at entry 0 and [32, 43] at entry 1

It looks like your outputs are of the shape (batch_size, sequence_length, embedding_dim). This corresponds to having an embedding for every word in the output, i.e., word embeddings. However, what we want is a sentence embedding, where we have one embedding for every sentence (or paragraph). So instead, the shape we want is (batch_size, embedding_dim).

Unfortunately there's no ready answer I have on how to get that. Different models have different best practices. BERT-based models use the embedding of the [CLS] token to get sentence embeddings, while others such as XLM use an additional layer to do this task (see the sequence_summary bits in multimodal_transformers/model/tabular_transformers.py). I'm not familiar with longformers so I can't tell you exactly what to do, but I'm sure that there's a standard method people use for it."
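
For context, this is roughly how I read that suggestion, outside the toolkit: pool Longformer's per-token output down to one vector per input by taking the hidden state of the first <s> token (Longformer's CLS-equivalent). The model name and the pooling choice here are my assumptions, not something the package does today.

# Minimal sketch: collapse (batch_size, sequence_length, embedding_dim)
# into (batch_size, embedding_dim) using the first-token hidden state.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer(["An example sentence."], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state      # (batch_size, seq_len, 768)
sentence_embedding = token_embeddings[:, 0, :]    # (batch_size, 768)
print(sentence_embedding.shape)                   # torch.Size([1, 768])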

Does anyone have any suggestions for how to change the tokenization to work with Longformer in this package?

I'm not sure whether the best approach is to change the code below in the notebook or to change the code in tabular_transformers.py:

from transformers import AutoTokenizer

tokenizer_path_or_name = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
print('Specified tokenizer: ', tokenizer_path_or_name)
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name,
    cache_dir=model_args.cache_dir,
)
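
If the notebook side is the right place to change things, this is roughly what I'm imagining; the model name, max_length, and the global_attention_mask handling are my assumptions about standard Longformer usage, not something the toolkit currently does:

# Hypothetical sketch of Longformer-specific tokenization for this notebook.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

encoded = tokenizer(
    ["Some long document text ..."],
    padding="max_length",
    truncation=True,
    max_length=4096,          # Longformer accepts inputs up to 4096 tokens
    return_tensors="pt",
)

# Longformer also takes a global_attention_mask; putting global attention on the
# first <s> token is the usual choice for classification-style tasks.
global_attention_mask = torch.zeros_like(encoded["input_ids"])
global_attention_mask[:, 0] = 1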