The last layers of Longformer for document embeddings


What's the right way to return a limited number of layers using the Longformer API?

Unlike in this case with basic BERT, it's not clear to me from the return type how to get only the last N layers.

So, I run this:

from transformers import LongformerTokenizer, LongformerModel

text = "word " * 4096 # long document!

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input)

And I get dimensions like so from my return:

>>> output[0].shape
torch.Size([1, 4096, 768])

>>> output[1].shape
torch.Size([1, 768])

You can see that the shape of output[0] matches my token count. I believe slicing it would just give me fewer tokens, not the last N layers.
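As a quick check (my own addition), slicing output[0] along its second dimension confirms this: it returns the last few token vectors of the final layer, not the last few layers:

>>> output[0][:, -3:, :].shape
torch.Size([1, 3, 768])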

Update from answer below

Even after asking for output_hidden_states, the dimensions still look off, and it's not clear to me how to reduce them to a single fixed-size, 1-D embedding vector. Here's what I mean:

encoded_input = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True)
output = model(**encoded_input, output_hidden_states=True)

Ok, now let's look into output[2], the third item of the tuple:

>>> len(output[2])
13

Suppose we want to see the last 3 of the 13 layers:

>>> [layer[0].shape for layer in output[2][-3:]]
[torch.Size([4096, 768]), torch.Size([4096, 768]), torch.Size([4096, 768])]

So each of the 13 layers has shape (1, 4096, 768); indexing [0] just drops the batch dimension, leaving (4096 x 768). They look like:

>>> [layer[0] for layer in output[2][-3:]]
[tensor([[-0.1494,  0.0190,  0.0389,  ..., -0.0470,  0.0259,  0.0609],

We still have a size of 4096, corresponding to my token count, even after averaging the last three layers:

>>> import numpy as np
>>> np.mean(np.stack([layer[0].detach().numpy() for layer in output[2][-3:]]), axis=0).shape
(4096, 768)

Averaging these layers together still leaves one vector per token, so it does not give me a single embedding I can use for comparisons like cosine similarity.
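One way to reduce this to a single 1-D document vector (a mean-pooling sketch of my own, not something the answers below prescribe) is to average the last N layers and then mean-pool over the token axis, masking out padding with the attention mask:

import torch

# A sketch: assumes `output` was produced with output_hidden_states=True, as above.
stacked = torch.stack(output[2][-3:])   # [3, 1, 4096, 768]: last 3 layers
layer_avg = stacked.mean(dim=0)         # [1, 4096, 768]: average over layers

# Mean-pool over the token axis, ignoring padding positions via the attention mask.
mask = encoded_input["attention_mask"].unsqueeze(-1)             # [1, 4096, 1]
doc_embedding = (layer_avg * mask).sum(dim=1) / mask.sum(dim=1)  # [1, 768]

The resulting [1, 768] tensor is one fixed-size vector per document, which is what cosine-similarity comparisons need.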

There are 2 answers below.

Answer 1:

output is a tuple consisting of two elements:

  1. sequence_output (i.e. the hidden states of the last encoder block)
  2. pooled_output

In order to obtain all hidden layers, you need to set the parameter output_hidden_states to True:

output = model(**encoded_input, output_hidden_states=True)

The output now has 3 elements, and the third element contains the output of the embedding layer plus the output of each encoder layer (13 tensors in total for this 12-layer model).
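For instance (a minimal sketch assuming the call above), the last 3 layers are then just a slice of that third element:

hidden_states = output[2]        # tuple of 13 tensors, each of shape [1, seq_len, 768]
last_three = hidden_states[-3:]  # outputs of the last 3 encoder layers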

Answer 2:
print(output.keys())
# odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])

print("output[0] gives us the sequence output:\n", output[0].shape)   # torch.Size([1, 4096, 768])
print("output[1] gives us the pooled output:\n", output[1].shape)     # torch.Size([1, 768])
print("output[2] gives us the hidden states:\n", output[2][0].shape)  # torch.Size([1, 4096, 768])

For your use case, you can use output[1] (the pooled output) as the document embedding.
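As a usage sketch (text_a and text_b are hypothetical placeholder documents, not from the answer), two pooled outputs can be compared with cosine similarity:

import torch.nn.functional as F

enc_a = tokenizer(text_a, return_tensors="pt", max_length=4096, truncation=True)
enc_b = tokenizer(text_b, return_tensors="pt", max_length=4096, truncation=True)
emb_a = model(**enc_a)[1]  # pooled output, shape [1, 768]
emb_b = model(**enc_b)[1]
similarity = F.cosine_similarity(emb_a, emb_b)  # tensor of shape [1]

Note that how meaningful the pooled output is for similarity depends on whether the checkpoint's pooler was trained for that; mean pooling over the tokens of the last hidden state (as sketched in the question's update) is a common alternative.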