How to find positional embeddings from BartTokenizer?


The objective is to add customized token embeddings (obtained using a different model) to BART's positional embeddings.

Is there a way I can obtain the positional embeddings along with the token embeddings for an article (500-1000 words long) using the BART model? This is how I tokenize:

tokenized_sequence = tokenizer(sentence, padding='max_length', truncation=True, max_length=512, return_tensors="pt")

The output contains input_ids and attention_mask, but there is no parameter to return position_ids like in the BERT model, where you can do:

bert.embeddings.position_embeddings('YOUR_POSITIONS_IDS')
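
For context, a minimal sketch of what that looks like with BERT (the bert-base-uncased checkpoint and the fixed length of 512 are just example choices, not from the question):

import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")  # example checkpoint

# position ids are simply 0..seq_len-1 for each sequence in the batch
position_ids = torch.arange(512).unsqueeze(0)                # shape (1, 512)
pos_emb = bert.embeddings.position_embeddings(position_ids)  # shape (1, 512, 768)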

Or is the only way to obtain positional embeddings to use a sinusoidal positional encoding?

Best answer:

The tokenizer is not responsible for the embeddings. It only generates the ids that are fed into the embedding layer. BART's embeddings are learned, i.e. they come from the model's own embedding layers.

You can retrieve both types of embeddings like this. Here bart is a BartModel. Inside the encoder, the embedding step is (roughly) done like this:

embed_pos = bart.encoder.embed_positions(input_ids)   # learned positional embeddings
inputs_embeds = bart.encoder.embed_tokens(input_ids)  # learned token embeddings
hidden_states = inputs_embeds + embed_pos             # sum fed into the encoder layers

Full working code:

from transformers import BartForConditionalGeneration, BartTokenizer

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-base")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
input_ids = tok(example_english_phrase, return_tensors="pt").input_ids

embed_pos = bart.model.encoder.embed_positions(input_ids)  # learned positional embeddings
inputs_embeds = bart.model.encoder.embed_tokens(input_ids) * bart.model.encoder.embed_scale  # token embeddings; the scale is 1.0 by default
hidden_states = inputs_embeds + embed_pos  # what the encoder layers receive (before layer norm and dropout)
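
Tying this back to the stated objective (adding token embeddings from a different model to BART's positional embeddings), a minimal sketch continuing from the code above; custom_token_embeds is a hypothetical stand-in for embeddings produced elsewhere, and it must match bart-base's hidden size of 768:

import torch

# hypothetical custom token embeddings from another model,
# shaped (batch_size, seq_len, hidden_size) to match bart-base
custom_token_embeds = torch.randn(input_ids.shape[0], input_ids.shape[1], 768)

# combine them with BART's learned positional embeddings, mirroring the encoder
combined = custom_token_embeds + bart.model.encoder.embed_positions(input_ids)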

Note that embed_pos is invariant to the actual token ids; only their positions matter. "New" embeddings are added when the input grows longer, without changing the embeddings of the earlier positions:

These cases yield the same embeddings: embed_positions([0, 1]) == embed_positions([123, 241]) == embed_positions([444, 3453, 9344, 3453])[:2]
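
This can be checked directly (a sketch continuing from the code above; embed_positions expects batched id tensors here, so the lists become 2D tensors and the slice is taken along the sequence dimension):

import torch

pos = bart.model.encoder.embed_positions
a = pos(torch.tensor([[0, 1]]))
b = pos(torch.tensor([[123, 241]]))
c = pos(torch.tensor([[444, 3453, 9344, 3453]]))

print(torch.equal(a, b))         # True: the ids differ, the positions do not
print(torch.equal(a, c[:, :2]))  # True: earlier positions keep their embeddings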