How do you get a single embedding vector for each word from RoBERTa?


As you may know, RoBERTa (like BERT and similar models) has its own tokenizer, and sometimes a given word comes back as several sub-word tokens, e.g. embeddings » embed, #dings.

Due to the nature of the task I am working on, I need a single representation for each word. How do I get it?

CLARIFICATION:

sentence: "embeddings are good" --> 3 word tokens given
output: [embed,#dings,are,good] --> 4 tokens are out

When I feed a sentence to pre-trained RoBERTa, I get back encoded tokens, but in the end I need one representation per word. What is the solution? Summing the embed and #dings token embeddings point-wise?
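For illustration, here is a minimal sketch (assuming the Hugging Face transformers tokenizer for roberta-base, which is not named in the question) that reproduces the mismatch between word count and token count:

```python
# Minimal sketch: show that the RoBERTa tokenizer can return more
# sub-word tokens than there are words in the sentence.
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("embeddings are good"))
# More tokens than the 3 words may come back; the exact split of
# "embeddings" depends on the BPE vocabulary.
```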


There is 1 answer below.


I'm not sure there is a standard practice, but what I have seen others do is simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
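For reference, a minimal sketch of that averaging, assuming the Hugging Face transformers library and roberta-base (this is not the code from the cited paper):

```python
# Average each word's sub-token embeddings into one vector per word.
import torch
from transformers import RobertaTokenizerFast, RobertaModel

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

encoded = tokenizer("embeddings are good", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state[0]  # (num_tokens, hidden_size)

# word_ids() maps each token position to the index of the word it came from
# (None for the <s> / </s> special tokens); it requires the fast tokenizer.
word_ids = encoded.word_ids()

buckets = {}
for pos, wid in enumerate(word_ids):
    if wid is None:
        continue
    buckets.setdefault(wid, []).append(hidden[pos])

# One vector per original word: the mean of its sub-token embeddings.
word_vectors = [torch.stack(v).mean(dim=0) for _, v in sorted(buckets.items())]
print(len(word_vectors))  # 3 words -> 3 vectors
```

If you prefer summing (as suggested in the question), replace `.mean(dim=0)` with `.sum(dim=0)`; averaging just keeps the magnitude comparable across words with different numbers of sub-tokens.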