As you may know, RoBERTa (BERT, etc.)
has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings » embed, ##dings
Due to the nature of the task I am working on, I need a single representation for each word. How do I get it?
CLARIFICATION:
sentence: "embeddings are good" --> 3 words go in
output: [embed, ##dings, are, good] --> 4 tokens come out
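To illustrate, here is a minimal sketch of the tokenization step, assuming the Hugging Face transformers library and the roberta-base checkpoint (the exact sub-word split depends on the checkpoint's vocabulary):

```python
# Illustrative only: show how a word can come back as several sub-word pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("embeddings are good"))
# A word that is not in the vocabulary is split into several sub-word tokens,
# which is exactly the situation described above.
```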
When I feed a sentence to the pre-trained RoBERTa model, I get the encoded sub-word tokens back. In the end, I need a representation for each word. What's the solution? Summing the embed + ##dings vectors point-wise?
I'm not sure if there is a standard practice, but from what I've seen, others simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
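Here is a minimal sketch of that averaging approach, assuming the Hugging Face transformers library and the roberta-base checkpoint (the word_ids() mapping needs a fast tokenizer, which AutoTokenizer returns by default):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

encoded = tokenizer("embeddings are good", return_tensors="pt")
with torch.no_grad():
    # Contextual embeddings for every sub-word token: (num_sub_tokens, hidden_size)
    hidden = model(**encoded).last_hidden_state[0]

# word_ids() maps each sub-token position to the index of the word it came
# from (None for special tokens like <s> and </s>).
word_ids = encoded.word_ids()

word_vectors = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    positions = [p for p, i in enumerate(word_ids) if i == word_idx]
    word_vectors.append(hidden[positions].mean(dim=0))  # average the sub-token vectors

word_vectors = torch.stack(word_vectors)
print(word_vectors.shape)  # one vector per word, e.g. (3, 768) for roberta-base
```

Summing instead of averaging would also give one vector per word; the mean just keeps the magnitude comparable between words that were split into different numbers of pieces.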