When fine-tuning a RoBERTa model to add domain-specific knowledge, what is the overall process?


Are both adding domain tokens to the tokenizer and fine-tuning essential?

a. Is it the right process to add domain tokens to the tokenizer before fine-tuning the model?
b. If I just add domain tokens without fine-tuning, could performance improve?
c. If I just fine-tune without adding domain tokens, could performance improve?
d. How many domain sentences would be needed to improve the model's performance?

Thanks

I added just 5K domain tokens. I have only a few domain sentences for fine-tuning.


There is 1 best solution below.


From your query, I'll try to provide an answer based on some assumptions in each case.

In general, a tokenizer splits text into tokens (ideally meaningful subwords), and the model then learns to represent the relationships between those tokens in an N-dimensional embedding space.
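For illustration, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `roberta-base` checkpoint) of how a pretrained tokenizer handles a term it never saw during training; the term itself is a hypothetical example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# "pembrolizumab" is a hypothetical domain term used only for illustration.
print(tokenizer.tokenize("pembrolizumab"))
# A rare term typically splits into several short subword pieces rather
# than one token (the exact split depends on the learned BPE merges).
```

If your domain vocabulary mostly fragments like this, the pretrained tokenizer is not representing your text efficiently.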

  1. Is the domain you are mentioning completely unrelated to the training data?
  2. Does the domain contain words/sentences that are mostly different from the text the pretrained model was trained on? Example: plain English text vs. code; both look like English but are essentially different when it comes to training.

In the above cases, you may need to pre-train from scratch on your own data points instead of fine-tuning.
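If pre-training from scratch turns out to be necessary, the first step is usually retraining the tokenizer itself on your domain corpus. A hedged sketch, assuming Hugging Face `transformers` and a hypothetical `domain_corpus.txt` with one domain sentence per line:

```python
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def corpus_iterator(path="domain_corpus.txt", batch_size=1000):
    """Yield batches of lines so the whole corpus never sits in memory."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Learn a fresh BPE vocabulary from the domain text, reusing the base
# tokenizer's configuration (special tokens, byte-level pre-tokenization).
domain_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=30000
)
domain_tokenizer.save_pretrained("domain-tokenizer")
```

The `vocab_size` here is an assumption; you would pick it based on the size and diversity of your corpus.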

How much text is needed? I cannot give a specific number, but the more the better, as more data helps the tokenizer represent the text accurately.

As far as I know, simply adding tokens to the tokenizer is not enough on its own: the tokenizer and the embedding table it maps into are both the result of training, so any newly added tokens start with random, untrained embeddings and only become useful after further training on domain text.
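Mechanically, libraries like Hugging Face `transformers` do let you append tokens, but the point stands: the new rows in the embedding matrix start out random. A minimal sketch (the domain terms below are hypothetical placeholders):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical domain terms, for illustration only.
new_tokens = ["pembrolizumab", "immunotherapy-naive"]
num_added = tokenizer.add_tokens(new_tokens)

# The embedding matrix must grow to cover the new vocabulary entries.
# The added rows are randomly initialized, which is why the new tokens
# carry no meaning until the model is trained on domain sentences.
model.resize_token_embeddings(len(tokenizer))
```

This is why, for case (b) in your question, adding tokens without fine-tuning is unlikely to improve performance: the model has never learned what those tokens mean.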