Are adding domain tokens to the tokenizer and fine-tuning both essential?
a. Is it the right process to add domain tokens to the tokenizer before fine-tuning the model?
b. If I just add domain tokens without fine-tuning, could that improve performance?
c. If I just fine-tune without adding domain tokens, could that improve performance?
d. How many domain sentences would be needed to improve the model's performance?
Thanks
I added just 5K domain tokens, and I only have a few domain sentences for fine-tuning.
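For context, here is roughly the process I followed. This is a minimal sketch assuming the Hugging Face transformers API; the checkpoint name and the example tokens are placeholders:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint; in my case it is the pre-trained model I want to adapt
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder terms; in my case this list holds about 5K domain-specific tokens
domain_tokens = ["token_a", "token_b"]

num_added = tokenizer.add_tokens(domain_tokens)
print(f"Added {num_added} new tokens")

# The new tokens get randomly initialised embedding rows the model has never seen
model.resize_token_embeddings(len(tokenizer))
```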
From your query, I'll try to answer each case based on some assumptions.
In general, a tokenizer splits text into tokens, and the model then learns to represent the relationships between those tokens in an N-dimensional embedding space.
For the cases above, you would probably need to pre-train from scratch on your own data points instead of fine-tuning.
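If you do go that route, the sketch below shows roughly what from-scratch masked-language-model pre-training could look like with the Hugging Face transformers and datasets libraries. The file name, config sizes, and hyperparameters are illustrative assumptions only, not a tested recipe:

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Stock tokenizer used here only for illustration; ideally you would use a
# tokenizer trained on your own domain corpus (see the tokenizer sketch below)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Small, randomly initialised model, i.e. trained from scratch
config = BertConfig(vocab_size=len(tokenizer), hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4)
model = BertForMaskedLM(config)

# "domain_corpus.txt" is a placeholder: one domain sentence per line
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-mlm", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer),
)
trainer.train()
```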
How much text is needed? I can't give a specific number, but the more the better, since more data helps the tokenizer (and the model) represent the text accurately.
As far as I know, you cannot just add the text directly to the tokenizer and expect it to work, because the tokenizer is itself the result of training, during which the vocabulary is learned and the model learns the relationships between those tokens.
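Instead of appending tokens by hand, the more common approach is to train (or re-train) the tokenizer on your domain corpus. A minimal sketch, assuming a "fast" Hugging Face tokenizer and an illustrative vocabulary size and file name:

```python
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def corpus_iterator(path="domain_corpus.txt", batch_size=1000):
    """Yield batches of raw domain sentences from a plain-text file."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Re-learns the subword vocabulary from the domain text while keeping the same
# tokenization algorithm and special tokens as the base tokenizer
domain_tokenizer = base_tokenizer.train_new_from_iterator(corpus_iterator(),
                                                          vocab_size=30000)
domain_tokenizer.save_pretrained("domain-tokenizer")
```

Note that a model pre-trained with the old tokenizer will not understand the new vocabulary, which is why this usually goes hand in hand with pre-training on the domain data as described above.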