Pretraining BERT Models from scratch vs Further Pretraining


I want to pretrain an Arabic BERT model on domain-specific data to adapt it to a specific task: classifying citizen reviews about government services into the relevant government sectors. My plan is to pretrain the model on freely available Arabic newspaper articles that cover the same sectors as the government ones, including education, healthcare, etc. I know these articles are not highly specific to the target domain, but they are the only suitable data available. Because I am limited in time and computational resources, I plan to pretrain on only around 20K articles. The target dataset contains about 2K citizen reviews written in Modern Standard Arabic.

So, I have several questions concerning this project:

  1. Would it be beneficial to pretrain an Arabic BERT model from scratch on this small dataset of 20K articles, or would it be too small for my problem?

  2. Would it be better to apply further pretraining to an existing Arabic BERT model, i.e., start from its pretrained weights and continue pretraining on the 20K articles? I am afraid this could lead to catastrophic forgetting of the previously learned knowledge. Also, the combination of general and domain-specific knowledge might affect the model's performance on the target dataset of citizen reviews.

  3. Whichever method I choose, should I pretrain the model on unlabeled data (unsupervised/self-supervised learning), or is it better to train it on labeled data so that it is useful for text classification?

  4. After pretraining the model, should I apply feature extraction or fine-tuning on the target dataset of citizen reviews?
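Regarding question 3: BERT pretraining is self-supervised, meaning the training signal comes from the raw text itself via masked-token prediction, so no manual labels are needed at that stage. A minimal sketch of BERT-style masking, with a toy vocabulary and English tokens purely for illustration:

```python
import random

MASK = "[MASK]"
VOCAB = ["school", "hospital", "road", "tax", "service"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: sample ~15% of tokens as prediction targets.
    Of those, 80% are replaced with [MASK], 10% with a random token,
    and 10% are left unchanged; the label is always the original token.
    Non-target positions get a None label (ignored by the MLM loss)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: mask out
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep as-is
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

A real setup would run a subword tokenizer over the Arabic articles rather than whitespace-split words, but the objective is the same: the unlabeled 20K articles are sufficient for this stage, and the 2K labeled reviews are only needed later for the classification step.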

After extensive research, I found that domain-specific models outperform general ones, and that pretraining from scratch is advised to make the model specific to the target domain. However, this requires a large amount of data and computational power, so I am not sure whether 20K articles are enough. I am also not sure whether further pretraining would be beneficial for the domain-specific target data.
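On question 4: feature extraction keeps the pretrained encoder frozen and trains only a classifier head on its outputs, while fine-tuning also updates the encoder weights. The mechanical difference can be sketched with a toy numpy model in which a fixed random projection stands in for the pretrained encoder (all shapes and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(5, 3))    # stand-in for "pretrained" encoder weights
X = rng.normal(size=(60, 5))       # 60 toy "reviews" as raw feature vectors
# Synthetic binary sector label, recoverable from the encoder's output:
y = ((X @ W_pre) @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, W, w, lr=0.1, steps=500, tune_encoder=False):
    """Train a logistic-regression head w on encoder outputs X @ W.
    tune_encoder=False freezes W (feature extraction);
    tune_encoder=True also backpropagates into W (fine-tuning)."""
    W = W.copy()
    for _ in range(steps):
        H = X @ W                              # encoder forward pass
        g = (sigmoid(H @ w) - y) / len(y)      # dLoss/dlogits
        if tune_encoder:
            W -= lr * X.T @ np.outer(g, w)     # encoder gradient step
        w = w - lr * H.T @ g                   # head gradient step
    return W, w

w0 = np.zeros(3)
W_fr, w_fr = fit(X, y, W_pre, w0)                     # feature extraction
W_ft, w_ft = fit(X, y, W_pre, w0, tune_encoder=True)  # fine-tuning

def acc(W, w):
    return np.mean((sigmoid(X @ W @ w) > 0.5) == y)
```

For BERT-style models, fine-tuning the whole encoder on the labeled target set is the standard recipe and usually performs best, while feature extraction is cheaper and can be reasonable when the labeled set is very small, as with the 2K reviews here.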
