I am trying to train BERT from scratch on a domain-specific dataset using the official TensorFlow GitHub repository.
I used this part of the documentation to adapt the scripts to my use case, but I have a problem. First I use the create_pretraining_data.py script, which converts the .txt file to .tfrecord. Everything goes well here, but when I run the train.py script, which starts training the BERT model, next_sentence_accuracy increases after some steps, while masked_lm_accuracy always remains 0.
This is the config.yaml file given to the train.py script:
task:
  init_checkpoint: ''
  model:
    cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 768, name: next_sentence, num_classes: 2}]
    encoder:
      type: bert
      bert:
        attention_dropout_rate: 0.1
        dropout_rate: 0.1
        hidden_activation: gelu
        hidden_size: 768
        initializer_range: 0.02
        intermediate_size: 3072
        max_position_embeddings: 512
        num_attention_heads: 12
        num_layers: 12
        type_vocab_size: 2
        vocab_size: 50000
  train_data:
    drop_remainder: true
    global_batch_size: 32
    input_path: 'test_clean_tfrecord/2014/*'
    is_training: true
    max_predictions_per_seq: 20
    seq_length: 128
    use_next_sentence_label: true
    use_position_id: false
    use_v2_feature_names: false
  validation_data:
    drop_remainder: false
    global_batch_size: 32
    input_path: 'test_clean_tfrecord/2014/*'
    is_training: false
    max_predictions_per_seq: 20
    seq_length: 128
    use_next_sentence_label: true
    use_position_id: false
    use_v2_feature_names: false
trainer:
  checkpoint_interval: 5
  max_to_keep: 5
  optimizer_config:
    learning_rate:
      polynomial:
        cycle: false
        decay_steps: 1000000
        end_learning_rate: 0.0
        initial_learning_rate: 0.0001
        power: 1.0
      type: polynomial
    optimizer:
      type: adamw
    warmup:
      polynomial:
        power: 1
        warmup_steps: 10000
      type: polynomial
  steps_per_loop: 1
  summary_interval: 1
  train_steps: 200
  validation_interval: 5
  validation_steps: 64
And this is the output of train.py after 5 training steps:
2022-12-10 13:21:48.184678: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
C:\Users\Iulian\AppData\Roaming\Python\Python39\site-packages\keras\engine\functional.py:637:
UserWarning: Input dict contained keys ['masked_lm_positions',
'masked_lm_ids', 'masked_lm_weights', 'next_sentence_labels']
which did not match any model input. They will be ignored by the model.
inputs = self._flatten_to_reference_inputs(inputs)
WARNING:tensorflow:Gradients do not exist for variables ['pooler_transform/kernel:0', 'pooler_transform/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
W1210 13:21:52.408583 13512 utils.py:82] Gradients do not exist for variables ['pooler_transform/kernel:0', 'pooler_transform/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
WARNING:tensorflow:Gradients do not exist for variables ['pooler_transform/kernel:0', 'pooler_transform/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
W1210 13:21:58.768023 19348 utils.py:82] Gradients do not exist for variables ['pooler_transform/kernel:0', 'pooler_transform/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
train | step: 2 | steps/sec: 0.0 | output:
{'learning_rate': 1.9799998e-08,
'lm_example_loss': 10.961581,
'masked_lm_accuracy': 0.0,
'next_sentence_accuracy': 0.5625,
'next_sentence_loss': 0.73979986,
'training_loss': 11.701381}
train | step: 3 | steps/sec: 0.0 | output:
{'learning_rate': 2.97e-08,
'lm_example_loss': 10.981846,
'masked_lm_accuracy': 0.0,
'next_sentence_accuracy': 0.5,
'next_sentence_loss': 0.75065744,
'training_loss': 11.732503}
train | step: 4 | steps/sec: 0.0 | output:
{'learning_rate': 3.9599996e-08,
'lm_example_loss': 10.988701,
'masked_lm_accuracy': 0.0,
'next_sentence_accuracy': 0.5625,
'next_sentence_loss': 0.69400764,
'training_loss': 11.682709}
train | step: 5 | steps/sec: 0.0 | output:
{'learning_rate': 4.9500002e-08,
'lm_example_loss': 11.004994,
'masked_lm_accuracy': 0.0,
'next_sentence_accuracy': 0.75,
'next_sentence_loss': 0.5528765,
'training_loss': 11.557871}
I've looked through the source code to find where masked_lm_accuracy is used (I thought a special flag might be needed to enable it), and I found that this accuracy is added by default to the model's list of metrics:
def build_metrics(self, training=None):
  del training
  metrics = [
      tf.keras.metrics.SparseCategoricalAccuracy(name='masked_lm_accuracy'),
      tf.keras.metrics.Mean(name='lm_example_loss')
  ]
  # TODO(hongkuny): rethink how to manage metrics creation with heads.
  if self.task_config.train_data.use_next_sentence_label:
    metrics.append(
        tf.keras.metrics.SparseCategoricalAccuracy(
            name='next_sentence_accuracy'))
    metrics.append(tf.keras.metrics.Mean(name='next_sentence_loss'))
  return metrics

def process_metrics(self, metrics, labels, model_outputs):
  with tf.name_scope('MaskedLMTask/process_metrics'):
    metrics = dict([(metric.name, metric) for metric in metrics])
    if 'masked_lm_accuracy' in metrics:
      metrics['masked_lm_accuracy'].update_state(
          labels['masked_lm_ids'], model_outputs['mlm_logits'],
          labels['masked_lm_weights'])
    if 'next_sentence_accuracy' in metrics:
      metrics['next_sentence_accuracy'].update_state(
          labels['next_sentence_labels'], model_outputs['next_sentence'])
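For reference, here is a minimal pure-Python sketch of what that masked_lm_accuracy update appears to compute, assuming the usual Keras behaviour of a weighted mean of argmax matches. The function name weighted_sparse_accuracy and the toy inputs are hypothetical, and the real tensors carry batch and prediction dimensions that are flattened here.

```python
from typing import Sequence


def weighted_sparse_accuracy(
    label_ids: Sequence[int],
    logits: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted fraction of positions whose argmax over logits equals the label."""
    matched = 0.0
    total = 0.0
    for label, row, weight in zip(label_ids, logits, weights):
        # Index of the largest logit, i.e. the predicted token id.
        predicted = max(range(len(row)), key=row.__getitem__)
        matched += weight * (predicted == label)
        total += weight
    # Return 0.0 when every position is padding (all weights are zero).
    return matched / total if total else 0.0


# Toy example with a 4-token vocabulary; the last position is padding (weight 0).
labels = [2, 1, 0]
logits = [
    [0.1, 0.2, 3.0, 0.0],  # argmax 2, correct
    [1.5, 0.3, 0.2, 0.1],  # argmax 0, wrong (label is 1)
    [0.0, 0.0, 0.0, 9.0],  # argmax 3, wrong but ignored by its zero weight
]
weights = [1.0, 1.0, 0.0]

print(weighted_sparse_accuracy(labels, logits, weights))  # 0.5
```

Under this reading, the metric stays at exactly 0.0 only if no masked position with a non-zero weight has its label as the highest-scoring token.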
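There are a few possible reasons why your masked_lm_accuracy stays at zero during BERT pre-training. Keep in mind that the log above covers only five steps. lm_example_loss is about 11, close to ln(50000) ≈ 10.8, which is the loss of a uniform guess over your 50000-token vocabulary, so a masked_lm_accuracy of 0.0 is what an essentially untrained model would show. next_sentence_accuracy sits near 0.5 at the same point only because that task has two classes.
Your dataset may not be large enough. The config describes a 12-layer model with hidden size 768, and a model of that size needs a large amount of text to learn masked-token prediction. If your dataset is too small, the model may not be able to learn the relationships between words and their contexts.
Your data may not be clean. Noise such as extraction debris or heavy misspelling makes masked tokens harder to predict, although occasional typos on their own are unlikely to hold the accuracy at exactly zero.
Your hyperparameters may not suit the run. In your config, warmup_steps is 10000 while train_steps is only 200, so the learning rate never gets close to initial_learning_rate (0.0001). The logged values of roughly 2e-08 to 5e-08 reflect this, and at such rates the weights barely move.
If masked_lm_accuracy is still zero after many more steps, with the learning rate ramped up and lm_example_loss no longer falling, try a larger dataset, cleaner data, or adjusted hyperparameters.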
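The learning-rate values in the log can be reproduced from the numbers in config.yaml. The sketch below is a plain-Python approximation, not the library's code: it assumes the warmup ramps linearly towards whatever the polynomial decay would give at the end of warmup, a rule chosen here because it matches the logged values. The function names are hypothetical.

```python
# Values copied from the config.yaml in the question.
INITIAL_LR = 1e-4
END_LR = 0.0
DECAY_STEPS = 1_000_000
DECAY_POWER = 1.0
WARMUP_STEPS = 10_000
WARMUP_POWER = 1
TRAIN_STEPS = 200


def decayed_lr(step: int) -> float:
    """Polynomial decay from INITIAL_LR to END_LR over DECAY_STEPS (cycle: false)."""
    step = min(step, DECAY_STEPS)
    fraction_left = 1.0 - step / DECAY_STEPS
    return (INITIAL_LR - END_LR) * fraction_left ** DECAY_POWER + END_LR


def warmup_lr(step: int) -> float:
    """Assumed warmup rule: ramp towards the decayed rate reached at WARMUP_STEPS."""
    if step >= WARMUP_STEPS:
        return decayed_lr(step)
    target = decayed_lr(WARMUP_STEPS)
    return target * (step / WARMUP_STEPS) ** WARMUP_POWER


# Steps 2 to 5 appear in the log; TRAIN_STEPS is the last step of the run.
for step in (2, 3, 4, 5, TRAIN_STEPS):
    print(f"step {step:>3}: learning_rate = {warmup_lr(step):.4e}")
```

Under these assumptions the values for steps 2 to 5 agree with the log (1.98e-08 up to 4.95e-08), and the rate at step 200 is still only about 1.98e-06, roughly 2% of initial_learning_rate.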
Here are some additional tips that may help you improve your masked_lm_accuracy:
Use a diverse dataset. The more varied the contexts in which each word appears, the better BERT will be able to learn the relationships between words and their contexts.
Clean your data. Remove extraction debris and obviously corrupted lines from the .txt file before running create_pretraining_data.py.
Match the hyperparameters to the length of the run. warmup_steps is normally well below train_steps; with warmup_steps at 10000 and train_steps at 200, the learning rate stays tiny for the whole run.
Be patient. BERT can take a long time to train, and five steps (or even 200) is too few to judge masked_lm_accuracy. Watch whether lm_example_loss drops clearly below about 10.8 before concluding that something is broken.
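To put the logged numbers in perspective, here is a rough sketch of what pure guessing looks like for this config. It only uses values from config.yaml; the variable names are hypothetical, and it assumes a uniform guess over the vocabulary, which is an idealisation of a freshly initialised model.

```python
import math

VOCAB_SIZE = 50_000            # task.model.encoder.bert.vocab_size
GLOBAL_BATCH_SIZE = 32         # task.train_data.global_batch_size
MAX_PREDICTIONS_PER_SEQ = 20   # task.train_data.max_predictions_per_seq

# Cross-entropy of a uniform guess over the vocabulary.
chance_loss = math.log(VOCAB_SIZE)

# Upper bound on masked positions scored per step (padding slots have weight 0).
max_masked_per_step = GLOBAL_BATCH_SIZE * MAX_PREDICTIONS_PER_SEQ

# Expected number of correct masked tokens per step under uniform guessing.
expected_correct = max_masked_per_step / VOCAB_SIZE

print(f"chance-level lm loss: {chance_loss:.3f}")
print(f"masked positions per step (at most): {max_masked_per_step}")
print(f"expected correct guesses per step: {expected_correct:.4f}")
```

With at most 640 masked positions per step and about 0.013 expected correct guesses, an accuracy of exactly 0.0 in the first few steps is consistent with a model that has not started learning yet, rather than with a broken metric.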