create_pretraining_data.py is writing 0 records to tf_examples.tfrecord while training custom BERT model


I am training a custom BERT model on my own corpus. I generated the vocab file using BertWordPieceTokenizer and then ran the code below:

!python create_pretraining_data.py \
  --input_file=/content/drive/My Drive/internet_archive_scifi_v3.txt \
  --output_file=/content/sample_data/tf_examples.tfrecord \
  --vocab_file=/content/sample_data/sifi_13sep-vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

I get the following output:

INFO:tensorflow:*** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
INFO:tensorflow:  /content/sample_data/tf_examples.tfrecord
INFO:tensorflow:Wrote 0 total instances

I'm not sure why I always get 0 instances in tf_examples.tfrecord. What am I doing wrong?

FYI, I am using TensorFlow 1.12, and the generated vocab file is 290 KB.

1 Answer


The script cannot read the input file: the unescaped space in My Drive splits the path into two arguments, so no input files are found and 0 instances are written. Use My\ Drive instead of My Drive:

--input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
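As a quick sanity check before re-running the script, you can verify that the shell actually resolves the path to an existing file (the path below is the one from the question; adjust it to your setup). Quoting the whole path works just as well as backslash-escaping the space:

```shell
# Sketch: confirm the Drive path is readable before launching
# create_pretraining_data.py. The unquoted space in "My Drive" is
# what breaks the --input_file argument in the original command.
INPUT_FILE="/content/drive/My Drive/internet_archive_scifi_v3.txt"

if [ -f "$INPUT_FILE" ]; then
  echo "input file found: $INPUT_FILE"
else
  echo "input file NOT found - double-check the path and its escaping"
fi

# Equivalent ways to pass the flag (escape the space or quote the path):
#   --input_file=/content/drive/My\ Drive/internet_archive_scifi_v3.txt
#   --input_file="/content/drive/My Drive/internet_archive_scifi_v3.txt"
```

Either form hands the script a single argument containing the space; the original, unquoted version is split by the shell into `/content/drive/My` and `Drive/internet_archive_scifi_v3.txt`.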