I'm currently trying to fine-tune DistilGPT-2 (with PyTorch and the HuggingFace transformers library) for a code completion task. My corpus is arranged like the following example:
<|startoftext|>
public class FindCityByIdService {
private CityRepository cityRepository = ...
<|endoftext|>
My first attempt was to run the following script from the transformers library:
python run_clm.py \
--model_type=gpt2 \
--model_name_or_path distilgpt2 \
--do_train \
--train_file $TRAIN_FILE \
--num_train_epochs 100 \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir \
--save_steps 20000 \
--per_device_train_batch_size 4
After doing some generation tests, I realized that the model never predicts \n for any given context. I imagine some pre-processing stage or something similar is missing. Anyway, what should I do so that \n is predicted as expected?
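A generation check along the following lines (the model path and prompt are just placeholders) makes the problem easy to see, since repr() exposes any newline characters in the decoded output:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the fine-tuned checkpoint produced by run_clm.py
model_path = "output/distilgpt2-code"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "public class FindCityByIdService {"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# repr() makes any '\n' in the completion visible
print(repr(tokenizer.decode(outputs[0], skip_special_tokens=True)))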
Thanks!!
I think I found a hacky solution for this.
In run_clm.py, change the tokenize_function so that it appends "\n" to each line before tokenizing; a rough sketch of the change is below.
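Assuming your copy of run_clm.py has the stock tokenize_function (the exact body differs a bit between transformers versions), the edit is roughly:
# tokenizer and text_column_name are defined earlier in run_clm.py

# Original (simplified): tokenize each line as-is
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])

# Modified: append "\n" to each line so newline tokens survive tokenization
def tokenize_function(examples):
    return tokenizer([line + "\n" for line in examples[text_column_name]])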
When the Dataset is initially built, it splits the corpus into lines without keeping the newline at the end of each line. Then the group_texts method concatenates those lines into batches without adding the newlines back. So changing tokenize_function to append \n to each line gives us those newlines back.
Just tested this change out on my fine-tuning job and it worked! The resulting model now generates newlines.