Why aren't new lines generated with my fine-tuned DistilGPT2 model?

I'm currently trying to fine-tune DistilGPT-2 (with PyTorch and the HuggingFace transformers library) for a code completion task. My corpus is arranged like the following example:

<|startoftext|>
public class FindCityByIdService {
    private CityRepository cityRepository = ...
<|endoftext|>
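
The corpus file is built more or less like this (a minimal sketch; the paths and the *.java glob are placeholders):

    # Illustrative sketch only: SOURCE_DIR and TRAIN_FILE are placeholder paths.
    from pathlib import Path

    SOURCE_DIR = Path("data/java_sources")   # assumed location of the source files
    TRAIN_FILE = Path("data/train.txt")      # the file $TRAIN_FILE points to below

    with TRAIN_FILE.open("w", encoding="utf-8") as out:
        for source_file in sorted(SOURCE_DIR.glob("**/*.java")):
            code = source_file.read_text(encoding="utf-8").strip()
            # Wrap every example in the delimiter tokens shown above.
            out.write("<|startoftext|>\n" + code + "\n<|endoftext|>\n")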

My first attempt was to run the following script from the transformers library:

python run_clm.py \
    --model_type=gpt2 \
    --model_name_or_path distilgpt2 \
    --do_train \
    --train_file $TRAIN_FILE \
    --num_train_epochs 100 \
    --output_dir $OUTPUT_DIR \
    --overwrite_output_dir \
    --save_steps 20000 \
    --per_device_train_batch_size 4

After doing some generation tests, I realized that the model never predicts \n for any given context. I imagine that some pre-processing step or something similar is missing. What should I do so that \n is predicted as expected?
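
For reference, the generation tests look roughly like this (a minimal sketch; the prompt, the sampling settings and the model directory are placeholders, with model_dir standing in for $OUTPUT_DIR):

    # Minimal sketch of one generation test; model_dir stands in for $OUTPUT_DIR.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "output/distilgpt2-code"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)

    prompt = "<|startoftext|>\npublic class FindCityByIdService {"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=64, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0]))
    # The completion comes back as one long line: \n is never sampled.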

(See also: HF Forum question)

Thanks!!

1 answer below.

I think I found a hacky solution for this.

In run_clm.py change:

    def tokenize_function(examples):
        return tokenizer(examples[text_column_name])

to:

    def tokenize_function(examples):
        return tokenizer([example + "\n" for example in examples[text_column_name]])

When the dataset is initially built, the training file is split into lines, and the newline at the end of each line is dropped. The group_texts method then concatenates the tokenized lines into blocks without adding the newlines back. Appending \n to each line in tokenize_function puts those newline tokens back into the training data.
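
You can see the difference directly in the tokenizer output (a quick check with made-up input lines):

    # Compare tokenization with and without the appended "\n".
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    newline_id = tokenizer("\n")["input_ids"][0]  # GPT-2 encodes "\n" as a single token

    lines = [
        "public class FindCityByIdService {",
        "    private CityRepository cityRepository = ...",
    ]

    without_nl = tokenizer(lines)["input_ids"]
    with_nl = tokenizer([line + "\n" for line in lines])["input_ids"]

    print(any(newline_id in ids for ids in without_nl))   # False: no newline tokens anywhere
    print(all(ids[-1] == newline_id for ids in with_nl))  # True: every example now ends in "\n"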

I just tested this change on my fine-tuning job and it worked: the resulting model now generates newlines.