Unsupervised Fine-tuning for ASR


Using the tutorial by Patrick von Platen (https://huggingface.co/blog/fine-tune-xlsr-wav2vec2), I managed to fine-tune Wav2Vec2 on annotated audio datasets in a supervised manner.

I now have a custom dataset, dsb-untranscribed, which consists only of unannotated audio data in one specific language. I want to "fine-tune" wav2vec2-large-xls-r-300m on this audio data before I later actually fine-tune the model on an annotated dataset. This means I need to train in an unsupervised manner. How do I do this correctly? Here is what I am doing so far:
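From reading the transformers documentation, my understanding is that the unsupervised objective is exposed through `Wav2Vec2ForPreTraining` rather than the CTC head I used in the supervised tutorial. Here is a minimal sketch of what I think the model setup looks like — note that I use a tiny, randomly initialized config purely for illustration, not the actual XLS-R checkpoint:

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

# Tiny random config so this runs without downloading weights; for the real
# thing I assume one would instead load the checkpoint:
#   model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-xls-r-300m")
config = Wav2Vec2Config(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)
model = Wav2Vec2ForPreTraining(config)

# Note there is no tokenizer and no vocab_size here: the pretraining objective
# is contrastive over quantized latent speech units, not over characters.
```

If that is right, it would also answer my tokenizer question below, since the pretraining head never touches text.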

I have started by loading my dataset as follows:

from datasets import load_dataset

dsb_untranscribed = load_dataset("TiMauzi/dsb-untranscribed")

Then I did some preprocessing:

from datasets import Audio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Processor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0,
                                             do_normalize=True, return_attention_mask=True)

# `tokenizer` is the one from the supervised tutorial; I am not sure
# it is needed at all when there are no transcriptions
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
dsb_untranscribed = dsb_untranscribed.cast_column("audio", Audio(sampling_rate=16_000))
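After the `cast_column` call, each example's "audio" column should be a dict with "array" and "sampling_rate". Based on the supervised tutorial, I assume a map step like the following turns it into `input_values` — and I suspect the bare feature extractor is enough here, without the processor/tokenizer:

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)

def prepare(batch):
    # Assumes each example has an "audio" dict with "array" and
    # "sampling_rate", which is what cast_column(..., Audio(...)) produces.
    audio = batch["audio"]
    inputs = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_values"] = inputs.input_values[0]
    batch["input_length"] = len(batch["input_values"])
    return batch

# My guess at how this would be applied (column names may differ for this dataset):
# dsb_untranscribed = dsb_untranscribed.map(
#     prepare, remove_columns=dsb_untranscribed["train"].column_names)
```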

Finally, I loaded the pretrained model (not sure whether a tokenizer and a vocab size are needed here):

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

model.freeze_feature_encoder()  # freeze_feature_extractor() is deprecated in newer transformers versions

Now I would need to define my Trainer and TrainingArguments before I can use trainer.train() to start the training process. How do I do this correctly?
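For reference, here is my understanding of what a single unsupervised training step has to compute, pieced together from the `_compute_mask_indices` / `_sample_negative_indices` helpers that the official `run_wav2vec2_pretraining_no_trainer.py` example uses (again with a tiny random config so it runs without the XLS-R weights; whether this can be wrapped in the standard `Trainer`, or needs a custom data collator and training loop like the official script, is exactly what I am unsure about):

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Tiny random config for illustration; in practice this would be
# Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-xls-r-300m").
config = Wav2Vec2Config(
    hidden_size=64, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=128,
)
model = Wav2Vec2ForPreTraining(config)
model.train()

# Stand-in batch: 2 clips of one second of 16 kHz audio (the real
# input_values would come from the feature extractor / data collator).
input_values = torch.randn(2, 16000)
seq_len = int(model._get_feat_extract_output_lengths(torch.tensor(16000)))

# Mask ~65% of the time steps (as in the wav2vec 2.0 paper) and sample
# negatives from the masked positions, mirroring the official script.
mask_time_indices = _compute_mask_indices((2, seq_len), mask_prob=0.65, mask_length=10)
sampled_negative_indices = _sample_negative_indices(
    (2, seq_len), num_negatives=config.num_negatives, mask_time_indices=mask_time_indices
)

outputs = model(
    input_values=input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
loss = outputs.loss  # contrastive loss + weighted codebook diversity loss
loss.backward()
```

Is there a ready-made data collator and `Trainer` configuration that produces these masked/negative indices per batch, or do I have to write the loop myself as above?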
