How to inspect values in binarized FairSeq datasets?

524 Views Asked by Jindřich At 06 June 2022 at 11:34

Running the fairseq-preprocess script produces binary files with integer indices corresponding to token ids in a dictionary.

When I no longer have the original tokenized texts, what is the simplest way to explore the binarized dataset? The documentation does not say much about how a dataset can be loaded for debugging purposes.


There is 1 best solution below

David Dale On 29 September 2022 at 12:20 BEST ANSWER

I worked around this by loading the trained model and using it to decode the binarized sentences back to strings:

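import os

from fairseq.models.transformer import TransformerModel

# Placeholders: fill in your own paths.
model_dir = '???'  # directory with the checkpoint and the SentencePiece model
data_dir = '???'   # directory with the binarized data (fairseq-preprocess output)

model = TransformerModel.from_pretrained(
    model_dir,
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path=data_dir,
    bpe='sentencepiece',
    sentencepiece_model=os.path.join(model_dir, 'sentencepiece.joint.bpe.model'),
)
# Load the binarized training split through the model's task.
model.task.load_dataset('train')
data_bin = model.task.datasets['train']
# Decode each pair of id tensors back to plain text.
train_pairs = [
    (model.decode(item['source']), model.decode(item['target']))
    for item in data_bin
]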
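
If you only want to look at the raw ids and tokens and do not have (or do not want to load) a trained model, something along these lines might also work. This is a rough sketch, not a tested recipe: the file names (`dict.en.txt`, `train.en-de.en`) and the helper names `read_symbols`, `ids_to_tokens` and `peek` are made up for illustration. It also assumes that the dictionary uses fairseq's default special symbols at ids 0 to 3 and that `dict.txt` contains plain `token count` lines. The `peek` helper relies on `Dictionary.load` and `data_utils.load_indexed_dataset` from `fairseq.data`; the other two functions are plain Python.

```python
import os

# Assumption: fairseq's default Dictionary reserves ids 0-3 for these symbols,
# and dict.txt lists the remaining tokens one per line as "<token> <count>".
SPECIAL_SYMBOLS = ["<s>", "<pad>", "</s>", "<unk>"]


def read_symbols(dict_path):
    """Return a list whose position i holds the token for id i."""
    symbols = list(SPECIAL_SYMBOLS)
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                # The count is the last space-separated field on the line.
                symbols.append(line.rsplit(" ", 1)[0])
    return symbols


def ids_to_tokens(ids, symbols):
    """Map integer ids to token strings, falling back to <unk>."""
    return [symbols[i] if 0 <= i < len(symbols) else "<unk>" for i in ids]


def peek(data_dir, split="train", lang_pair="en-de", lang="en", n=3):
    """Print the first n sentences of one binarized split (needs fairseq).

    The split, language pair and language names are hypothetical defaults.
    """
    from fairseq.data import Dictionary, data_utils  # imported lazily

    dictionary = Dictionary.load(os.path.join(data_dir, f"dict.{lang}.txt"))
    # Path prefix without the .bin/.idx extension.
    prefix = os.path.join(data_dir, f"{split}.{lang_pair}.{lang}")
    dataset = data_utils.load_indexed_dataset(prefix, dictionary)
    for i in range(min(n, len(dataset))):
        print(dataset[i].tolist(), "->", dictionary.string(dataset[i]))
```

Calling `peek(data_dir)` should print each sentence as a list of ids next to its subword tokens. The output is still BPE-segmented, so undoing the segmentation needs the SentencePiece model, as in the snippet above.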
