TypeError: list indices must be integers or slices, not str when importing an HF dataset from a local path


I want to use the map function on the Hugging Face dataset MKQA, so I first downloaded it to the local path "/data/mkqa-Chinese" and loaded it:

from datasets import load_dataset

data_path = "/data/mkqa-Chinese"
raw_dataset = load_dataset(data_path)

The structure of raw_dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['query', 'answers', 'queries', 'example_id'],
        num_rows: 6758
    })
})
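Judging by the features and the tokenization code, a single row combines plain strings with nested dicts; a rough sketch with invented values:

```python
# One row of the train split, sketched with invented placeholder values:
# 'queries' is a dict keyed by language code, and 'answers' maps language
# codes to lists of answer dicts carrying a 'text' field.
row = {
    "example_id": "0001",
    "query": "example question in English",
    "queries": {"zh_cn": "问题文本"},
    "answers": {"zh_cn": [{"text": "答案文本"}]},
}
print(row["answers"]["zh_cn"][0]["text"])  # → 答案文本
```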

Then I want to map it with a tokenizer:

from transformers import AutoTokenizer

model_path = "/data/bigscience/bloomz-3b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'
def tok(sample):
    prompt_and_chosen = " Human: " + sample['queries']['zh_cn'] + " Assistant: " + sample['answers']['zh_cn'][0]['text']
    model_inps =  tokenizer(prompt_and_chosen, padding=True, max_length=512, truncation=True)
    return model_inps

tokenized_training_data = raw_dataset['train'].map(tok, batched=True)
print(tokenized_training_data)
print("pause")

However, it raises a TypeError:

 processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/novo_trl_sft.py", line 548, in tok
    prompt_and_chosen = " Human: " + sample['queries']['zh_cn'] + " Assistant: " + sample['answers']['zh_cn'][0]['text']
TypeError: list indices must be integers or slices, not str
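For reference, this exact TypeError appears whenever a list is indexed with a string key; a minimal standalone reproduction (toy value):

```python
# Indexing a list with a string key raises the same TypeError as above.
queries = [{"zh_cn": "问题一"}]  # a list of dicts, not a dict
try:
    queries["zh_cn"]  # str index into a list
except TypeError as e:
    print(e)  # → list indices must be integers or slices, not str
```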

I guess the problem is with the Dataset class, but how do I fix it?


There is 1 answer below.


When you set batched=True in map, your function receives a batch of multiple samples: each column arrives as a list rather than a single value. In your case:

def tok(batch):
    # each column is a list; build one prompt per (query, answer) pair
    prompt_and_chosen = [
        " Human: " + query['zh_cn'] + " Assistant: " + answer['zh_cn'][0]['text']
        for query, answer in zip(batch['queries'], batch['answers'])
    ]
    model_inps = tokenizer(prompt_and_chosen, padding=True, max_length=512, truncation=True)
    return model_inps
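The reshaped comprehension can be checked without a tokenizer; a minimal sketch of the column-oriented batch that map(..., batched=True) hands to the function, with invented values:

```python
# Toy batch (invented values): every column is a list, one entry per sample.
batch = {
    "queries": [{"zh_cn": "问题一"}, {"zh_cn": "问题二"}],
    "answers": [{"zh_cn": [{"text": "答案一"}]},
                {"zh_cn": [{"text": "答案二"}]}],
}
# Zip the parallel columns and index each element per sample.
prompts = [
    " Human: " + q["zh_cn"] + " Assistant: " + a["zh_cn"][0]["text"]
    for q, a in zip(batch["queries"], batch["answers"])
]
print(prompts[0])  # →  Human: 问题一 Assistant: 答案一
```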