I want to use the map function on the Hugging Face dataset [mkqa]. I first downloaded it and put everything into the local path "/data/mkqa-Chinese", then loaded it with:
from datasets import load_dataset

data_path = "/data/mkqa-Chinese"  # local copy of mkqa
raw_dataset = load_dataset(data_path)
The structure of raw_dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['query', 'answers', 'queries', 'example_id'],
        num_rows: 6758
    })
})
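For reference, a single row can be inspected like this (a quick check, assuming the usual mkqa schema in which queries and answers are per-example dicts keyed by language code):

print(raw_dataset['train'][0]['queries']['zh_cn'])             # the Chinese query string
print(raw_dataset['train'][0]['answers']['zh_cn'][0]['text'])  # text of the first Chinese answer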
Then I want to map it with the tokenizer, like this:
from transformers import AutoTokenizer

model_path = "/data/bigscience/bloomz-3b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

def tok(sample):
    prompt_and_chosen = " Human: " + sample['queries']['zh_cn'] + " Assistant: " + sample['answers']['zh_cn'][0]['text']
    model_inps = tokenizer(prompt_and_chosen, padding=True, max_length=512, truncation=True)
    return model_inps
tokenized_training_data = raw_dataset['train'].map(tok, batched=True)
print(tokenized_training_data)
print("pause")
However, it raises a TypeError:
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/novo_trl_sft.py", line 548, in tok
    prompt_and_chosen = " Human: " + sample['queries']['zh_cn'] + " Assistant: " + sample['answers']['zh_cn'][0]['text']
TypeError: list indices must be integers or slices, not str
I guess the problem is with the Dataset class, but how do I fix it?
When you set batched=True in map, your function receives a batch of multiple samples instead of a single one. In your case, sample['queries'] is then a list of dicts (one per example in the batch), so indexing it with the string 'zh_cn' raises "TypeError: list indices must be integers or slices, not str". The function must either iterate over the batch or be called without batched=True.
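Here is a minimal sketch of a batch-aware tok, assuming the field layout from your own code (queries and answers are per-example dicts keyed by language code such as 'zh_cn'):

def tok(batch):
    # With batched=True, batch['queries'] and batch['answers'] are lists
    # containing one dict per example, so walk them in parallel.
    prompts = [
        " Human: " + queries['zh_cn'] + " Assistant: " + answers['zh_cn'][0]['text']
        for queries, answers in zip(batch['queries'], batch['answers'])
    ]
    # The tokenizer accepts a list of strings and returns batched encodings.
    return tokenizer(prompts, padding=True, max_length=512, truncation=True)

tokenized_training_data = raw_dataset['train'].map(tok, batched=True)

Alternatively, keep your original single-sample tok and simply drop batched=True; map then passes one example at a time, at the cost of tokenizing sample by sample.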