Huggingface load_dataset messes up the structure of the dataset


Following https://huggingface.co/docs/datasets/en/loading#json, I am trying to load this dataset, https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json, into a Hugging Face Dataset:

from datasets import load_dataset

dataset = load_dataset("json", data_files="task.json", field="examples")

But the structure of the data is not preserved. For example, I cannot read the first row correctly, because its target_scores field is filled with keys that do not belong to it, all set to None.

Can someone explain why?

This is how my first row shows up:

{'input': 'Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?', 'target_scores': {'01/01/1827': None, '01/01/1845': None, '01/01/1891': None, '01/01/1898': None, '01/01/1899': None, '01/01/1900': None, '01/01/1929': None, '01/01/1930': None, '01/01/1934': None, '01/01/1951': None, '01/01/1958': None, '01/01/1961': None, '01/01/1964': None, '01/01/1987': None,... }}

I expect to get this:

{"input": "Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?", "target_scores": { "05/01/2021": 1, "02/23/2021": 0, "03/11/2021": 0, "05/09/2021": 0, "06/12/2021": 0, "04/29/2021": 0 } }


There is 1 best solution below

Quentin Lhoest

datasets expects data in a columnar/tabular format: all examples in the dataset must have the same fields and subfields. However, the subfields of your "target_scores" field differ between examples, so datasets merges them into a single schema and fills the subfields missing from each example with None.
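To make the effect concrete, here is a minimal pure-Python sketch that only imitates the behavior described above; it is not the library's actual code. The helper names unify_struct_rows and drop_missing are hypothetical, and the second toy row is made up for illustration.

```python
def unify_struct_rows(rows):
    """Imitate a fixed schema: every row receives the union of all keys."""
    all_keys = sorted({key for row in rows for key in row})
    # Keys a row never had are filled with None.
    return [{key: row.get(key) for key in all_keys} for row in rows]


def drop_missing(scores):
    """Recover the original dict by discarding the None placeholders."""
    return {key: value for key, value in scores.items() if value is not None}


# Two toy "target_scores" dicts with different keys (second one is invented).
rows = [
    {"05/01/2021": 1, "02/23/2021": 0},
    {"01/01/1900": 1, "01/01/1899": 0},
]

unified = unify_struct_rows(rows)
print(unified[0])                # the first row now also carries the other row's keys
print(drop_missing(unified[0]))  # filtering None gives back the original entries
```

If changing the file is not an option, filtering out the None values row by row, as drop_missing does here, may be enough to read the scores back, assuming no real score is ever None.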

You can try using a list formatted like this instead:

{
  "input": "Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?",
  "target_scores": [{"date": "05/01/2021", "score": 1}, {"date": "02/23/2021", "score": 0}]
}
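
One way to get there is to rewrite the file before loading it. The sketch below assumes task.json has a top-level "examples" list whose items carry "input" and "target_scores", as in the question; the function names and the output file name are hypothetical.

```python
import json


def to_list_format(example):
    """Turn the target_scores dict into a list of {"date", "score"} records."""
    return {
        "input": example["input"],
        "target_scores": [
            {"date": date, "score": score}
            for date, score in example["target_scores"].items()
        ],
    }


def convert_task_file(src_path, dst_path):
    """Read the original task file and write a copy in the list format."""
    with open(src_path, encoding="utf-8") as f:
        task = json.load(f)
    # Only the "examples" field is kept, since that is what gets loaded.
    converted = {"examples": [to_list_format(ex) for ex in task["examples"]]}
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(converted, f)
    return converted
```

After calling convert_task_file("task.json", "task_list.json"), the load_dataset call from the question, pointed at the new file with the same field="examples" argument, should give every row the same two subfields ("date" and "score"), so nothing needs to be padded with None.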