Following https://huggingface.co/docs/datasets/en/loading#json, I am trying to load this dataset https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json into a Hugging Face Dataset:
from datasets import load_dataset
dataset = load_dataset("json", data_files="task.json", field="examples")
But the structure of the data is not preserved. For example, I cannot access the first row correctly, because its target_scores are messed up.
Can someone explain why?
This is how my first row shows up:
{'input': 'Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?', 'target_scores': {'01/01/1827': None, '01/01/1845': None, '01/01/1891': None, '01/01/1898': None, '01/01/1899': None, '01/01/1900': None, '01/01/1929': None, '01/01/1930': None, '01/01/1934': None, '01/01/1951': None, '01/01/1958': None, '01/01/1961': None, '01/01/1964': None, '01/01/1987': None,... }}
I expect to get this:
{"input": "Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?", "target_scores": { "05/01/2021": 1, "02/23/2021": 0, "03/11/2021": 0, "05/09/2021": 0, "06/12/2021": 0, "04/29/2021": 0 } }
datasets expects data in a columnar/tabular format: all the examples in the dataset must have the same fields and subfields. However, your "target_scores" field has subfields that differ between examples, so datasets fills the missing ones with None. You can try using a list formatted like this instead:
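A minimal sketch of one such list layout, assuming each entry of target_scores becomes a small record with the same two keys. The field names "target" and "score", the helper names, and the output filename task_list.json are hypothetical choices for illustration, not something datasets requires:

```python
import json


def scores_to_list(example):
    """Return a copy of the example with target_scores as a list of records."""
    return {
        **example,
        # Every record has the same two keys ("target" and "score" are
        # hypothetical names), so the schema is identical across examples.
        "target_scores": [
            {"target": target, "score": score}
            for target, score in example["target_scores"].items()
        ],
    }


def convert_task_file(src="task.json", dst="task_list.json"):
    """Rewrite the task file so that "examples" uses the list layout."""
    with open(src, encoding="utf-8") as f:
        task = json.load(f)
    task["examples"] = [scores_to_list(ex) for ex in task["examples"]]
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(task, f)


# After running convert_task_file(), the loading call from the question
# would point at the converted file instead:
# dataset = load_dataset("json", data_files="task_list.json", field="examples")
```

With this layout, the number of answer options can vary from one example to the next without changing the set of fields, so there are no missing subfields to fill in.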