Python error "Extra data: line 1" when loading a big JSON file


I am trying to read a JSON file that is 370 MB:

import json

# Open the file in a context manager so it is closed after parsing.
with open("data.json", "r") as f:
    data = json.load(f)

but I cannot easily find the root cause of the following error:

json.decoder.JSONDecodeError: Extra data: line 1 column 1024109 (char 1024108)

I looked at similar questions and tried the following Stack Overflow answer:

import json
data = [json.loads(line) for line in open('data.json', 'r')]

But it didn't resolve the issue. Is there a way to find where in the file the error occurs? I get other files from the same source and they load without any problem.
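
One way to see where the parser gives up is to catch the exception and print the text around the reported offset. The following is only a minimal sketch: `show_error_context` is a hypothetical helper name, the `sample` string is a made-up stand-in for the real file, and it assumes the whole text fits in memory. `JSONDecodeError.pos` is the character index where parsing failed; for "Extra data" that is the point just after the first complete JSON value.

```python
import json


def show_error_context(text, width=40):
    """Return (position, snippet) around the failure, or None if text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the character offset reported in the error message.
        start = max(err.pos - width, 0)
        return err.pos, text[start:err.pos + width]
    return None


# Made-up stand-in for the real file: two objects back to back.
sample = '{"a": 1}{"b": 2}'
result = show_error_context(sample)
print(result)
```

For the real file, the text would come from reading data.json, and the snippet should show what follows the first complete value at character 1024108.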

A small piece of the JSON file, which is a list of dicts, looks like this:

{
"uri": "p",
"source": {
    "uri": "dail",
    "dataType": "pr",
    "title": "Daily"
},
"authors": [
    {
        "type": "author",
        "isAgency": false
    }
],
"concepts": [

    {
        "amb": false,
        "imp": true,
        "date": "2019-05-23",
        "textStart": 2459,
        "textEnd": 2467
    },
    {
        "amb": false,
        "imp": true,
        "date": "2019-05-09",
        "textStart": 2684,
        "textEnd": 2691
    }
],
"shares": {},
"wgt": 100,
"relevance": 100
}

There is 1 answer below.


The problem with the json library is that it loads everything into memory, parses it in full, and then handles it in memory, which is clearly problematic for such a large amount of data.

Instead, I would suggest taking a look at https://github.com/henu/bigjson:

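import bigjson

# bigjson loads values from the file only when they are accessed,
# so work with json_data while the file is still open.
with open('data.json', 'rb') as f:
    json_data = bigjson.load(f)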
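
Separately from memory use, "Extra data" is what the standard json module reports when it finds more non-whitespace text after one complete JSON value, for example several objects written back to back. If that turns out to be the case here (an assumption, since the full file is not shown), `json.JSONDecoder.raw_decode` can read the values one at a time. A rough sketch, with `iter_json_values` as a hypothetical helper name and `sample` as made-up data:

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Made-up data: three objects separated only by whitespace.
sample = '{"uri": "p"}\n{"uri": "q"} {"uri": "r"}'
values = list(iter_json_values(sample))
print(values)
```

This still reads the whole text into memory, and it will raise if the values are separated by anything other than whitespace (such as commas), so it is a diagnostic aid rather than a fix for very large files.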