I am trying to read a 370 MB JSON file:
import json
with open("data.json", "r") as f:  # close the file automatically
    data = json.load(f)  # parse the whole file as a single JSON document
but I cannot easily find the root cause of the following error:
json.decoder.JSONDecodeError: Extra data: line 1 column 1024109 (char 1024108)
I looked at similar questions and tried the following Stack Overflow answer:
import json
data = [json.loads(line) for line in open('data.json', 'r')]
But it didn't resolve the issue. I am wondering if there is any solution to find where the error happens in the file. I am getting some other files from the same source and they run without any problem.
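One way to see where the error happens is to catch the exception and print the text around the reported offset. The following is only a minimal sketch: `show_error_context` and its `width` parameter are hypothetical names, and the sample string is made up for illustration. It relies on `json.JSONDecodeError` exposing `msg` and `pos`, where `pos` is the same zero-based character offset shown as `char` in the error message. An "Extra data" error generally means a complete JSON value ended at that offset and something else follows it.

```python
import json


def show_error_context(text, width=40):
    """Parse text; on failure return (message, offset, nearby characters).

    Hypothetical helper for illustration. Returns None if text parses cleanly.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the zero-based character offset where parsing stopped.
        start = max(err.pos - width, 0)
        return err.msg, err.pos, text[start:err.pos + width]
    return None


if __name__ == "__main__":
    # Made-up sample: two JSON documents back to back, which is one
    # common way to get an "Extra data" error.
    sample = '{"uri": "p"}{"uri": "q"}'
    print(show_error_context(sample, width=5))
```

For the real file, the same function could be called on the result of `f.read()` to see what sits at character 1024108.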
A small piece of the JSON file, which is a list of dicts, looks like this:
{
  "uri": "p",
  "source": {
    "uri": "dail",
    "dataType": "pr",
    "title": "Daily"
  },
  "authors": [
    {
      "type": "author",
      "isAgency": false
    }
  ],
  "concepts": [
    {
      "amb": false,
      "imp": true,
      "date": "2019-05-23",
      "textStart": 2459,
      "textEnd": 2467
    },
    {
      "amb": false,
      "imp": true,
      "date": "2019-05-09",
      "textStart": 2684,
      "textEnd": 2691
    }
  ],
  "shares": {},
  "wgt": 100,
  "relevance": 100
}
The problem with the json library is that it loads everything into memory and parses it in full before you can work with it, which is clearly problematic for such a large amount of data. Instead, I would suggest taking a look at https://github.com/henu/bigjson
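If the "Extra data" error comes from several JSON documents written one after another into the same file (an assumption, since the full file is not shown), the standard library can still read them one value at a time with `json.JSONDecoder.raw_decode`. The sketch below uses a hypothetical helper name, `iter_json_values`, and a made-up sample string. Note that it still holds the whole text in memory, so it addresses the parse error rather than the memory footprint.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value found in text, in order.

    Hypothetical helper: assumes text is zero or more JSON documents
    concatenated together, optionally separated by whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the parsed value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


if __name__ == "__main__":
    # Made-up sample with three concatenated documents.
    sample = '{"uri": "p"} {"uri": "q"}\n[1, 2, 3]'
    for value in iter_json_values(sample):
        print(value)
```

If the file turns out to contain something other than concatenated documents at the reported offset, `raw_decode` will raise `json.JSONDecodeError` there, which at least confirms where the malformed part starts.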