Garbage values while reading large JSON files


I have to read a large JSON file (about 3 GB) using Python. There is a garbage value '][' between records in the file. For smaller files, I used the script below to trim the garbage values:

filename = r'C:\Users\user1\Downloads\samplefile.json'
with open(filename, encoding="utf8") as json_file:
    data = json_file.read()
data = data.replace('][', ',')
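For what it's worth, the same replacement can be done in a streaming fashion without holding the whole 3 GB in memory: read the file in chunks, replace as you go, and hold back the last character of each chunk in case a '][' straddles a chunk boundary. A minimal sketch (function name, paths, and chunk size are placeholders, not part of the original scripts):

```python
def clean_json(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing every '][' with ','.

    Only ~chunk_size characters are held in memory at a time.
    """
    with open(src_path, encoding='utf8') as src, \
         open(dst_path, 'w', encoding='utf8') as dst:
        carry = ''
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                # End of input: flush whatever was held back.
                dst.write(carry)
                break
            buf = (carry + chunk).replace('][', ',')
            # Hold back the last character in case a '][' is split
            # across this chunk and the next one.
            carry = buf[-1]
            dst.write(buf[:-1])
```

The cleaned output file can then be fed to ijson (or pandas) as ordinary valid JSON, at the cost of one extra pass over the data and the disk space for the copy.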

For the large file, I used the script below and got the following error; for the smaller files, the same issue was handled by the replace in the script above.

Script:

import ijson
f = ijson.items(open(r'C:\Users\user1\Downloads\samplefile.json', 'r'), 'item')

Error:

IncompleteJSONError: parse error: trailing garbage 82220.00,"NUMBER":1799106.00}][{"DATE":"2021092412504700000 (right here) ------^

I have also tried pandas' read_json to read this file but ended up with the same error. Any ideas on how to trim this garbage value would be really helpful. I have not shared the file or any samples because the files live on a secure system.

I have also tried the file wrapper class below, but I still end up with a MemoryError:

import ijson

class Foo(object):
    def __init__(self, fpath, mode, encoding):
        self.f = fpath
        self.mode = mode
        self.encoding = encoding

    def __enter__(self):
        print('context begun')
        self.file = open(self.f, self.mode, encoding=self.encoding)
        # This reads the entire 3 GB file into one string, which is
        # where the MemoryError comes from:
        self.file = self.file.read().replace('][', ',')
        return self.file

    def __exit__(self, exc_type, exc_val, exc_tb):
        print('closed')

with Foo(r'C:\Users\user1\Downloads\samplefile.json', 'r', encoding='utf-8') as json_file:
    objects = ijson.items(json_file, 'items')
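The wrapper above still calls .read() on the whole file, so ijson's streaming never gets a chance to help. A lazier alternative is a file-like object whose read(size) method pulls one chunk at a time and performs the replacement on the fly, buffering one extra character so a '][' split across two chunks is still caught. This is only a sketch under the assumption that ijson will accept any object exposing a read() method; the class name and chunk size are illustrative:

```python
class CleanedFile:
    """File-like wrapper that presents the file with every '][' already
    replaced by ',', holding only ~chunk_size characters in memory."""

    def __init__(self, path, chunk_size=1 << 20):
        self.f = open(path, encoding='utf-8')
        self.chunk_size = chunk_size
        self.buf = ''

    def read(self, size=-1):
        eof = False
        # Buffer one character more than requested, so a '][' split
        # across chunks is replaced before anything is handed out.
        while not eof and (size < 0 or len(self.buf) < size + 1):
            chunk = self.f.read(self.chunk_size)
            if not chunk:
                eof = True
            else:
                self.buf = (self.buf + chunk).replace('][', ',')
        avail = len(self.buf) if eof else len(self.buf) - 1
        n = avail if size < 0 else min(size, avail)
        out, self.buf = self.buf[:n], self.buf[n:]
        return out

    def close(self):
        self.f.close()
```

With something like this, the earlier call would become ijson.items(CleanedFile(r'C:\Users\user1\Downloads\samplefile.json'), 'item') — note the prefix 'item' (singular), which is what ijson uses for the elements of a top-level array.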