How to load a large JSON file to a Pandas Dataframe


I have 16 JSON files, each about 14GB in size. I've tried the following approach to read them line by line.

import io
import ijson
import pandas as pd

dfObj = pd.DataFrame(columns=["prefix", "type", "value"])

with open(file_name, encoding="UTF-8") as json_file:
    cursor = 0
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        for prefix, type, value in json_parser:
            dfObj = dfObj.append({"prefix": prefix, "type": type, "value": value}, ignore_index=True)
        cursor += len(line)

My aim is to load them into a pandas data frame to perform some search operations.

The problem is that this approach takes a lot of time to read the file.

Is there any other optimal approach to achieve this?


There are 2 best solutions below


You can pass json_file directly to ijson.parse just once, instead of reading individual lines out of it. If your files have more than one top-level JSON value, you can use the multiple_value=True option.

Also make sure you are using an up-to-date ijson, and that the yajl2_c backend is the one in use (in ijson 3 you can check which backend is selected by looking at ijson.backend).
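
A minimal sketch of this approach, assuming each file is a stream of top-level JSON values (the file name is a placeholder):

import ijson
import pandas as pd

print(ijson.backend)  # ideally "yajl2_c" for best performance (ijson 3+)

rows = []
with open("part_0.json", "rb") as json_file:  # hypothetical file name
    # Single streaming pass over the whole file; multiple_value=True
    # accepts more than one top-level JSON value in the same stream.
    for prefix, event, value in ijson.parse(json_file, multiple_value=True):
        rows.append({"prefix": prefix, "type": event, "value": value})

# Build the DataFrame once at the end instead of appending row by row.
df = pd.DataFrame(rows)

Collecting the rows in a list and constructing the DataFrame once is also much faster than calling DataFrame.append inside the loop.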


You can use the built-in pandas function pandas.read_json(); see the pandas documentation for details.
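
For example, if the files are newline-delimited JSON (one object per line), read_json accepts lines=True and chunksize so each file can be processed in pieces instead of being loaded into memory all at once; the file name, chunk size, and filter column below are placeholders:

import pandas as pd

chunks = []
# chunksize with lines=True returns an iterator of DataFrames
reader = pd.read_json("part_0.json", lines=True, chunksize=100_000)
for chunk in reader:
    # Optionally filter or search each chunk before keeping it, e.g.:
    # chunk = chunk[chunk["some_column"] == "some_value"]
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)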