How to use ijson/other to parse this large JSON file?

I have a massive JSON file (8 GB), and I run out of memory when trying to read it into Python. How would I implement a similar procedure using ijson or some other library that is more efficient with large JSON files?

import json

# There are (say) 1m objects in this file, each its own JSON object.
with open('my_file.json') as json_file:
    data = json_file.readlines()  # this reads the whole file into memory
    # So I take a list of these JSON objects
    list_of_objs = [json.loads(line) for line in data]

# But I only want about 200 of the JSON objects
desired_data = [obj for obj in list_of_objs if obj['feature'] == "desired_feature"]

How would I implement this using ijson or something similar? Is there a way I can extract the objects I want without reading in the whole JSON file?

The file is a list of objects like:

{
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",
    "stars": 4,
    "date": "2016-03-09",
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
    "useful": 0,
    "funny": 0
}

There are 2 best solutions below

The problem is that not all JSON comes nicely formatted, so you cannot rely on line-by-line parsing to extract your objects. I understood your "acceptance criteria" as "collect only those JSON objects whose specified key contains a specified value", for example collecting objects about a person only if that person's name is "Bob". The following function returns a list of all objects that fit those criteria. Parsing is done character by character (which would be much faster in C, but Python copes well enough). This is more robust than splitting on lines because it doesn't care about newlines or other formatting. I tested this on both formatted and unformatted JSON with 1,000,000 objects.

import json

def parse_out_objects(file, feature, desired_value):
    """Collect top-level JSON objects whose `feature` key equals `desired_value`."""
    selected_objects = []
    chars = []          # characters of the object currently being composed
    depth = 0           # nesting depth of '{' ... '}' outside of strings
    in_string = False   # True while inside a JSON string literal
    escaped = False     # True if the previous character in a string was a backslash
    with open(file) as f:
        while True:
            c = f.read(1)
            if not c:   # end of file
                break
            if depth > 0:
                chars.append(c)
            if in_string:
                # Braces and escaped quotes inside strings must not change the state.
                if escaped:
                    escaped = False
                elif c == '\\':
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c == '{':
                if depth == 0:
                    chars = ['{']   # start composing a new top-level object
                depth += 1
            elif c == '}' and depth > 0:
                depth -= 1
                if depth == 0:
                    # A complete top-level object has been read: parse and filter it.
                    json_object = json.loads(''.join(chars))
                    if json_object.get(feature) == desired_value:
                        selected_objects.append(json_object)
                    chars = []
    return selected_objects
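
As an alternative sketch (a different technique from the function above), the standard library's json.JSONDecoder.raw_decode can report where each object ends, which leaves the string and brace bookkeeping to the json module. The name iter_json_objects and the chunk_size parameter are made up for illustration, and the sketch assumes the top-level values are objects, either concatenated or wrapped in a single array.

```python
import json

def iter_json_objects(path, chunk_size=65536):
    """Yield top-level JSON objects one at a time (hypothetical helper).

    Assumes the file holds objects that are either concatenated
    (with or without newlines) or wrapped in one top-level array.
    """
    decoder = json.JSONDecoder()
    buffer, pos = '', 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            # Drop what has already been parsed, keep the unparsed tail.
            buffer = buffer[pos:] + chunk
            pos = 0
            while True:
                # Skip whitespace and array punctuation between objects.
                while pos < len(buffer) and buffer[pos] in ' \t\r\n,[]':
                    pos += 1
                if pos >= len(buffer):
                    break
                try:
                    obj, pos = decoder.raw_decode(buffer, pos)
                except json.JSONDecodeError:
                    if not chunk:
                        raise   # end of file: the data is truncated or malformed
                    break       # object is incomplete, read another chunk
                yield obj
            if not chunk:       # end of file reached and buffer fully parsed
                return
```

Because this is a generator, filtering stays lazy, for example [o for o in iter_json_objects('my_file.json') if o.get('feature') == 'desired_feature']. One caveat: malformed data in the middle of the file makes the buffer grow until the end of the file before the error is raised.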

You wrote: "The file is a list of objects".

This is a little ambiguous. Judging by your code snippet, your file contains a separate JSON object on each line, which is not the same as an actual JSON array that starts with [, ends with ], and has , between items.
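
If you are unsure which layout you have, a quick check is to peek at the first non-whitespace character without loading the file. This is only a rough sketch: the function name first_significant_char is invented here, and a leading { cannot tell you whether the file holds one object or one object per line.

```python
def first_significant_char(path):
    """Return the first non-whitespace character of a file ('' if there is none)."""
    with open(path) as f:
        while True:
            c = f.read(1)
            # Stop at end of file ('') or at the first non-whitespace character.
            if not c or not c.isspace():
                return c
```

A result of '[' suggests a top-level JSON array, while '{' suggests either a single object or a series of objects.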

In the case of a json-per-line file it's as easy as:

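import json
from itertools import islice

filename = 'my_file.json'

with open(filename) as f:
    # Lazily parse one JSON object per line
    objects = (json.loads(line) for line in f)
    # Consume the iterator while the file is still open
    objects = list(islice(objects, 200))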

Note the differences:

  • you don't need .readlines(), the file object itself is an iterable that yields individual lines
  • parentheses (..) instead of brackets [..] in (... for line in f) create a lazy generator expression instead of a Python list in memory with all the lines
  • islice(objects, 200) will give you the first 200 items without iterating further. If objects were a list, you could just do objects[:200]
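
To keep only the objects that match a condition, as in the question, the same lazy pipeline can carry a filter. The following is a minimal sketch that assumes one JSON object per line; the function name filter_json_lines and the limit parameter are illustrative, not part of any library.

```python
import json
from itertools import islice

def filter_json_lines(path, feature, desired_value, limit=None):
    """Return objects whose `feature` equals `desired_value`, reading line by line."""
    with open(path) as f:
        # Parse lazily, skipping blank lines; only one line is held at a time.
        parsed = (json.loads(line) for line in f if line.strip())
        # Keep matching objects; .get avoids a KeyError when the key is missing.
        matches = (obj for obj in parsed if obj.get(feature) == desired_value)
        # Materialize inside the with block, while the file is still open.
        return list(islice(matches, limit))
```

For example, filter_json_lines('my_file.json', 'feature', 'desired_feature') collects every match, and passing limit=200 stops reading as soon as 200 matches are found.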

Now, if your file is actually a JSON array then you indeed need ijson:

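import ijson  # or choose a faster backend if needed
from itertools import islice

filename = 'my_file.json'

# ijson expects bytes, so open the file in binary mode
with open(filename, 'rb') as f:
    objects = ijson.items(f, 'item')
    # Consume the iterator while the file is still open
    objects = list(islice(objects, 200))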

ijson.items returns a lazy iterator over a parsed array. The 'item' in the second parameter means "each item in a top-level array".