What's the fastest way to locate specific data in a large JSON?


I have a JSON document which can contain over a million records (each record is a simple object with some fields, but the hierarchy to get to it is about 5 levels deep). I need to find the records in which certain fields contain certain values, preferably in a generic way, in node.js.

I tried jsonpath-plus, which does exactly what I want. The problem is that processing that much data takes about 25 seconds (if I return only the data, without the path, it takes about 10 seconds).
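
Roughly, the jsonpath-plus usage looks like this; the sample data and field names below are made up, and resultType is what switches between returning values only and values plus their paths:

```js
// Hypothetical stand-in for the real blob (the real one is ~5 levels deep
// with up to ~1M records; these field names are made up).
const { JSONPath } = require('jsonpath-plus');

const bigBlob = {
  a: { b: { c: { records: [{ id: 1, status: 'ok' }, { id: 2, status: 'error' }] } } },
};

// Values only (the faster, ~10s case on the real data):
const values = JSONPath({
  path: "$..records[?(@.status === 'error')]",
  json: bigBlob,
  resultType: 'value',
});

// Values plus where they were found (the slower, ~25s case):
const withPaths = JSONPath({
  path: "$..records[?(@.status === 'error')]",
  json: bigBlob,
  resultType: 'all', // each result includes value, path, pointer, parent
});

console.log(values);            // [ { id: 2, status: 'error' } ]
console.log(withPaths[0].path); // e.g. "$['a']['b']['c']['records'][1]"
```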

I tried json_query (an adaptation of Dojo's JSONQuery to node.js). It is really fast (about 1 second) but only returns the data, not the path to the data.

I was wondering if you can think of alternatives I could use, or how I can make jsonpath-plus run faster.

Clarification: I don't generate the data. I receive it with no way of controlling that. I receive the full JSON blob and then I have to perform a few (about 5) queries on it before I get a new one.

Sincerely, Elad

There are 2 answers below.

Answer 1:

"Clarification: I don't generate the data. I receive it with no way of controlling that."

But, you could load it into a database and have it generate an index for you, pre-optimized for the queries you need. (Alternatively, you could build such an index in your own application.)
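
A minimal sketch of such a hand-rolled index, assuming (hypothetically) exact-match queries on a known set of fields, could be a single walk over the blob that maps each field/value pair to the matching records and their paths:

```js
// Minimal sketch of an application-level index, assuming exact-match queries
// on a known, small set of fields (both assumptions are hypothetical).
function buildIndex(root, fields) {
  const index = new Map(); // "field=value" -> [{ path, record }, ...]

  (function walk(node, path) {
    if (node === null || typeof node !== 'object') return;
    for (const field of fields) {
      const value = node[field];
      if (value !== undefined && typeof value !== 'object') {
        const key = `${field}=${value}`;
        if (!index.has(key)) index.set(key, []);
        index.get(key).push({ path, record: node });
      }
    }
    for (const [k, child] of Object.entries(node)) {
      const step = Array.isArray(node) ? `[${k}]` : `['${k}']`;
      walk(child, path + step);
    }
  })(root, '$');

  return index;
}

// One pass to build, then each of the ~5 queries becomes a cheap lookup:
// const index = buildIndex(bigBlob, ['status', 'userId']);
// const hits = index.get('status=error') || [];
```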

"I receive the full JSON blob and then I have to perform a few (about 5) queries on it before I get a new one."

You'll have to determine whether or not building an index to run these queries is worth it. If you're only doing 5 queries, and they require a full search of the data, then it might be faster just to slog through it the way you are now.

One other thought... is this line-delimited JSON, where each record is its own object? If so, you could parallelize the search.
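
If it is, a sketch along these lines, using Node's worker_threads (the field/value filter, the chunking, and the line-number "paths" are all illustrative assumptions), could spread the scan across CPU cores:

```js
// Sketch only: assumes the blob is NDJSON (one record per line) and that a
// simple field === value filter is enough. Both assumptions are hypothetical.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const os = require('os');

if (isMainThread) {
  function searchParallel(ndjsonText, field, value) {
    const lines = ndjsonText.split('\n').filter(Boolean);
    const workerCount = os.cpus().length;
    const chunkSize = Math.ceil(lines.length / workerCount);
    const jobs = [];
    for (let offset = 0; offset < lines.length; offset += chunkSize) {
      jobs.push(new Promise((resolve, reject) => {
        const worker = new Worker(__filename, {
          workerData: { lines: lines.slice(offset, offset + chunkSize), offset, field, value },
        });
        worker.once('message', resolve);
        worker.once('error', reject);
      }));
    }
    // Each worker reports the line numbers (a crude "path") of its matches.
    return Promise.all(jobs).then((parts) => parts.flat());
  }
  module.exports = { searchParallel };
} else {
  // Worker side: parse and filter this chunk of lines.
  const { lines, offset, field, value } = workerData;
  const matches = [];
  lines.forEach((line, i) => {
    const record = JSON.parse(line);
    if (record[field] === value) matches.push({ line: offset + i, record });
  });
  parentPort.postMessage(matches);
}
```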

Answer 2:

I ended up modifying the json_query code to return the path. Adding caching made it work even faster (~200ms per query).
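
The caching can be as simple as a Map keyed by the query string, cleared whenever a new blob arrives; a sketch (runQuery stands in for the modified json_query call, and the invalidation policy is an assumption):

```js
// Sketch of the caching idea, assuming repeated queries against the same blob.
// `runQuery` is a hypothetical stand-in for the modified json_query call.
function createCachedQuery(runQuery) {
  let currentBlob = null;
  const cache = new Map(); // query string -> results (data + path)

  return function cachedQuery(blob, queryString) {
    if (blob !== currentBlob) {
      // A new blob arrived: previous results no longer apply.
      cache.clear();
      currentBlob = blob;
    }
    if (!cache.has(queryString)) {
      cache.set(queryString, runQuery(blob, queryString));
    }
    return cache.get(queryString);
  };
}

// const query = createCachedQuery((blob, q) => jsonQueryWithPath(q, blob));
// query(bigBlob, someQuery); // first call runs the full query
// query(bigBlob, someQuery); // repeat call is served from the cache
```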

As for all the suggestions to use a DB:
I am well aware of the merits of using a DB, but I think that for my specific use case it isn't worth the trouble. The size I stated earlier is a worst-case scenario (by far); I usually expect around 50K records. Even so, and as the json_query package demonstrates, running a full scan on 1M records should not be a problem for a modern computer.
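
For comparison, even a plain hand-written scan that records paths as it goes is just a single in-memory pass; a sketch (the exact-match predicate and JSONPath-style path strings are chosen for illustration):

```js
// Sketch of a generic full scan that returns each match together with a
// JSONPath-style path string. The predicate and path format are illustrative.
function findRecords(root, predicate) {
  const matches = [];
  const stack = [{ node: root, path: '$' }];

  while (stack.length > 0) {
    const { node, path } = stack.pop();
    if (node === null || typeof node !== 'object') continue;
    if (!Array.isArray(node) && predicate(node)) {
      matches.push({ path, record: node });
    }
    for (const [key, child] of Object.entries(node)) {
      const step = Array.isArray(node) ? `[${key}]` : `['${key}']`;
      stack.push({ node: child, path: path + step });
    }
  }
  return matches;
}

// Single in-memory pass over the whole object graph:
// findRecords(bigBlob, (r) => r.status === 'error');
```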