JavaScript serialization and performance with V8 and PostgreSQL


I have been experimenting with PostgreSQL and PL/V8, which embeds the V8 JavaScript engine into PostgreSQL. Using this, I can query into JSON data inside the database, which is rather awesome.

The basic approach is as follows:

CREATE OR REPLACE FUNCTION 
  json_string(data json, key text) RETURNS TEXT AS $$
  var obj = JSON.parse(data);
  return obj[key];
$$ LANGUAGE plv8 IMMUTABLE STRICT;

SELECT id, data FROM things WHERE json_string(data,'name') LIKE 'Z%';

Using V8, I can parse the JSON into a JS object, return a field from it, and use the function as a regular pg query expression.

BUT

On large datasets, performance can be an issue, as for every row I need to parse the data. The parser is fast, but it is definitely the slowest part of the process and it has to happen every time.
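For a sense of what that per-row cost looks like, here is a plain Node.js sketch (my own illustration, not from the original setup) of what the function effectively does: every row pays for a full JSON.parse even though only one key is needed.

```javascript
// Each "row" stores its JSON document as text, as in the things.data column.
const rows = [
  { id: 1, data: '{"name":"Zaphod","heads":2}' },
  { id: 2, data: '{"name":"Arthur","heads":1}' },
  { id: 3, data: '{"name":"Zelda","heads":1}' },
];

// Equivalent of json_string(data, 'name'): parse the whole document,
// return one field. The parse cost is paid once per row, per query.
function jsonString(data, key) {
  return JSON.parse(data)[key];
}

const matches = rows.filter((r) => jsonString(r.data, "name").startsWith("Z"));
console.log(matches.map((r) => r.id)); // [1, 3]
```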

What I am trying to work out (to finally get to an actual question) is whether there is a way to cache or pre-process the JSON. Even storing a binary representation of the JSON in the table, one that V8 could use directly as a JS object, might be a win. I've had a look at alternative formats such as MessagePack and Protocol Buffers, but I don't think they would necessarily be faster than the native JSON parser in any case.

THOUGHT

PG has blobs and binary types, so the data could be stored in binary form; then we would just need a way to marshal it into V8.

There are 5 answers below.

Answer 1

Perhaps, instead of making the retrieval phase responsible for parsing the data, a better approach would be to create a new data type that pre-parses the JSON on input:

http://www.postgresql.org/docs/9.2/static/sql-createtype.html

Answer 2

I don't know if it would be useful here, but I came across this: pg-to-json-serializer. It mentions functionality for:

parsing JSON strings and filling postgreSQL records/arrays from it

I don't know if it would offer any performance benefit over what you've been doing so far though, and I don't really even understand their examples.

Just thought it was worth mentioning.

Answer 3

I don't have any experience with this, but it got me curious so I did some reading.

JSON only

What about something like the following (untested, BTW)? It doesn't address your question about storing a binary representation of the JSON; instead, it attempts to parse all of the JSON at once, for all of the rows being checked, in the hope that one big parse will cost less than doing it individually for each row. If it succeeds at that, I suspect it will come at the price of higher memory consumption.

The CREATE TYPE...set_of_records() stuff is adapted from the example on the wiki where it mentions that "You can also return records with an array of JSON." I guess it really means "an array of objects".

Is the id value from the DB record embedded in the JSON?

Version #1

CREATE TYPE rec AS (id integer, data text, name text);

CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  var records = plv8.execute( "SELECT id, data FROM things" );
  var data = [];

  // Use a for loop instead if it gives better performance
  records.forEach( function ( rec ) {
    data.push( rec.data );
  } );

  // Parse all of the rows' JSON in one pass
  data = JSON.parse( "[" + data.join( "," ) + "]" );

  records.forEach( function ( rec, i ) {
    rec.name = data[ i ].name;
  } );

  return records;
$$
LANGUAGE plv8;

SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%';
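Outside Postgres, the batching idea in this function can be sanity-checked in plain Node.js (my own sketch, not part of the answer): concatenating the rows' JSON text and parsing it as one array yields the same values as parsing each row separately.

```javascript
// Simulated result of plv8.execute("SELECT id, data FROM things"):
// each record carries its JSON document as text.
const records = [
  { id: 1, data: '{"name":"Zaphod"}' },
  { id: 2, data: '{"name":"Zelda"}' },
];

// One parse for all rows, as in the function above...
const batch = JSON.parse("[" + records.map((r) => r.data).join(",") + "]");
records.forEach(function (rec, i) {
  rec.name = batch[i].name;
});

// ...is equivalent to parsing row by row.
const perRow = records.map((r) => JSON.parse(r.data).name);
console.log(records.map((r) => r.name)); // ["Zaphod", "Zelda"]
console.log(perRow); // ["Zaphod", "Zelda"]
```

One caveat: the batch only parses if every data value is itself well-formed JSON; a single malformed row makes the whole concatenated array fail, whereas the per-row version would fail on just that row.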

Version #2

This one gets Postgres to aggregate / concatenate some values to cut down on the processing done in JS.

CREATE TYPE rec AS (id integer, data text, name text);

CREATE FUNCTION set_of_records() RETURNS SETOF rec AS
$$
  var cols = plv8.execute(
    "SELECT " +
    "array_agg( id ORDER BY id ) AS id, " +
    "string_agg( data, ',' ORDER BY id ) AS data " +
    "FROM things"
  )[0];

  // Parse all of the rows' JSON in one pass
  cols.data = JSON.parse( "[" + cols.data + "]" );

  var records = cols.id;

  // Use a for loop instead if it gives better performance
  records.forEach( function ( id, i, arr ) {
    arr[ i ] = {
      id : id,
      data : cols.data[ i ],
      name : cols.data[ i ].name
    };
  } );

  return records;
$$
LANGUAGE plv8;

SELECT id, data FROM set_of_records() WHERE name LIKE 'Z%';

hstore

How would the performance of this compare: duplicate the JSON data into an hstore column at write time (or, if the performance somehow managed to be good enough, convert the JSON to hstore at SELECT time) and use the hstore in your WHERE clause, e.g.:

SELECT id, data FROM things WHERE hstore_data -> 'name' LIKE 'Z%'

I heard about hstore from here: http://lwn.net/Articles/497069/

The article mentions some other interesting things:

PL/v8 lets you...create expression indexes on specific JSON elements and save them, giving you stored search indexes much like CouchDB's "views".

It doesn't elaborate on that and I don't really know what it's referring to.

There's a comment attributed as "jberkus" that says:

We discussed having a binary JSON type as well, but without a protocol to transmit binary values (BSON isn't at all a standard, and has some serious glitches), there didn't seem to be any point.

If you're interested in working on binary JSON support for PostgreSQL, we'd be interested in having you help out ...

Answer 4

Postgres supports indexes on arbitrary function calls. The following index should do the trick:

CREATE INDEX json_idx ON things (json_string(data,'name'));
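A caveat worth adding (my note, not the original answer's): the function must be declared IMMUTABLE for Postgres to accept it in an index, which the question's definition already is. Also, for prefix matching with LIKE 'Z%' under a non-C locale, the index may need the text_pattern_ops operator class:

```sql
-- Sketch; assumes the json_string() function from the question.
CREATE INDEX json_idx ON things (json_string(data, 'name') text_pattern_ops);

-- The planner can then satisfy prefix LIKE queries from the index:
SELECT id, data FROM things WHERE json_string(data, 'name') LIKE 'Z%';
```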
Answer 5

The short version appears to be that with Pg's new json support, so far there's no way to store json directly in any form other than serialised json text. (This looks likely to change in 9.4)

You seem to want to store a pre-parsed form that's a serialised representation of how v8 represents the json in memory, and that's not currently supported. It's not even clear that v8 offers any kind of binary serialisation/deserialisation of json structures. If it doesn't do so natively, code would need to be added to Pg to produce such a representation and to turn it back into v8 json data structures.

It also wouldn't necessarily be faster:

  • If json was stored in a v8 specific binary form, queries that returned the normal json representation to clients would have to format it each time it was returned, incurring CPU cost.

  • A binary serialised version of json isn't the same thing as storing the v8 json data structures directly in memory. You can't write a data structure that involves any kind of graph of pointers out to disk directly, it has to be serialised. This serialisation and deserialisation has a cost, and it might not even be much faster than parsing the json text representation. It depends a lot on how v8 represents JavaScript objects in memory.

  • The binary serialised representation could easily be bigger, since most json is text and small numbers, where you don't gain any compactness from a binary representation. Since storage size directly affects the speed of table scans, value fetches from TOAST, decompression time required for TOASTed values, index sizes, etc, you could easily land up with slower queries and bigger tables.

I'd be interested to see whether an optimisation like what you describe is possible, and whether it'd turn out to be an optimisation at all.

To gain the benefits you want when doing table scans, I guess what you really need is a format that can be traversed without having to parse it and turn it into what's probably a malloc()'d graph of javascript objects. You want to be able to give a path expression for a field and grab it out directly from the serialised form where it's been read into a Pg read buffer or into shared_buffers. That'd be a really interesting design project, but I'd be surprised if anything like it existed in v8.
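To make that idea concrete, here is a toy Node.js sketch (entirely hypothetical, not an existing format) of a serialisation that can answer "give me field X" without materialising an object graph: a small key directory with byte offsets stored ahead of the raw values.

```javascript
// Toy "traversable" encoding: [directory JSON]\0[value bytes...]
// The directory maps each top-level key to the [start, end) byte range
// of its JSON-encoded value, so a single field can be sliced out and
// parsed without touching the rest of the document.
function encode(obj) {
  const dir = {};
  const chunks = [];
  let offset = 0;
  for (const [key, value] of Object.entries(obj)) {
    const bytes = Buffer.from(JSON.stringify(value));
    dir[key] = [offset, offset + bytes.length];
    chunks.push(bytes);
    offset += bytes.length;
  }
  const header = Buffer.from(JSON.stringify(dir) + "\0");
  return Buffer.concat([header, ...chunks]);
}

// Extract one field: parse only the directory and that field's bytes.
function getField(buf, key) {
  const sep = buf.indexOf(0); // JSON text never contains a raw NUL byte
  const dir = JSON.parse(buf.slice(0, sep).toString());
  if (!(key in dir)) return undefined;
  const [start, end] = dir[key];
  return JSON.parse(buf.slice(sep + 1 + start, sep + 1 + end).toString());
}

const buf = encode({ name: "Zaphod", heads: 2, notes: "mostly harmless" });
console.log(getField(buf, "name")); // "Zaphod"
```

It only handles one level of nesting and still pays a small parse for the directory, but it illustrates the trade: more work at write time and a fatter on-disk format, in exchange for field access that doesn't deserialise the whole document.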

What you really need to do is research how the existing json-based object databases do fast searches for arbitrary json paths and what their on-disk representations are, then report back on pgsql-hackers. Maybe there's something to be learned from people who've already solved this - presuming, of course, that they have.

In the meantime, what I'd want to focus on is what the other answers here are doing: working around the slow point and finding other ways to do what you need. You could also look into helping to optimise the JSON parser, but depending on whether the v8 one or some other one is in use, that might already be far past the point of diminishing returns.

I guess this is one of the areas where there's a trade-off between speed and flexible data representation.