Re: additional json functionality - Mailing list pgsql-hackers

From Maciej Gajewski
Subject Re: additional json functionality
Date
Msg-id CAEcSYXKuq2o6GM+2pasEsfhCz5y_nEUdPYegcncdcRyorx0w-A@mail.gmail.com
Whole thread Raw
In response to Re: additional json functionality  (Greg Stark <stark@mit.edu>)
List pgsql-hackers

Hi everyone

I used to work on a project storing large quantities of schema-less data, initially using MongoDB, then Postgres with JSON, and eventually I implemented BSON support for Postgres to get the best of both worlds: https://github.com/maciekgajewski/postgresbson

I don't think that JSONB is a good idea. There is a lot to learn from MongoDB's mistakes in this area.

1. As noted in this thread previously, JSON is a serialization format, not a document format.

2. Almost any structured data type, self-describing or not, can be serialized to/from JSON, but always using only subset of it, and interpreting it in it's own specific way.

3. JSON greatest strength is interoperability. It is a great feature of Postgres that JSON is stored as a text; it's basically a 'text, but you can do something with it'.  There is many JSON implementations out there, and one should make no assumption about application's expectations.
For instance: JSON standard (RFS-4627) defines all number to be doubles. Yet I've seen application storing 64-bit integers (wouldn't fit in double precision), or even arbitrary precision integers. Most parsers are OK with that.

4. JSON greatest weakness is performance. Because of 1. it needs to be parsed before any useful information is extracted.

5. 1. and 3. are mutually exclusive; this is one of the most valuable takeaways I have from working with Mongo and BSON in particular. BSON is an attempt to create 'binary JSON', and a failed one. It is a poor serialization format: faster than JSON, but less flexible. Being binary, it is strongly typed, and it uses various gimmicks to preserve flexibility: implicit type casts, 3 different equality comparison functions etc. And it's not fully convertible to/from JSON; no binary format is.
It is a poor document format as well: it retains some of the JSON's performance problems: serial nature and ineffective storage.

6. Speed matters to some, and being able to generate binary data in application and send it to database without any serialization/parsing in between provides great optimization opportunity. One thing that Mongo guys got right is the fixed, well-defined binary representation. Application can use provided library to generate objects, and doesn't need to worry about server's version or endianess.

In the application I've mention before, switching from JSON to BSON (in Postgres 9.2, using postgresbson) increased throughput by an order of magnitude. It was an insert-heavy database with indexes on object fields. Both serializing in application and desalinizing in server was faster ~10x.

7. It seems to me that JSONB is going to repeat all the mistakes of BSON, it's going to be 'neither'. If there is an agreement that Postgres needs a 'document' format, why not acknowledge 5., and simply adopt one of the existing formats. Or even better: adopt none, provide many, provide binary send/recv and conversion to and from JSON, let the user choose.

The world is full of self-describing binary formats: BSON, MessagePack (http://msgpack.org/), protobuf, hierarchical H-Store is coming along. Adding another one would create confusion, and a situation similar to this: http://xkcd.com/927/


And a side note:

Postgres' greatest and most under-advertised feature is it's extensibility. People tend to notice only the features present in the core package, while there should be a huge banner on top of http://www.postgresql.org/: "Kids, we support all data types: we have XML, we have JSON, we have H-store, we have BSON, and all it with build-in indexing, storage compression and full transaction support!"



Maciej G.

pgsql-hackers by date:

Previous
From: Atri Sharma
Date:
Subject: Re: WITHIN GROUP patch
Next
From: David Rowley
Date:
Subject: b21de4e7b32f868a23bdc5507898d36cbe146164 seems to be two bricks shy of a load