Re: legacy assumptions - Mailing list pgsql-docs

From Jonathan Buhacoff
Subject Re: legacy assumptions
Msg-id 62f15e85-4bc7-ac8b-b0b5-05412e81e2b0@buhacoff.net
In response to Re: legacy assumptions  ("Jonathan S. Katz" <jkatz@postgresql.org>)
List pgsql-docs
Thank you for sharing that alternative approach.

"In general, most applications should prefer to store JSON data as jsonb,
unless there are quite specialized needs, such as assumptions about ordering
of object keys or the need to retrieve the data exactly as it was stored."

I don't know how to distill your example into a succinct addition to
that paragraph, and a brief Internet search didn't turn up anything
quite like it in the first few pages of results. I think including the
full example on that page, or linking to it, would help a lot for anyone
with these specialized requirements who is reading that page.

Since you mentioned the RFC, I'll just point out that in my application
the integrity check is computed over the encoded JSON data (before
parsing), not over the parsed JSON object in memory -- so there is no
assumption about the ordering of keys. The JSON is not even parsed by the
application if the integrity check fails. When the application writes
data, it encodes the JSON first, then computes the integrity check over
the encoded bytes. The data could be stored in any format and the
procedure would be the same. The reason for using the json type (or a
bytea column with an expression index, jsonb, and GIN, as you suggested)
is to be able to select records based on the untrusted JSON data before
validating and re-parsing it in the application.
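
For illustration, the shape of it is roughly the following (the table and
column names here are made up, and the digest is whatever the application
computes over the encoded bytes, e.g. a SHA-256 hash):

CREATE TABLE signed_docs (
    doc    json  NOT NULL,  -- stored exactly as received; key order and whitespace preserved
    digest bytea NOT NULL   -- computed by the application over the encoded JSON bytes, before INSERT
);

-- select candidate rows on the (still untrusted) JSON first ...
SELECT doc, digest
FROM signed_docs
WHERE doc ->> 'id' = '567';

-- ... then the application verifies digest against the returned document
-- text, byte for byte as stored, and only parses the JSON if the check passes.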

Jonathan B

On 11/25/2019 4:28 PM, Jonathan S. Katz wrote:
> Hi,
>
> On 11/25/19 12:47 PM, PG Doc comments form wrote:
>> The following documentation comment has been logged on the website:
>>
>> Page: https://www.postgresql.org/docs/12/datatype-json.html
>> Description:
>>
>> I'm wondering if this one line of section 8.14 JSON Types
>> (https://www.postgresql.org/docs/current/datatype-json.html) can be edited
>> to remove the word "legacy":
>>
>> "In general, most applications should prefer to store JSON data as jsonb,
>> unless there are quite specialized needs, such as legacy assumptions about
>> ordering of object keys."
>>
>> I'm concerned that with the word "legacy" there, someone might come along
>> eventually and decide the json column type isn't needed anymore because it's
>> "legacy", where in fact there are modern and legitimate uses for a field
>> that allows you to retrieve the data exactly as it was stored and allows
>> JSON queries on that data (even if they are slower).
> While I'm certainly sensitive to this need, as once upon a time I had a
> similar (though slightly less strict) requirement, I made sure not to
> rely on the PostgreSQL JSON type itself to ensure ordering was preserved
> (in my case I was able to rely on a solution external to PostgreSQL).
>
> The JSON RFC states that objects should be considered "unordered", and
> mentions that while different parsing libraries may preserve key
> ordering, "implementations whose behavior does not depend on member
> ordering will be interoperable in the sense that they will not be
> affected by these differences."[1]
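>
> For instance, a quick illustration (jsonb normalizes the stored
> representation, while json keeps the input text exactly as entered):
>
> SELECT '{"b": 1, "a": 2}'::jsonb;  -- {"a": 2, "b": 1}
> SELECT '{"b": 1, "a": 2}'::json;   -- {"b": 1, "a": 2}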
>
>> An alternative would be to store the
>> plaintext as binary data for the integrity check and have a separate jsonb
>> column with a second copy of the same data. Since different applications
>> have different time/space tradeoffs, it's good to have the choice.
> Another approach is to leverage PostgreSQL's expression index
> capabilities, which would allow you to limit the data duplication. For
> example:
>
> CREATE TABLE docs (doc bytea);
>
> -- populating some test data
> INSERT INTO docs
> SELECT ('{"id": ' || x || ', "data": [1,2,3] }')::bytea
> FROM generate_series(1, 100000) x;
>
> -- create an expression index that maps to the operators supported by GIN
> CREATE INDEX docs_doc_json_idx ON docs
>     USING gin(jsonb(encode(doc, 'escape')));
>
> and in one test run:
>
> EXPLAIN
> SELECT doc
> FROM docs WHERE encode(doc, 'escape')::jsonb @> '{"id": 567}';
>
> I got a plan similar to:
>
>                                      QUERY PLAN
> ------------------------------------------------------------------------------------
>  Bitmap Heap Scan on docs  (cost=28.77..306.00 rows=100 width=31)
>    Recheck Cond: ((encode(doc, 'escape'::text))::jsonb @> '{"id": 567}'::jsonb)
>    ->  Bitmap Index Scan on docs_doc_json_idx  (cost=0.00..28.75 rows=100 width=0)
>          Index Cond: ((encode(doc, 'escape'::text))::jsonb @> '{"id": 567}'::jsonb)
>
> In this way, you can:
>
> - Keep the key ordering preserved and perform any integrity checks, etc.
> that your application requires
> - Limit your data duplication to that of the index
> - Still get the benefits of the JSONB lookup functions that work with
> the indexing
> - Still perform JSON validation:
>
> INSERT INTO docs VALUES ('{]'::bytea);
>
> ERROR:  invalid input syntax for type json
> DETAIL:  Expected string or "}", but found "]".
> CONTEXT:  JSON data, line 1: {]
>
>> My suggestion for that sentence:
>>
>> "In general, most applications should prefer to store JSON data as jsonb,
>> unless there are quite specialized needs, such as assumptions about ordering
>> of object keys or the need to retrieve the data exactly as it was stored."
> My preference would be for the documentation to give guidance on what
> to do if one has an application that is sensitive to ordering. I'm not
> opposed to the wording, but I'd prefer we encourage people to leverage
> JSONB for storage & retrieval.
>
> Thanks!
>
> Jonathan
>
> [1] https://tools.ietf.org/html/rfc7159#section-4
>


