Thread: legacy assumptions

legacy assumptions

From
PG Doc comments form
Date: 11/25/19 12:47 PM
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/12/datatype-json.html
Description:

I'm wondering if this one line of section 8.14 JSON Types
(https://www.postgresql.org/docs/current/datatype-json.html) can be edited
to remove the word "legacy":

"In general, most applications should prefer to store JSON data as jsonb,
unless there are quite specialized needs, such as legacy assumptions about
ordering of object keys."

I'm concerned that with the word "legacy" there, someone might come along
eventually and decide the json column type isn't needed anymore because it's
"legacy", where in fact there are modern and legitimate uses for a field
that allows you to retrieve the data exactly as it was stored and allows
JSON queries on that data (even if they are slower).
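
For example (just an illustrative sketch, not text from the docs), the
difference is easy to see:

-- json gives back the text exactly as it was stored
SELECT '{"b": 2, "a": 1}'::json;   -- {"b": 2, "a": 1}

-- jsonb normalizes key order and whitespace on input
SELECT '{"b": 2, "a": 1}'::jsonb;  -- {"a": 1, "b": 2}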

In my application, there's a database table to store encrypted and
integrity-protected information. The ciphertext is in a binary data column,
the ciphertext metadata is in a jsonb column, and integrity-protected
plaintext is in a json column. The integrity protection is implemented as
either the AD portion of AES-GCM AEAD or as a separately-computed HMAC;
either way the bytes must be read exactly as they were stored or the
integrity check will fail. Being able to select records based on the content
of that plaintext json data is nice. An alternative would be to store the
plaintext as binary data for the integrity check and have a separate jsonb
column with a second copy of the same data. Since different applications
have different time/space tradeoffs, it's good to have the choice.
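
To make that concrete, here is a rough sketch of such a table (the names are
made up for illustration; they are not my actual schema):

CREATE TABLE protected_records (
    id          bigserial PRIMARY KEY,
    ciphertext  bytea  NOT NULL,  -- encrypted payload
    meta        jsonb  NOT NULL,  -- ciphertext metadata (IV, key id, ...)
    plaintext   json   NOT NULL   -- integrity-protected plaintext, kept byte-for-byte
);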

My suggestion for that sentence:

"In general, most applications should prefer to store JSON data as jsonb,
unless there are quite specialized needs, such as assumptions about ordering
of object keys or the need to retrieve the data exactly as it was stored."

Re: legacy assumptions

From
"Jonathan S. Katz"
Date: 11/25/2019 4:28 PM
Hi,

On 11/25/19 12:47 PM, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/12/datatype-json.html
> Description:
>
> I'm wondering if this one line of section 8.14 JSON Types
> (https://www.postgresql.org/docs/current/datatype-json.html) can be edited
> to remove the word "legacy":
>
> "In general, most applications should prefer to store JSON data as jsonb,
> unless there are quite specialized needs, such as legacy assumptions about
> ordering of object keys."
>
> I'm concerned that with the word "legacy" there, someone might come along
> eventually and decide the json column type isn't needed anymore because it's
> "legacy", where in fact there are modern and legitimate uses for a field
> that allows you to retrieve the data exactly as it was stored and allows
> JSON queries on that data (even if they are slower).

While I'm certainly sensitive to this need, as I once had a similar
(though slightly less strict) requirement, I made sure not to rely on the
PostgreSQL json type itself to preserve ordering (and in my case I was
able to rely on a solution external to PostgreSQL).

The JSON RFC states that objects should be considered "unordered", and
mentions that while different parsing libraries may preserve key
ordering, "implementations whose behavior does not depend on member
ordering will be interoperable in the sense that they will not be
affected by these differences."[1]
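
To put that in PostgreSQL terms (a quick illustration):

-- jsonb treats objects with the same members as equal regardless of key order
SELECT '{"a": 1, "b": 2}'::jsonb = '{"b": 2, "a": 1}'::jsonb;           -- true

-- the stored json text, by contrast, differs byte-for-byte
SELECT '{"a": 1, "b": 2}'::json::text = '{"b": 2, "a": 1}'::json::text; -- false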

> An alternative would be to store the
> plaintext as binary data for the integrity check and have a separate jsonb
> column with a second copy of the same data. Since different applications
> have different time/space tradeoffs, it's good to have the choice.

Another approach is to leverage PostgreSQL's expression index
capabilities, which would allow you to limit the data duplication. For
example:

CREATE TABLE docs (doc bytea);

-- populating some test data
INSERT INTO docs
SELECT ('{"id": ' || x || ', "data": [1,2,3] }')::bytea
FROM generate_series(1, 100000) x;

-- create an expression index over the jsonb value so the GIN jsonb
-- operators (such as @>) can use it
CREATE INDEX docs_doc_json_idx ON docs
    USING gin ((encode(doc, 'escape')::jsonb));

and in one test run:

EXPLAIN
SELECT doc
FROM docs WHERE encode(doc, 'escape')::jsonb @> '{"id": 567}';

I got a plan similar to:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=28.77..306.00 rows=100 width=31)
   Recheck Cond: ((encode(doc, 'escape'::text))::jsonb @> '{"id": 567}'::jsonb)
   ->  Bitmap Index Scan on docs_doc_json_idx  (cost=0.00..28.75 rows=100 width=0)
         Index Cond: ((encode(doc, 'escape'::text))::jsonb @> '{"id": 567}'::jsonb)

In this way, you can:

- Keep the key ordering preserved and perform any integrity checks, etc.
that your application requires
- Limit your data duplication to that of the index
- Still get the benefits of the JSONB operators and functions that can use
the index
- Still perform JSON validation:

INSERT INTO docs VALUES ('{]'::bytea);

ERROR:  invalid input syntax for type json
DETAIL:  Expected string or "}", but found "]".
CONTEXT:  JSON data, line 1: {]

> My suggestion for that sentence:
>
> "In general, most applications should prefer to store JSON data as jsonb,
> unless there are quite specialized needs, such as assumptions about ordering
> of object keys or the need to retrieve the data exactly as it was stored."

My preference would be for the documentation to offer guidance on what to
do if one has an application that is sensitive to ordering. I'm not opposed
to the wording, but I'd prefer we encourage people to leverage JSONB for
storage and retrieval.

Thanks!

Jonathan

[1] https://tools.ietf.org/html/rfc7159#section-4



Re: legacy assumptions

From
Jonathan Buhacoff
Date:
Thank you for sharing that alternative approach.

My suggested sentence was:

"In general, most applications should prefer to store JSON data as jsonb,
unless there are quite specialized needs, such as assumptions about ordering
of object keys or the need to retrieve the data exactly as it was stored."

I don't know how to distill your example into a succinct addition to
that paragraph. A brief Internet search I just tried didn't yield
anything quite like it in the first few pages. I think including the
full example on that page, or linking to it, would help a lot for anyone
with these specialized requirements who is reading that page.

Since you mentioned the RFC, I'll just point out that in my application the
integrity check is computed over the encoded JSON data (before parsing), not
over the parsed JSON object in memory -- so there is no assumption about the
ordering of keys. The application does not even parse the JSON if the
integrity check fails. When the application writes data, it encodes the JSON
first, then computes the integrity value over the encoded bytes. The data
could be stored in any format and the procedure would be the same. The
reason for using the json type (or bytea with a GIN expression index over
jsonb, as you suggested) is to be able to select records based on the
untrusted JSON data before validating and re-parsing it in the application.
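
Roughly, the read path looks like this (reusing the illustrative
protected_records sketch from my first message; the real column and key
names differ):

-- select candidate rows on the untrusted plaintext
SELECT id, plaintext, ciphertext
FROM protected_records
WHERE plaintext->>'document_type' = 'invoice';

The application then verifies the HMAC (or AES-GCM tag) over the returned
plaintext bytes before trusting or re-parsing them.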

Jonathan B
