Re: Wanted: jsonb on-disk representation documentation - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Wanted: jsonb on-disk representation documentation
Date
Msg-id 3813.1399493613@sss.pgh.pa.us
Whole thread Raw
In response to Re: Wanted: jsonb on-disk representation documentation  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Wanted: jsonb on-disk representation documentation
List pgsql-hackers
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, May 7, 2014 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The jsonb_ops storage format for values is inherently lossy, because it
>> cannot distinguish the string values "n", "t", "f" from JSON null or
>> boolean true, false respectively; nor does it know the difference between
>> a number and a string containing digits.  This appears to not quite be a
>> bug because the consistent functions force recheck for all queries that
>> care about values (as opposed to keys).  But if it's documented anywhere
>> I don't see where.

> The fact that we *don't* set the reset flag for
> JsonbExistsStrategyNumber, and why that's okay is prominently
> documented. So I'd say that it is.

Meh.  Are you talking about the large comment block in gin_extract_jsonb?
The readability of that comment starts to go downhill with its use of
"reset" to refer to what everything else calls a "recheck" flag, and in
any case it's claiming that we *don't* need a recheck for exists (a
statement I suspect to be false, but more later).  It entirely fails to
explain that other query types *do* need a recheck because of arbitrary
decisions about not representing JSON datatype information fully.  There's
another comment in gin_consistent_jsonb that's just as misleading, because
it mentions (vaguely) some reasons why recheck is necessary without
mentioning this one.

A larger issue here is that, to the extent that this comment even has
the information I'm after, the comment is in the wrong place.  It is not
attached to the code that's actually making the lossy representational
choices (ie, make_scalar_key), nor to the code that decides whether or not
a recheck is needed (ie, gin_consistent_jsonb).  I don't think that basic
architectural choices like these should be relegated to comment blocks
inside specific functions to begin with.  A README file would be better,
perhaps, but there's not a specific directory associated with the jsonb
code; so I think this sort of info belongs either in jsonb.h or in the
file header comment for jsonb_gin.c.

> Anyway, doing things that way for values won't obviate the need to set
> the reset flag, unless you come up with a much more sophisticated
> scheme. Existence (of keys) is only tested in respect of the
> top-level. Containment (where values are tested) is more complicated.

I'm not expecting that it will make things better for the current query
operators.  I am thinking that exact rather than lossy storage might help
for future operators wanting to use this same index representation.
Once we ship 9.4, it's going to be very hard to change the index
representation, especially for the default opclass (cf the business about
which opclass is default for type inet).  So it behooves us to not throw
information away needlessly.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Re: 9.4 checksum errors in recovery with gin index
Next
From: Robert Haas
Date:
Subject: Re: bgworker crashed or not?