Re: Wanted: jsonb on-disk representation documentation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Wanted: jsonb on-disk representation documentation |
Date | |
Msg-id | 3813.1399493613@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Wanted: jsonb on-disk representation documentation (Peter Geoghegan <pg@heroku.com>) |
Responses |
Re: Wanted: jsonb on-disk representation documentation
|
List | pgsql-hackers |
Peter Geoghegan <pg@heroku.com> writes: > On Wed, May 7, 2014 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The jsonb_ops storage format for values is inherently lossy, because it >> cannot distinguish the string values "n", "t", "f" from JSON null or >> boolean true, false respectively; nor does it know the difference between >> a number and a string containing digits. This appears to not quite be a >> bug because the consistent functions force recheck for all queries that >> care about values (as opposed to keys). But if it's documented anywhere >> I don't see where. > The fact that we *don't* set the reset flag for > JsonbExistsStrategyNumber, and why that's okay is prominently > documented. So I'd say that it is. Meh. Are you talking about the large comment block in gin_extract_jsonb? The readability of that comment starts to go downhill with its use of "reset" to refer to what everything else calls a "recheck" flag, and in any case it's claiming that we *don't* need a recheck for exists (a statement I suspect to be false, but more later). It entirely fails to explain that other query types *do* need a recheck because of arbitrary decisions about not representing JSON datatype information fully. There's another comment in gin_consistent_jsonb that's just as misleading, because it mentions (vaguely) some reasons why recheck is necessary without mentioning this one. A larger issue here is that, to the extent that this comment even has the information I'm after, the comment is in the wrong place. It is not attached to the code that's actually making the lossy representational choices (ie, make_scalar_key), nor to the code that decides whether or not a recheck is needed (ie, gin_consistent_jsonb). I don't think that basic architectural choices like these should be relegated to comment blocks inside specific functions to begin with. A README file would be better, perhaps, but there's not a specific directory associated with the jsonb code; so I think this sort of info belongs either in jsonb.h or in the file header comment for jsonb_gin.c. > Anyway, doing things that way for values won't obviate the need to set > the reset flag, unless you come up with a much more sophisticated > scheme. Existence (of keys) is only tested in respect of the > top-level. Containment (where values are tested) is more complicated. I'm not expecting that it will make things better for the current query operators. I am thinking that exact rather than lossy storage might help for future operators wanting to use this same index representation. Once we ship 9.4, it's going to be very hard to change the index representation, especially for the default opclass (cf the business about which opclass is default for type inet). So it behooves us to not throw information away needlessly. regards, tom lane
pgsql-hackers by date: