Re: jsonb and nested hstore - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: jsonb and nested hstore
Date
Msg-id CAM3SWZRase2g5Yyhg+vt1dzv85jmfyauE7A7yFBdToEULwojag@mail.gmail.com
Whole thread Raw
In response to Re: jsonb and nested hstore  (Bruce Momjian <bruce@momjian.us>)
Responses Re: jsonb and nested hstore
Re: jsonb and nested hstore
Re: jsonb and nested hstore
Re: jsonb and nested hstore
List pgsql-hackers
On Fri, Mar 7, 2014 at 9:00 AM, Bruce Momjian <bruce@momjian.us> wrote:
> OK, it sounds like the adjustments are minimal, like not using the
> high-order bit.

Attached patch is a refinement of the work of Oleg, Teodor and Andrew.
Revisions are mostly my own, although Andrew contributed too.

Changes include:

* Extensive relocation, and moderate restructuring of code. Many
comments added, while many existing comments were copy-edited. Nothing
remains in contrib. jsonb is a distinct, in-core type with no
user-visible relationship to hstore. There is no code dependency
between the two. The amount of code redundancy this turned out to
create (between jsonb and an unchanged hstore) is, in my estimation,
quite acceptable.

* B-Tree and hash operator classes for the core type are included. A
GiST operator class, and two GIN operator classes are also included.
Obviously this is where I spent most time by far.

* Everything else that was in hstore in the last revision (the
complement of the hstore2 opclasses) is removed entirely. The patch is
much smaller. If we just consider code (excluding tests and
documentation), the diffstat seems far more manageable:

***SNIP***
 src/backend/catalog/system_views.sql     |    8 +
 src/backend/utils/adt/Makefile           |   10 +-
 src/backend/utils/adt/json.c             |   42 +-
 src/backend/utils/adt/jsonb.c            |  461 +++++++++++
 src/backend/utils/adt/jsonb_gin.c        |  471 +++++++++++
 src/backend/utils/adt/jsonb_gist.c       |  699 ++++++++++++++++
 src/backend/utils/adt/jsonb_op.c         |  565 +++++++++++++
 src/backend/utils/adt/jsonb_support.c    | 1350 +++++++++++++++++++++++++++++++
 src/backend/utils/adt/jsonfuncs.c        | 1155 +++++++++++++++++++++++---
 src/backend/utils/adt/numeric.c          |   39 +
 src/include/catalog/pg_amop.h            |   36 +
 src/include/catalog/pg_amproc.h          |   18 +-
 src/include/catalog/pg_cast.h            |    4 +
 src/include/catalog/pg_opclass.h         |    5 +
 src/include/catalog/pg_operator.h        |   37 +-
 src/include/catalog/pg_opfamily.h        |    5 +
 src/include/catalog/pg_proc.h            |  103 ++-
 src/include/catalog/pg_type.h            |    8 +
 src/include/funcapi.h                    |    9 +
 src/include/utils/json.h                 |   15 +
 src/include/utils/jsonapi.h              |    8 +-
 src/include/utils/jsonb.h                |  245 ++++++
 src/include/utils/numeric.h              |    1 +
***SNIP***

That's less than 5,000 lines of code, which is actually not terribly
much for a new datatype with voluminous though simple code to
implement each of the new operators. I would like to emphasize that
100% of changes to C files occur at src/backend/utils/adt/* .

On reflection I realize that I didn't like the "type aware" operators
that existed for hstore in prior revisions, and I don't have a sense
that I've made cuts to the patch to reduce the code footprint that I'd
not have made with no constraints on time. While there are one or two
exceptions to this, surprisingly, that's all. Frequently, it's
sufficient to interact with jsonb using the jsonb shadow type system,
and where it isn't then I think that shadow-type-specific operators
are the wrong way to go.

* Extensive additional documentation. References to the very new JSON
RFC. I think that this revision is in general a lot more coherent, and
I found that reflecting on what idiomatic usage should look like while
writing the documentation brought clarity to my thoughts on how the
code should be structured. The documentation is worth a read if you
want to get a better sense of what the patch is about relatively
quickly.

* Operators for jsonb that comprise the default B-Tree operator class
no longer look strange (i.e., they're no longer deliberately ugly to
discourage their use, because unlike with hstore this doesn't make
sense, particularly when sorting jsonb that only the shadow type
system knows to be "raw strings").

* Numerous bug fixes.

* Since preserving on-disk compatibility with hstore1 is no longer an
objective, I was able to remove some code there.

* The jsonb type has test coverage more or less equivalent to the
previous revision where the now-expunged hstore2 type's output was
tested (though there are some more tests that I myself added towards
the end of this past week as part of the process of fixing bugs). I
thank my Heroku colleague Maciek Sakrejda for working on this. This
was largely a mechanical process.

Concerns:

* I would like to have a lot more comments on the GiST code from its
authors, because it's currently quite difficult to follow. A couple of
the bigger GiST support routines currently have zero useful comments.
I would deem it acceptable to lose GiST support entirely for now, if
the 700 line saving made it possible to squeeze this into 9.4. ISTM
that the two GIN operator classes are far more useful, and their code
is quite a bit simpler (apparently the state of the art here is a
system that can query JSON-like structures reasonably well, but has a
global write lock, which to me suggests that GIN represents a better
trade-off).

* That GiST code has bugs, too. I ran out of steam for fixing it
before posting. The basic symptom is I saw is that when creating a
large GiST index, sometimes you'll hit an infinite loop here:

warning: no loadable sections found in added symbol-file
system-supplied DSO at 0x7fffc70eb000
0x0000000000829de5 in hemdistsign (a=0x7fd6c833b260
"*\342\060\331\351b\201\250\002\060\212)\304\337\a*", b=0x17084f0 "")
at jsonb_gist.c:655
655            if (GETBIT(a, i) != GETBIT(b, i))
(gdb)
(gdb)
(gdb) n
656                dist++;
(gdb)
653        LOOPBIT
(gdb)
655            if (GETBIT(a, i) != GETBIT(b, i))
(gdb)
653        LOOPBIT

It's possible that I regress things in a way not captured by the
regression test, and that this is my fault, but whatever the case
let's either get it fixed or cut it for now.

* The jsonb_hash_ops non-default GIN opclass is broken. It has its own
idiosyncratic notion of what constitutes containment, that sees it
only return, say, jsonb arrays that have a matching string as their
"leftmost" element (if we ask it if it contains within it another
array with the same string). Because of the limited number of
indexable operators (only @>), I'd put this opclass in the same
category as GiST in terms of my willingness to forgo it for a release,
even if it did receive a loud applause at pgConf.EU. Again, it might
be some disparity between the opertors as they existed in hstore2 at
one time, and as they exist in the core code now, but I doubt it, not
least since the regression tests didn't pick this up, and it's such a
basic thing. Perhaps Oleg and Teodor just need to explain this to me.

* More generally, I do not understand GiST and GIN to the same extent
as several other people (even after excluding the authors of this
patch). Feedback on those aspects from others would be particularly
useful at this point. Is the text-based storage format (i.e. the
serialization logic for the GiST and GIN default operator classes)
appropriate? I tweaked the serialization logic here a bit, in order to
fix a bug. Overall, most bugs fixed were along the lines of "index
scans are not in agreement with sequential scans", and most fixes were
simple enough. Roughly speaking, on a few occasions opclass support
code needed to be reconciled with jsonb code (or existing core code
that it's implemented in terms of). None of these bugs were difficult
to fix, or in any way scary. Still, as I mentioned there are some open
items here.

* I haven't given the I/O code much attention, nor much of the code
written by Andrew. In general I haven't done all that much with the
parts of the patch that were previously proposed for core inclusion.

* I could have spent more time on testing and breaking things some
more, but didn't want to block on that. The more scrutiny this gets
the better.

Thoughts?
--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: Row-security on updatable s.b. views
Next
From: Alexander Korotkov
Date:
Subject: Re: jsonb and nested hstore