Thread: Brain fade in gin_extract_jsonb_path()

Brain fade in gin_extract_jsonb_path()

From
Tom Lane
Date:
I looked into bug #13756,
http://www.postgresql.org/message-id/20151105171933.14035.25039@wrigleys.postgresql.org

The cause of the problem is that gin_extract_jsonb_path() computes
a different hash for the "2" in
    {"a":[ ["b",{"x":1}], ["b",{"x":2}]]}
than it does for the "2" in
    {"a":[[{"x":2}]]}
And the cause of that is that it supposes that after emitting a hash
for a VALUE, it can just leave stack->hash to be reinitialized by the
next KEY or ELEM.  But if the next thing is a sub-object, we propagate
the previous value's hash down into the sub-object, causing values
therein to receive different hashes than they'd have gotten without
the preceding outer-level VALUE.

The attached one-liner fixes it, but of course this means that on-disk
jsonb_path_ops indexes are possibly broken and will need to be reindexed.
I see no way around that ... does anybody else?

            regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_gin.c b/src/backend/utils/adt/jsonb_gin.c
index 204fb8b..b917ec2 100644
*** a/src/backend/utils/adt/jsonb_gin.c
--- b/src/backend/utils/adt/jsonb_gin.c
*************** gin_extract_jsonb_path(PG_FUNCTION_ARGS)
*** 419,425 ****
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* Note: we assume we'll see KEY before another VALUE */
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
--- 419,426 ----
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* reset hash for next sub-object */
!                 stack->hash = stack->parent->hash;
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:

Re: Brain fade in gin_extract_jsonb_path()

From
Peter Geoghegan
Date:
On Thu, Nov 5, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The attached one-liner fixes it, but of course this means that on-disk
> jsonb_path_ops indexes are possibly broken and will need to be reindexed.
> I see no way around that ... does anybody else?

I think it's impossible to avoid having to recommend a reindex here.

-- 
Peter Geoghegan



Re: Brain fade in gin_extract_jsonb_path()

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Nov 5, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached one-liner fixes it, but of course this means that on-disk
>> jsonb_path_ops indexes are possibly broken and will need to be reindexed.
>> I see no way around that ... does anybody else?

> I think it's impossible to avoid having to recommend a reindex here.

Good, because that's not the only bug here :-(

Extending the example with

insert into t (j) values ('[[14,2,3]]');
insert into t (j) values ('[1,[14,2,3]]');

we get this with seqscan on:

regression=# select * from t where j @> '[[14]]';
        j
-----------------
 [[14, 2, 3]]
 [1, [14, 2, 3]]
(2 rows)

which is the right answer, and this with seqscan off:

regression=# select * from t where j @> '[[14]]';
      j
--------------
 [[14, 2, 3]]
(1 row)

which is not so much.

The reason for that is that when processing an outer-level WJB_BEGIN_XXX,
we initialize stack->hash to something that is different from
stack->parent->hash.  This is a waste of time in unnested cases because
when we hit the first WJB_KEY or WJB_ELEM, we'll reset stack->hash to
stack->parent->hash.  However, if we have a nested array, then processing
of elements inside that array treats stack->parent->hash as the reference
value, so that we compute different hash values than we would have at
outer level.  What's worse, it matters whether the nested array is the
first subobject or not, as shown by the above example; that's because once
we reset stack->hash to equal stack->parent->hash, hashes for later
elements will be inconsistent with the special-case initialization.
(Note: that explains why this is broken with my previous patch.  It fails
in much the same way with unmodified HEAD, but I think for slightly
different reasons, which I've not bothered to understand in detail.)

I think the easiest way to fix this is to forget about the special
initialization at outer level and just always initialize a new stack level
to have the same hash as its parent.  That leads to the first patch below
--- but once you look at that, you realize that we've got unnecessarily
many copies of stack->parent->hash into stack->hash, so what I actually
propose now is the second patch below.

I think that this only changes the hashes calculated for nested-array
cases, but it's still a larger change footprint than the previous patch.
Not that that affects much; a reindex is a reindex, and a bug is a bug,
so we don't have much wiggle room.

            regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_gin.c b/src/backend/utils/adt/jsonb_gin.c
index 204fb8b..9172268 100644
*** a/src/backend/utils/adt/jsonb_gin.c
--- b/src/backend/utils/adt/jsonb_gin.c
*************** gin_extract_jsonb_path(PG_FUNCTION_ARGS)
*** 375,406 ****
                  parent = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 if (parent->parent)
!                 {
!                     /*
!                      * We pass forward hashes from previous container nesting
!                      * levels so that nested arrays with an outermost nested
!                      * object will have element hashes mixed with the
!                      * outermost key.  It's also somewhat useful to have
!                      * nested objects' innermost values have hashes that are a
!                      * function of not just their own key, but outer keys too.
!                      *
!                      * Nesting an array within another array will not alter
!                      * innermost scalar element hash values, but that seems
!                      * inconsequential.
!                      */
!                     stack->hash = parent->hash;
!                 }
!                 else
!                 {
!                     /*
!                      * At the outermost level, initialize hash with container
!                      * type proxy value.  Note that this makes JB_FARRAY and
!                      * JB_FOBJECT part of the on-disk representation, but they
!                      * are that in the base jsonb object storage already.
!                      */
!                     stack->hash = (r == WJB_BEGIN_ARRAY) ? JB_FARRAY : JB_FOBJECT;
!                 }
                  stack->parent = parent;
                  break;
              case WJB_KEY:
--- 375,390 ----
                  parent = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 /*
!                  * We pass forward hashes from outer nesting levels so that
!                  * the hashes for nested values will include outer keys as
!                  * well as their own keys.
!                  *
!                  * Nesting an array within another array will not alter
!                  * innermost scalar element hash values, but that seems
!                  * inconsequential.
!                  */
!                 stack->hash = parent->hash;
                  stack->parent = parent;
                  break;
              case WJB_KEY:
*************** gin_extract_jsonb_path(PG_FUNCTION_ARGS)
*** 419,425 ****
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* Note: we assume we'll see KEY before another VALUE */
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
--- 403,410 ----
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* reset hash for next sub-object */
!                 stack->hash = stack->parent->hash;
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
diff --git a/src/backend/utils/adt/jsonb_gin.c b/src/backend/utils/adt/jsonb_gin.c
index 204fb8b..04d2318 100644
*** a/src/backend/utils/adt/jsonb_gin.c
--- b/src/backend/utils/adt/jsonb_gin.c
*************** gin_extract_jsonb_path(PG_FUNCTION_ARGS)
*** 375,425 ****
                  parent = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 if (parent->parent)
!                 {
!                     /*
!                      * We pass forward hashes from previous container nesting
!                      * levels so that nested arrays with an outermost nested
!                      * object will have element hashes mixed with the
!                      * outermost key.  It's also somewhat useful to have
!                      * nested objects' innermost values have hashes that are a
!                      * function of not just their own key, but outer keys too.
!                      *
!                      * Nesting an array within another array will not alter
!                      * innermost scalar element hash values, but that seems
!                      * inconsequential.
!                      */
!                     stack->hash = parent->hash;
!                 }
!                 else
!                 {
!                     /*
!                      * At the outermost level, initialize hash with container
!                      * type proxy value.  Note that this makes JB_FARRAY and
!                      * JB_FOBJECT part of the on-disk representation, but they
!                      * are that in the base jsonb object storage already.
!                      */
!                     stack->hash = (r == WJB_BEGIN_ARRAY) ? JB_FARRAY : JB_FOBJECT;
!                 }
                  stack->parent = parent;
                  break;
              case WJB_KEY:
!                 /* initialize hash from parent */
!                 stack->hash = stack->parent->hash;
!                 /* and mix in this key */
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* hash is now ready to incorporate the value */
                  break;
              case WJB_ELEM:
-                 /* array elements use parent hash mixed with element's hash */
-                 stack->hash = stack->parent->hash;
-                 /* FALL THRU */
              case WJB_VALUE:
                  /* mix the element or value's hash into the prepared hash */
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* Note: we assume we'll see KEY before another VALUE */
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
--- 375,405 ----
                  parent = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 /*
!                  * We pass forward hashes from outer nesting levels so that
!                  * the hashes for nested values will include outer keys as
!                  * well as their own keys.
!                  *
!                  * Nesting an array within another array will not alter
!                  * innermost scalar element hash values, but that seems
!                  * inconsequential.
!                  */
!                 stack->hash = parent->hash;
                  stack->parent = parent;
                  break;
              case WJB_KEY:
!                 /* mix this key into the current outer hash */
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* hash is now ready to incorporate the value */
                  break;
              case WJB_ELEM:
              case WJB_VALUE:
                  /* mix the element or value's hash into the prepared hash */
                  JsonbHashScalarValue(&v, &stack->hash);
                  /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
!                 /* reset hash for next key, value, or sub-object */
!                 stack->hash = stack->parent->hash;
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:

Re: Brain fade in gin_extract_jsonb_path()

From
Tom Lane
Date:
I wrote:
> I think the easiest way to fix this is to forget about the special
> initialization at outer level and just always initialize a new stack level
> to have the same hash as its parent.  That leads to the first patch below
> --- but once you look at that, you realize that we've got unnecessarily
> many copies of stack->parent->hash into stack->hash, so what I actually
> propose now is the second patch below.

Meh, that wasn't quite right either --- need to reset stack->hash after
processing a sub-object, not only after a scalar VALUE.  I think what I
committed is OK though.
        regards, tom lane