Thread: Reproducible collisions in jsonb_hash

Reproducible collisions in jsonb_hash

From: Valeriy Meleshkin
Hello,

I've noticed that

jsonb_hash_extended(jsonb_v,0) = jsonb_hash_extended(jsonb_build_array(jsonb_v),0)

for any jsonb value jsonb_v.
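
For a concrete reproduction (the literal is arbitrary; any jsonb value shows the same thing):

    SELECT jsonb_hash_extended('"tom"'::jsonb, 0) AS scalar_hash,
           jsonb_hash_extended(jsonb_build_array('"tom"'::jsonb), 0) AS array_hash;

Both columns come back equal, and wrapping once more (jsonb_build_array of the array) collides with both as well.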

AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars and arrays (it doesn't inspect v.val.array.rawScalar):

https://github.com/postgres/postgres/blob/27b77ecf9f4d5be211900eda54d8155ada50d696/src/backend/utils/adt/jsonb_op.c#L326

Is this an intended behaviour or a bug?

Cheers,
Valeriy



Re: Reproducible collisions in jsonb_hash

From: Andrew Dunstan
On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
> Hello,
>
> I've noticed that 
>
> jsonb_hash_extended(jsonb_v,0) = jsonb_hash_extended(jsonb_build_array(jsonb_v),0)
>
> for any jsonb value jsonb_v. 
>
> AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars and arrays (it doesn't inspect v.val.array.rawScalar)
>
> https://github.com/postgres/postgres/blob/27b77ecf9f4d5be211900eda54d8155ada50d696/src/backend/utils/adt/jsonb_op.c#L326
>
> Is this an intended behaviour or a bug?
>

It does look rather like a bug, but I'm unclear about the implications
of fixing it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Reproducible collisions in jsonb_hash

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
>> AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars and arrays (it doesn't inspect v.val.array.rawScalar)

> It does look rather like a bug, but I'm unclear about the implications
> of fixing it.

Changing this hash algorithm would break existing hash indexes on jsonb
columns.  Maybe there aren't any, but even if so I can't get very excited
about changing this.  Hash algorithms always have collisions, and we have
never made any promise that ours are cryptographically strong.

            regards, tom lane



Re: Reproducible collisions in jsonb_hash

From: Robert Haas
On Thu, May 12, 2022 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
> >> AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars and arrays (it doesn't inspect v.val.array.rawScalar)
>
> > It does look rather like a bug, but I'm unclear about the implications
> > of fixing it.
>
> Changing this hash algorithm would break existing hash indexes on jsonb
> columns.  Maybe there aren't any, but even if so I can't get very excited
> about changing this.  Hash algorithms always have collisions, and we have
> never made any promise that ours are cryptographically strong.

I might be missing something, but I don't know why "cryptographically
strong" is the relevant concept here. It seems like the question is
how likely it is that this would lead to queries having crappy
performance. In general we want hash functions to deliver different
values for different inputs so that we give ourselves the best
possible chance of spreading keys evenly across buckets. If for
example we hashed every possible JSON object to the same constant
value, everything we tried to do with this hash function would suck,
and the problem would be so severe that I think we'd have to fix it,
even though it would mean a compatibility break. Or even if we hashed
every JSON integer to the same value, that would be horrible, because
it's quite likely that you could have a column full of json objects
where ignoring the difference between one integer and another results
in a ton of duplicate hash values.
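
(As a point of comparison, distinct integers do spread out fine today; a quick check like the one below should count 1000 distinct hashes, barring astronomically unlikely 64-bit collisions:

    SELECT count(DISTINCT jsonb_hash_extended(to_jsonb(i), 0))
    FROM generate_series(1, 1000) AS i;
)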

Here, that doesn't seem too likely. You could have a column that
contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth
and they all get mapped onto the same bucket and you're sad. But
probably not. So I'd judge that while it was probably a mistake to
make the hash function work this way, it's not likely to cause serious
problems. Therefore we ought to leave it alone for now, but add a
comment so that if we ever break backward compatibility for any other
reason, we remember to fix this too.
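
To make that concrete, this is the shape of column I mean; per the report upthread, every row currently gets the same hash:

    SELECT v, jsonb_hash_extended(v, 0)
    FROM (VALUES ('"tom"'::jsonb),
                 ('["tom"]'::jsonb),
                 ('[["tom"]]'::jsonb)) AS t(v);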

IOW, I think I mostly agree with your conclusion, but perhaps not
entirely with the reasoning.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Reproducible collisions in jsonb_hash

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> Here, that doesn't seem too likely. You could have a column that
> contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth
> and they all get mapped onto the same bucket and you're sad. But
> probably not.

Yeah, that might be a more useful way to think about it: is this likely
to cause performance-critical collisions in practice?  I agree that
that doesn't seem like a very likely situation, even given that you
might be using json for erratically-structured data.

            regards, tom lane



Re: Reproducible collisions in jsonb_hash

From: Stephen Frost
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Here, that doesn't seem too likely. You could have a column that
> > contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth
> > and they all get mapped onto the same bucket and you're sad. But
> > probably not.
>
> Yeah, that might be a more useful way to think about it: is this likely
> to cause performance-critical collisions in practice?  I agree that
> that doesn't seem like a very likely situation, even given that you
> might be using json for erratically-structured data.

Particularly for something like jsonb (but maybe other things?) having a
hash function that could be user-defined or at least have some options
seems like it would be quite nice (similar to compression...).  If we
were to go in the direction of changing this, I'd suggest that we try to
make it something where the existing function could still be used while
also allowing a new one to be used.  More flexibility would be even
better, of course (column-specific hash functions come to mind...).
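
For what it's worth, a hand-rolled version of this is already possible
through a custom hash operator class.  A rough sketch (untested, and the
names here are made up) that just hashes the canonical text form, which
is safe because equal jsonb values always deparse identically:

    -- Hash jsonb via its canonical text representation.  jsonb's text
    -- output is deterministic (keys sorted, duplicates removed), so
    -- equal values get equal hashes.
    CREATE FUNCTION jsonb_text_hash(jsonb) RETURNS integer
        AS 'SELECT hashtext($1::text)'
        LANGUAGE sql IMMUTABLE PARALLEL SAFE;

    CREATE OPERATOR CLASS jsonb_text_hash_ops FOR TYPE jsonb USING hash AS
        OPERATOR 1 =,
        FUNCTION 1 jsonb_text_hash(jsonb);

    -- usage: CREATE INDEX ON t USING hash (j jsonb_text_hash_ops);

That only helps hash indexes, though (and hash partitioning, if the
opclass also supplied the extended, seeded support function 2); hash
joins and hash aggregation pick their hash function from the equality
operator rather than a per-column opclass, so real flexibility would
still need something new.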

Agreed with the general conclusion here also, just wanted to share some
thoughts on possible future directions to go in.

Thanks,

Stephen
