Re: Reproducible collisions in jsonb_hash - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Reproducible collisions in jsonb_hash
Date
Msg-id CA+Tgmob_=6+KfesBKkdeORZDnwXdY_wrYroSrLEKXOV3rruH0A@mail.gmail.com
In response to Re: Reproducible collisions in jsonb_hash  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Reproducible collisions in jsonb_hash  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, May 12, 2022 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
> >> AFAICT it happens because when iterating over a jsonb the hash function makes no distinction between raw scalars
> >> and arrays (it doesn't inspect v.val.array.rawScalar)
>
> > It does look rather like a bug, but I'm unclear about the implications
> > of fixing it.
>
> Changing this hash algorithm would break existing hash indexes on jsonb
> columns.  Maybe there aren't any, but even if so I can't get very excited
> about changing this.  Hash algorithms always have collisions, and we have
> never made any promise that ours are cryptographically strong.

I might be missing something, but I don't know why "cryptographically
strong" is the relevant concept here. It seems like the question is
how likely it is that this would lead to queries having crappy
performance. In general we want hash functions to deliver different
values for different inputs so that we give ourselves the best
possible chance of spreading keys evenly across buckets. If for
example we hashed every possible JSON object to the same constant
value, everything we tried to do with this hash function would suck,
and the problem would be so severe that I think we'd have to fix it,
even though it would mean a compatibility break. Or even if we hashed
every JSON integer to the same value, that would be horrible, because
it's quite likely that you could have a column full of json objects
where ignoring the difference between one integer and another results
in a ton of duplicate hash values.

Here, that doesn't seem too likely. You could have a column that
contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth
and they all get mapped onto the same bucket and you're sad. But in
practice you probably don't. So I'd judge that while it was probably a
mistake to
make the hash function work this way, it's not likely to cause serious
problems, and therefore we ought to maybe leave it alone for now, but
add a comment so that if we ever break backward-compatibility for any
other reason, we remember to fix this too.
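
FWIW, the collision itself is easy to check from SQL. I haven't re-run
this while writing, but per Valeriy's report all three of these should
return the same value, since the begin/end-array events contribute
nothing to the hash and the rawScalar flag is never consulted:

    SELECT jsonb_hash('"tom"'::jsonb),      -- raw scalar
           jsonb_hash('["tom"]'::jsonb),    -- one-element array
           jsonb_hash('[["tom"]]'::jsonb);  -- nested array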

IOW, I think I mostly agree with your conclusion, but perhaps not
entirely with the reasoning.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


