Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Date
Msg-id CAA4eK1JNUi_dpu2AF1rNRU+S1X3YF3OwxogFiMj8q4VXdPjO-A@mail.gmail.com
Whole thread Raw
In response to Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Mon, Dec 14, 2020 at 1:36 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> > But what jumps out at me here is that this sort of error seems way
> > too easy to make, and evidently way too hard to detect.  What can we
> > do to make it more obvious if one has incorrectly used or omitted
> > HASH_BLOBS?  Both directions of error might easily escape notice on
> > little-endian hardware.
> >
> > I thought of a few ideas, all of which have drawbacks:
> >
> > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> > This seems to just move the problem somewhere else, besides which
> > it'd require touching an awful lot of callers, and would silently
> > break third-party callers.
> >
> > 2. Don't allow a default: invent a new HASH_STRING flag, and
> > require that hash_create() calls specify exactly one of HASH_BLOBS,
> > HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> > hazard of mindless-copy-and-paste, but I think it might make it
> > a little more obvious.  Still requires touching a lot of calls.
>
> I like (2), for making the bug harder and for greppability.  Probably
> pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.
>
> > 3. Add some sort of heuristic restriction on keysize.  A keysize
> > that's only 4 or 8 bytes almost certainly is not a string.
> > This doesn't give us much traction for larger keysizes, though.
> >
> > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> > in string_hash().  I think we could get away with that given that
> > SQL disallows empty identifiers.  However, it would only help to
> > catch one direction of error (omitting HASH_BLOBS), and it would
> > only help on big-endian hardware, which is getting harder to find.
> > Still, we could hope that the buildfarm would detect errors.
>
> It's nontrivial to confirm that the empty-string key can't happen for a given
> hash table.  (In contrast, what (3) asserts on is usually a compile-time
> constant.)  I would stop short of adding (4), though it could be okay.
>
> > A quick count of grep hits suggest that the large majority of
> > existing hash_create() calls use HASH_BLOBS, and there might be
> > only order-of-ten calls that would need to be touched if we
> > required an explicit HASH_STRING flag.  So option #2 is seeming
> > kind of attractive.  Maybe that together with an assertion that
> > string keys have to exceed 8 or 16 bytes would be enough protection.
>
> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.
>

+1. I also think in most cases (2) would be sufficient to avoid such
bugs. Adding restriction on string size might annoy some out-of-core
user which is already using small strings. However, adding an 8-byte
restriction on string size would be still okay.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Next
From: Bharath Rupireddy
Date:
Subject: Re: Parallel Inserts in CREATE TABLE AS