Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) |
Date | |
Msg-id | 20201213200608.GA2309473@rfd.leadboat.com Whole thread Raw |
In response to | HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) |
List | pgsql-hackers |
On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote: > But what jumps out at me here is that this sort of error seems way > too easy to make, and evidently way too hard to detect. What can we > do to make it more obvious if one has incorrectly used or omitted > HASH_BLOBS? Both directions of error might easily escape notice on > little-endian hardware. > > I thought of a few ideas, all of which have drawbacks: > > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default. > This seems to just move the problem somewhere else, besides which > it'd require touching an awful lot of callers, and would silently > break third-party callers. > > 2. Don't allow a default: invent a new HASH_STRING flag, and > require that hash_create() calls specify exactly one of HASH_BLOBS, > HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the > hazard of mindless-copy-and-paste, but I think it might make it > a little more obvious. Still requires touching a lot of calls. I like (2), for making the bug harder and for greppability. Probably pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS. > 3. Add some sort of heuristic restriction on keysize. A keysize > that's only 4 or 8 bytes almost certainly is not a string. > This doesn't give us much traction for larger keysizes, though. > > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)" > in string_hash(). I think we could get away with that given that > SQL disallows empty identifiers. However, it would only help to > catch one direction of error (omitting HASH_BLOBS), and it would > only help on big-endian hardware, which is getting harder to find. > Still, we could hope that the buildfarm would detect errors. It's nontrivial to confirm that the empty-string key can't happen for a given hash table. (In contrast, what (3) asserts on is usually a compile-time constant.) I would stop short of adding (4), though it could be okay. > A quick count of grep hits suggest that the large majority of > existing hash_create() calls use HASH_BLOBS, and there might be > only order-of-ten calls that would need to be touched if we > required an explicit HASH_STRING flag. So option #2 is seeming > kind of attractive. Maybe that together with an assertion that > string keys have to exceed 8 or 16 bytes would be enough protection. Agreed. I expect (2) gives most of the benefit. Requiring 8-byte capacity should be harmless, and most architectures can zero 8 bytes in one instruction. Requiring more bytes trades specificity for sensitivity. > A different angle we could think about is that the name "HASH_BLOBS" > is kind of un-obvious. Maybe we should deprecate that spelling in > favor of something like "HASH_BINARY". With (2) in place, I wouldn't worry about renaming HASH_BLOBS. It's hard to confuse with HASH_STRINGS or HASH_FUNCTION. If anything, HASH_BLOBS conveys something more specific. HASH_FUNCTION cases see binary data, but that data has structure that promotes it out of "blob" status.
pgsql-hackers by date: