HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) |
Date | |
Msg-id | 590625.1607878171@sss.pgh.pa.us Whole thread Raw |
In response to | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions)
Re: HASH_BLOBS hazards (was Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions) |
List | pgsql-hackers |
Amit Kapila <amit.kapila16@gmail.com> writes: > On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote: >> The problem is xidhash using strcmp() to compare keys; it needs memcmp(). > Your analysis is correct. Sorry for not having noticed this thread before. Noah's fix is clearly correct, and I have no objection to the added test case. But what jumps out at me here is that this sort of error seems way too easy to make, and evidently way too hard to detect. What can we do to make it more obvious if one has incorrectly used or omitted HASH_BLOBS? Both directions of error might easily escape notice on little-endian hardware. I thought of a few ideas, all of which have drawbacks: 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default. This seems to just move the problem somewhere else, besides which it'd require touching an awful lot of callers, and would silently break third-party callers. 2. Don't allow a default: invent a new HASH_STRING flag, and require that hash_create() calls specify exactly one of HASH_BLOBS, HASH_STRING, or HASH_FUNCTION. This doesn't completely fix the hazard of mindless-copy-and-paste, but I think it might make it a little more obvious. Still requires touching a lot of calls. 3. Add some sort of heuristic restriction on keysize. A keysize that's only 4 or 8 bytes almost certainly is not a string. This doesn't give us much traction for larger keysizes, though. 4. Disallow empty string keys, ie something like "Assert(s_len > 0)" in string_hash(). I think we could get away with that given that SQL disallows empty identifiers. However, it would only help to catch one direction of error (omitting HASH_BLOBS), and it would only help on big-endian hardware, which is getting harder to find. Still, we could hope that the buildfarm would detect errors. There might be some more options. Also, some of these ideas could be applied in combination. A quick count of grep hits suggest that the large majority of existing hash_create() calls use HASH_BLOBS, and there might be only order-of-ten calls that would need to be touched if we required an explicit HASH_STRING flag. So option #2 is seeming kind of attractive. Maybe that together with an assertion that string keys have to exceed 8 or 16 bytes would be enough protection. Also, this census now suggests to me that the opposite problem (copy-and-paste HASH_BLOBS when you meant string keys) might be a real hazard, since so many of the existing prototypes that you might copy have HASH_BLOBS. I'm not sure if there's much to be done for this case though. A small saving grace is that it seems relatively likely that you'd notice a functional problem pretty quickly with this type of mistake, since lookups would tend to fail due to trailing garbage after your lookup string. A different angle we could think about is that the name "HASH_BLOBS" is kind of un-obvious. Maybe we should deprecate that spelling in favor of something like "HASH_BINARY". Thoughts? regards, tom lane
pgsql-hackers by date: