On Tue, Dec 5, 2023 at 1:57 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Mon, 2023-12-04 at 12:12 +0700, John Naylor wrote:
> There's already a patch to use simplehash, and the API is a bit
> cleaner, and there's a minor performance improvement. It seems fairly
> non-controversial -- should I just proceed with that patch?
I won't object if you want to commit that piece now, but I hesitate to
call it a performance improvement on its own.
- The runtime measurements I saw reported were well within the noise level.
- The memory usage starts out better, but with more entries is worse.
> > From my point of view, it would at least be useful for C-strings,
> > where we don't have the length available up front.
>
> That's good news.
>
> By the way, is there any reason that we would need hash_bytes(s,
> strlen(s)) == cstring_hash(s)?
"git grep cstring_hash" found nothing, so not sure what you're asking.
> Each approach has its own optimization techniques. In (a), we can use
> the |= 0x20 trick, and for equality do a memcmp() check first.
I will assume you are referring to semantics, but on the odd chance
readers take this to mean the actual C library call, that wouldn't be
an optimization, that'd be a pessimization.
> As a tangential point, we may eventually want to provide a more
> internationalized definition of "case insensitive" for GUC names. That
> would be slightly easier with (b) than with (a), but we can cross that
> bridge if and when we come to it.
The risk/reward ratio seems pretty bad.
> It seems you are moving toward (a) whereas my patches moved toward (b).
> I am fine with either approach but I wanted to clarify which approach
> we are using.
I will make my case:
> In the abstract, I kind of like approach (b) because we don't need to
> be as special/clever with the hash functions.
In the abstract, I consider (b) to be a layering violation. As a
consequence, the cleverness in (b) is not confined to one or two
places, but is smeared over a whole bunch of places. I find it hard to
follow.
Concretely, it also adds another pointer to the element struct. That's
not good for a linear open-addressing array, which simplehash has.
Further, remember the equality function is important as well. In v3,
it was "strcmp(a,b)==0", which is a holdover from the dynahash API.
One of the advantages of the simplehash API is that we can 1) use an
equality function, which should be slightly cheaper than a full
comparison function, and 2) we have the option to inline it. (It
doesn't make sense in turn, to jump to a shared lib page and invoke an
indirect function call.) Once we've done that, it's already "special",
so it's not a stretch to make it do what we want to begin with. If a
nicer API is important, why not use it?