Thread: Change GUC hashtable to use simplehash?
I had briefly experimented changing the hash table in guc.c to use simplehash. It didn't offer any measurable speedup, but the API is slightly nicer. I thought I'd post the patch in case others thought this was a good direction or nice cleanup. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
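For reference, a simplehash conversion of this sort generally boils down to a specialization along the lines of the sketch below (based on the discussion that follows, not necessarily the attached patch; the entry layout and helper names are assumed):

/*
 * Sketch of a simplehash specialization for the GUC hash table.
 * Illustrative only; assumes guc_name_hash() and guc_name_compare()
 * as they exist in guc.c.
 */
typedef struct GUCHashEntry
{
    const char *gucname;            /* hash key: GUC name */
    struct config_generic *gucvar;  /* the GUC variable itself */
    char        status;             /* entry status, required by simplehash */
} GUCHashEntry;

#define SH_PREFIX       guc_name
#define SH_ELEMENT_TYPE GUCHashEntry
#define SH_KEY_TYPE     const char *
#define SH_KEY          gucname
#define SH_HASH_KEY(tb, key)    guc_name_hash(key)
#define SH_EQUAL(tb, a, b)      (guc_name_compare(a, b) == 0)
#define SH_SCOPE        static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

The template then generates guc_name_create(), guc_name_insert(), guc_name_lookup(), and so on for the callers in guc.c.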
On Fri, Nov 17, 2023 at 11:02 AM Jeff Davis <pgsql@j-davis.com> wrote: > > I had briefly experimented changing the hash table in guc.c to use > simplehash. It didn't offer any measurable speedup, but the API is > slightly nicer. > > I thought I'd post the patch in case others thought this was a good > direction or nice cleanup. This is not a comment on the patch itself, but since GUC operations are not typically considered performance or space sensitive, this comment from simplehash.h makes a case against it. * It's probably not worthwhile to generate such a specialized implementation * for hash tables that aren't performance or space sensitive. But your argument of a nicer API might make a case for the patch. Best regards, Gurjeet http://Gurje.et
On Fri, 2023-11-17 at 13:22 -0800, Gurjeet Singh wrote: > This is not a comment on the patch itself, but since GUC operations > are not typically considered performance or space sensitive, A "SET search_path" clause on a CREATE FUNCTION is a case for better performance in guc.c, because it repeatedly sets and rolls back the setting on each function invocation. Unfortunately, this patch doesn't really improve the performance. The reason the hash table in guc.c is slow is because of the case folding in both hashing and comparison. I might get around to fixing that, which could have a minor impact, and perhaps then the choice between hsearch/simplehash would matter. > this > comment from simplehash.h makes a case against it. > > * It's probably not worthwhile to generate such a specialized > implementation > * for hash tables that aren't performance or space sensitive. > > But your argument of a nicer API might make a case for the patch. Yeah, that's what I was thinking. simplehash is newer and has a nicer API, so if we like it and want to move more code over, this is one step. But if we are fine using both hsearch.h and simplehash.h for overlapping use cases indefinitely, then I'll drop this. Regards, Jeff Davis
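For readers following along, the hashing this refers to case-folds every character on every probe; the function in guc.c at this point is roughly the following (paraphrased, not part of the patch), and guc_name_compare() does the same per-character folding on the comparison side:

static uint32
guc_name_hash(const char *name)
{
    uint32      result = 0;

    while (*name)
    {
        char        ch = *name++;

        /* Case-fold in the same way as guc_name_compare */
        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        /* Merge into hash ... not very bright, but it needn't be */
        result = pg_rotate_left32(result, 5);
        result ^= (uint32) ch;
    }
    return result;
}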
Jeff Davis <pgsql@j-davis.com> writes: > On Fri, 2023-11-17 at 13:22 -0800, Gurjeet Singh wrote: >> But your argument of a nicer API might make a case for the patch. > Yeah, that's what I was thinking. simplehash is newer and has a nicer > API, so if we like it and want to move more code over, this is one > step. But if we are fine using both hsearch.h and simplehash.h for > overlapping use cases indefinitely, then I'll drop this. I can't imagine wanting to convert *every* hashtable in the system to simplehash; the added code bloat would be unreasonable. So yeah, I think we'll have two mechanisms indefinitely. That's not to say that we might not rewrite hsearch. But simplehash was never meant to be a universal solution. regards, tom lane
Hi, On 2023-11-17 13:44:21 -0800, Jeff Davis wrote: > On Fri, 2023-11-17 at 13:22 -0800, Gurjeet Singh wrote: > > This is not a comment on the patch itself, but since GUC operations > > are not typically considered performance or space sensitive, I don't think that's quite right - we have a lot of GUCs and they're loaded in each connection. And there's set/reset around transactions etc. So even without the search path stuff that Jeff mentioned, it could be worth optimizing this. > Yeah, that's what I was thinking. simplehash is newer and has a nicer > API, so if we like it and want to move more code over, this is one > step. But if we are fine using both hsearch.h and simplehash.h for > overlapping use cases indefinitely, then I'll drop this. Right now there are use cases where simplehash isn't really usable (if stable pointers to hash elements are needed and/or the entries are very large). I've been wondering about providing a layer on top of simplehash, or an option to simplehash, providing that, though. That then could perhaps also implement runtime defined key sizes. I think this would be a completely fair thing to port over - whether it's worth it I don't quite know, but I'd not be against it on principle or such. Greetings, Andres Freund
On Fri, 2023-11-17 at 17:04 -0500, Tom Lane wrote: > I can't imagine wanting to convert *every* hashtable in the system > to simplehash; the added code bloat would be unreasonable. So yeah, > I think we'll have two mechanisms indefinitely. That's not to say > that we might not rewrite hsearch. But simplehash was never meant > to be a universal solution. OK, I will withdraw the patch until/unless it provides a concrete benefit. Regards, Jeff Davis
Hi, On 2023-11-17 17:04:04 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > On Fri, 2023-11-17 at 13:22 -0800, Gurjeet Singh wrote: > >> But your argument of a nicer API might make a case for the patch. > > > Yeah, that's what I was thinking. simplehash is newer and has a nicer > > API, so if we like it and want to move more code over, this is one > > step. But if we are fine using both hsearch.h and simplehash.h for > > overlapping use cases indefinitely, then I'll drop this. > > I can't imagine wanting to convert *every* hashtable in the system > to simplehash; the added code bloat would be unreasonable. Yea. And it's also just not suitable for everything. Stable pointers can be very useful and some places have entries that are too large to be moved during collisions. Chained hashtables have their place. > So yeah, I think we'll have two mechanisms indefinitely. That's not to say > that we might not rewrite hsearch. We probably should. It's awkward to use, the code is very hard to follow, and it's really not very fast. Part of that is due to serving too many masters. I doubt it's a good idea to use the same code for highly contended, partitioned, shared memory hashtables and many tiny local memory hashtables. The design goals are just very different. Greetings, Andres Freund
On Fri, 2023-11-17 at 14:08 -0800, Andres Freund wrote: > I think this would be a completely fair thing to port over - whether > it's > worth it I don't quite know, but I'd not be against it on principle > or such. Right now I don't think it offers much. I'll see if I can solve the case-folding slowness first, and then maybe it will be measurable. Regards, Jeff Davis
Hi, On 2023-11-17 14:08:56 -0800, Jeff Davis wrote: > On Fri, 2023-11-17 at 17:04 -0500, Tom Lane wrote: > > I can't imagine wanting to convert *every* hashtable in the system > > to simplehash; the added code bloat would be unreasonable. So yeah, > > I think we'll have two mechanisms indefinitely. That's not to say > > that we might not rewrite hsearch. But simplehash was never meant > > to be a universal solution. > > OK, I will withdraw the patch until/unless it provides a concrete > benefit. It might already in the space domain: SELECT count(*), sum(total_bytes) total_bytes, sum(total_nblocks) total_nblocks, sum(free_bytes) free_bytes, sum(free_chunks)free_chunks, sum(used_bytes) used_bytes FROM pg_backend_memory_contexts WHERE name LIKE 'GUC%'; HEAD: ┌───────┬─────────────┬───────────────┬────────────┬─────────────┬────────────┐ │ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes │ ├───────┼─────────────┼───────────────┼────────────┼─────────────┼────────────┤ │ 2 │ 57344 │ 5 │ 25032 │ 10 │ 32312 │ └───────┴─────────────┴───────────────┴────────────┴─────────────┴────────────┘ your patch: ┌───────┬─────────────┬───────────────┬────────────┬─────────────┬────────────┐ │ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes │ ├───────┼─────────────┼───────────────┼────────────┼─────────────┼────────────┤ │ 1 │ 36928 │ 3 │ 12360 │ 3 │ 24568 │ └───────┴─────────────┴───────────────┴────────────┴─────────────┴────────────┘ However, it fares less well at larger number of GUCs, performance wise. At first I thought that that's largely because you aren't using SH_STORE_HASH. With that, it's slower when creating a large number of GUCs, but a good bit faster retrieving them. But that slowness didn't seem right. Then I noticed that memory usage was too large when creating many GUCs - a bit of debugging later, I figured out that that's due to guc_name_hash() being terrifyingly bad. There's no bit mixing whatsoever! Which leads to very large numbers of hash conflicts - which simplehash tries to defend against a bit by making the table larger. (gdb) p guc_name_hash("andres.c2") $14 = 3798554171 (gdb) p guc_name_hash("andres.c3") $15 = 3798554170 Fixing that makes simplehash always faster, but still doesn't win on memory usage at the upper end - the two pointers in GUCHashEntry make it too big. I think, independent of this patch, it might be worth requiring that hash table lookups applied the transformation before the lookup. A comparison function this expensive is not great... Greetings, Andres Freund
On Fri, 2023-11-17 at 15:27 -0800, Andres Freund wrote: > At > first I thought that that's largely because you aren't using > SH_STORE_HASH. I might want to use that in the search_path cache, then. The lookup wasn't showing up much in the profile the last I checked, but I'll take a second look. > Then I noticed that memory usage was too large when creating many > GUCs - a bit > of debugging later, I figured out that that's due to guc_name_hash() > being > terrifyingly bad. There's no bit mixing whatsoever! Wow. It seems like hash_combine() could be more widely used in other places, too? Here it seems like a worse problem because strings really need mixing, and maybe ExecHashGetHashValue doesn't. But it seems easier to use hash_combine() everywhere so that we don't have to think about strange cases. > I think, independent of this patch, it might be worth requiring that > hash > table lookups applied the transformation before the lookup. A > comparison > function this expensive is not great... The requested name is already case-folded in most contexts. We can do a lookup first, and if that fails, case-fold and try again. I'll hack up a patch -- I believe that would be measurable for the proconfigs. Regards, Jeff Davis
Hi, On 2023-11-17 16:01:31 -0800, Jeff Davis wrote: > On Fri, 2023-11-17 at 15:27 -0800, Andres Freund wrote: > > At > > first I thought that that's largely because you aren't using > > SH_STORE_HASH. > > I might want to use that in the search_path cache, then. The lookup > wasn't showing up much in the profile the last I checked, but I'll take > a second look. It also matters for insertions, fwiw. > > Then I noticed that memory usage was too large when creating many > > GUCs - a bit > > of debugging later, I figured out that that's due to guc_name_hash() > > being > > terrifyingly bad. There's no bit mixing whatsoever! > > Wow. > > It seems like hash_combine() could be more widely used in other places, > too? I don't think hash_combine() alone helps that much - you need to actually use a hash function for the values you are combining. Using a character value alone as a 32bit hash value unsurprisingly leads to a very uneven distribution of bits set in hash values. > Here it seems like a worse problem because strings really need > mixing, and maybe ExecHashGetHashValue doesn't. But it seems easier to > use hash_combine() everywhere so that we don't have to think about > strange cases. Yea. > > I think, independent of this patch, it might be worth requiring that > > hash > > table lookups applied the transformation before the lookup. A > > comparison > > function this expensive is not great... > > The requested name is already case-folded in most contexts. We can do a > lookup first, and if that fails, case-fold and try again. I'll hack up > a patch -- I believe that would be measurable for the proconfigs. I'd just always case fold before lookups. The expensive bit of the case folding imo is that you need to do awkward things during hash lookups. Greetings, Andres Freund
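To make the point concrete, hash_combine() (as in common/hashfn.h) only folds together two values that are assumed to already be hashes, so each character would need to go through a real hash function first, along the lines of this sketch (illustrative only):

/* hash_combine() merges two hash values, boost-style */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}

/*
 * Illustrative only: hash each folded character with murmurhash32()
 * (from hashfn.h) before combining, so that every input bit has a
 * chance to affect every output bit.
 */
static uint32
guc_name_hash_combined(const char *name)
{
    uint32      result = 0;

    while (*name)
    {
        char        ch = *name++;

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        result = hash_combine(result, murmurhash32((uint32) ch));
    }
    return result;
}

Whether a per-character murmurhash32() call is affordable here is the trade-off discussed a few messages below.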
Hi, On Fri, 2023-11-17 at 16:10 -0800, Andres Freund wrote: > > The requested name is already case-folded in most contexts. We can > > do a > > lookup first, and if that fails, case-fold and try again. I'll hack > > up > > a patch -- I believe that would be measurable for the proconfigs. > > I'd just always case fold before lookups. The expensive bit of the > case > folding imo is that you need to do awkward things during hash > lookups. Attached are a bunch of tiny patches and some perf numbers based on simple test described here: https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com 0001: Use simplehash (without SH_STORE_HASH) 0002: fold before lookups 0003: have gen->name_key alias gen->name in typical case. Saves allocations in typical case where the name is already folded. 0004: second-chance lookup in hash table (avoids case-folding for already-folded names) 0005: Use SH_STORE_HASH (These are split out into tiny patches for perf measurement, some are pretty obvious but I wanted to see the impact, if any.) Numbers below are cumulative (i.e. 0003 includes 0002 and 0001): master: 7899ms 0001: 7850 0002: 7958 0003: 7942 0004: 7549 0005: 7411 I'm inclined toward all of these patches. I'll also look at adding SH_STORE_HASH for the search_path cache. Looks like we're on track to bring the overhead of SET search_path down to reasonable levels. Thank you! Regards, Jeff Davis
Attachment
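The second-chance lookup in 0004 is roughly the shape below (a sketch only; hashtab_lookup() is a hypothetical stand-in for whatever probe the table actually uses):

/*
 * Sketch of the 0004 idea: most callers pass an already-folded name,
 * so probe with the name as given and only case-fold and retry on a miss.
 */
static struct config_generic *
find_option_two_pass(const char *name)
{
    struct config_generic *gen;
    char       *folded;

    /* first chance: the common case, name is already lower case */
    gen = hashtab_lookup(name);
    if (gen != NULL)
        return gen;

    /* second chance: fold to lower case and try again */
    folded = pstrdup(name);
    for (char *p = folded; *p; p++)
    {
        if (*p >= 'A' && *p <= 'Z')
            *p += 'a' - 'A';
    }
    gen = hashtab_lookup(folded);
    pfree(folded);
    return gen;
}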
On Mon, Nov 20, 2023 at 5:54 AM Jeff Davis <pgsql@j-davis.com> wrote: > > Attached are a bunch of tiny patches and some perf numbers based on > simple test described here: > > https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com I tried taking I/O out, like this, thinking the times would be less variable: cat bench.sql select 1 from generate_series(1,500000) x(x), lateral (SELECT inc_ab(x)) a offset 10000000; (with turbo off) pgbench -n -T 30 -f bench.sql -M prepared master: latency average = 643.625 ms 0001-0005: latency average = 607.354 ms ...about 5.5% less time, similar to what Jeff found. I get a noticeable regression in 0002, though, and I think I see why: guc_name_hash(const char *name) { - uint32 result = 0; + const unsigned char *bytes = (const unsigned char *)name; + int blen = strlen(name); The strlen call required for hashbytes() is not free. The lack of mixing in the (probably inlined after 0001) previous hash function can be remedied directly, as in the attached: 0001-0002 only: latency average = 670.059 ms 0001-0002, plus revert hashbytes, add finalizer: latency average = 656.810 ms -#define SH_EQUAL(tb, a, b) (guc_name_compare(a, b) == 0) +#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0) Likewise, I suspect calling out to the C library is going to throw away some of the gains that were won by not needing to downcase all the time, but I haven't dug deeper.
Attachment
On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote: > The strlen call required for hashbytes() is not free. Should we have a hash_string() that's like hash_bytes() but checks for the NUL terminator itself? That wouldn't be inlinable, but it would save on the strlen() call. It might benefit some other callers? Regards, Jeff Davis
On Wed, Nov 22, 2023 at 12:00 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote: > > The strlen call required for hashbytes() is not free. > > Should we have a hash_string() that's like hash_bytes() but checks for > the NUL terminator itself? > > That wouldn't be inlinable, but it would save on the strlen() call. It > might benefit some other callers? We do have string_hash(), which...calls strlen. :-) Thinking some more, I'm not quite comfortable with the number of places in these patches that have to know about the pre-downcased strings, or whether we need that in the first place. If lower case is common enough to optimize for, it seems the equality function can just check strict equality on the char and only on mismatch try downcasing before returning false. Doing our own function would allow the compiler to inline it, or at least keep it on the same page. Further, the old hash function shouldn't need to branch to do the same downcasing, since hashing is lossy anyway. In the keyword hashes, we just do "*ch |= 0x20", which downcases letters and turns underscores to DEL. I can take a stab at that later.
I wrote: > Thinking some more, I'm not quite comfortable with the number of > places in these patches that have to know about the pre-downcased > strings, or whether we need that in the first place. If lower case is > common enough to optimize for, it seems the equality function can just > check strict equality on the char and only on mismatch try downcasing > before returning false. Doing our own function would allow the > compiler to inline it, or at least keep it on the same page. Further, > the old hash function shouldn't need to branch to do the same > downcasing, since hashing is lossy anyway. In the keyword hashes, we > just do "*ch |= 0x20", which downcases letters and turns underscores to > DEL. I can take a stab at that later. v4 is a quick POC for that. I haven't verified that it's correct for the case where the probe and the entry don't match, but if it isn't, it should be easy to fix. I also didn't bother with SH_STORE_HASH in my testing. 0001 adds the murmur32 finalizer -- we should do that regardless of anything else in this thread. 0002 is just Jeff's 0001 0003 adds an equality function that downcases lazily, and teaches the hash function about the 0x20 trick. master: latency average = 581.765 ms v3 0001-0005: latency average = 544.576 ms v4 0001-0003: latency average = 547.489 ms This gives similar results with a tiny amount of code (excluding the simplehash conversion). I didn't check if the compiler inlined these functions, but we can hint it if necessary. We could use the new equality function in all the call sites that currently test for "guc_name_compare() == 0", in which case it might not end up inlined, but that's probably okay. We could also try to improve the hash function's collision behavior by collecting the bytes on a uint64 and calling our new murmur64 before returning the lower half, but that's speculative.
Attachment
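The lazily-downcasing equality function in v4-0003 is along these lines (a sketch assuming ASCII GUC names, not the exact patch):

/*
 * Sketch of the v4-0003 approach: compare bytes directly and only pay
 * for case folding when a pair of bytes differs.
 */
static inline bool
guc_name_eq(const char *namea, const char *nameb)
{
    while (*namea && *nameb)
    {
        char        cha = *namea++;
        char        chb = *nameb++;

        if (cha != chb)
        {
            /* mismatch: downcase both and re-check */
            cha |= 0x20;
            chb |= 0x20;
            if (cha != chb || cha < 'a' || cha > 'z')
                return false;
        }
    }
    /* equal only if both strings ended together */
    return *namea == *nameb;
}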
Hi, On 2023-11-21 16:42:55 +0700, John Naylor wrote: > I get a noticeable regression in 0002, though, and I think I see why: > > guc_name_hash(const char *name) > { > - uint32 result = 0; > + const unsigned char *bytes = (const unsigned char *)name; > + int blen = strlen(name); > > The strlen call required for hashbytes() is not free. The lack of > mixing in the (probably inlined after 0001) previous hash function can > remedied directly, as in the attached: I doubt this is a good hashfunction. For short strings, sure, but after that... I don't think it makes sense to reduce the internal state of a hash function to something this small. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2023-11-21 16:42:55 +0700, John Naylor wrote: >> The strlen call required for hashbytes() is not free. The lack of >> mixing in the (probably inlined after 0001) previous hash function can >> remedied directly, as in the attached: > I doubt this is a good hashfunction. For short strings, sure, but after > that... I don't think it makes sense to reduce the internal state of a hash > function to something this small. GUC names are just about always short, though, so I'm not sure you've made your point? At worst, maybe this with 64-bit state instead of 32? regards, tom lane
Hi, On 2023-11-22 15:56:21 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-11-21 16:42:55 +0700, John Naylor wrote: > >> The strlen call required for hashbytes() is not free. The lack of > >> mixing in the (probably inlined after 0001) previous hash function can > >> remedied directly, as in the attached: > > > I doubt this is a good hashfunction. For short strings, sure, but after > > that... I don't think it makes sense to reduce the internal state of a hash > > function to something this small. > > GUC names are just about always short, though, so I'm not sure you've > made your point? With short I meant <= 6 characters (32 / 5 = 6.x). After that you're overwriting bits that you previously set, without dispersing the "overwritten" bits throughout the hash state. It's pretty easy to create conflicts this way, even just on paper. E.g. I think abcdefgg and cbcdefgw would have the same hash, because the accumulated value passed to murmurhash32 is the same. The fact that this happens when a large part of the string is the same is bad, because it makes it more likely that prefixed strings trigger such conflicts, and they're obviously common with GUC strings. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2023-11-22 15:56:21 -0500, Tom Lane wrote: >> GUC names are just about always short, though, so I'm not sure you've >> made your point? > With short I meant <= 6 characters (32 / 5 = 6.x). After that you're > overwriting bits that you previously set, without dispersing the "overwritten" > bits throughout the hash state. I'm less than convinced about the "overwrite" part: + /* Merge into hash ... not very bright, but it needn't be */ + result = pg_rotate_left32(result, 5); + result ^= (uint32) ch; Rotating a 32-bit value 5 bits at a time doesn't result in successive characters lining up exactly, and even once they do, XOR is not "overwrite". I'm pretty dubious that we need something better than this. regards, tom lane
Hi, On 2023-11-22 16:27:56 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-11-22 15:56:21 -0500, Tom Lane wrote: > >> GUC names are just about always short, though, so I'm not sure you've > >> made your point? > > > With short I meant <= 6 characters (32 / 5 = 6.x). After that you're > > overwriting bits that you previously set, without dispersing the "overwritten" > > bits throughout the hash state. > > I'm less than convinced about the "overwrite" part: > > + /* Merge into hash ... not very bright, but it needn't be */ > + result = pg_rotate_left32(result, 5); > + result ^= (uint32) ch; > > Rotating a 32-bit value 5 bits at a time doesn't result in successive > characters lining up exactly, and even once they do, XOR is not > "overwrite". I didn't know what word to use, hence the air quotes. Yes, xor doesn't just set the bits to the right hand side in, but it just affects data on a per-bit basis, which easily can be cancelled out. My understanding of writing hash functions is that every added bit mixed in should have a ~50% chance of causing each other bit to flip. The proposed function obviously doesn't get there. It's worth noting that the limited range of the input values means that there's a lot of bias toward some bits being set ('a' to 'z' all start with 0b011). > I'm pretty dubious that we need something better than this. Well, we know that the current attempt at a dedicated hash function for this does result in substantial amounts of conflicts. And it's hard to understand such cases when you hit them, so I think it's better to avoid exposing ourselves to such dangers, without a distinct need. And I don't really see the need here to risk it, even if we are somewhat confident it's fine. If, which I mildly doubt, we can't afford to call murmurhash32 for every character, we could just call it for 32/5 input characters together. Or we could just load up to 8 characters into a 64bit integer, and call murmurhash64. Something roughly like uint64 result = 0; while (*name) { uint64 value = 0; for (int i = 0; i < 8 && *name; i++) { char ch = *name++; value = (value << 8) | (unsigned char) ch; } result = hash_combine64(result, murmurhash64(value)); } The hash_combine use isn't quite right either, we should use the full accumulator state of a proper hash function, but it seems very unlikely to matter here. The fact that string_hash() is slow due to the strlen(), which causes us to process the input twice and which is optimized to also handle very long strings which typically string_hash() doesn't encounter, seems problematic far beyond this case. We use string_hash() in a *lot* of places, and that strlen() does regularly show up in profiles. We should fix that. The various hash functions being external functions also shows up in a bunch of profiles too. It's particularly ridiculous for cases like tag_hash(), where the caller typically knows the length, but calls a routine in a different translation unit, which obviously can't be optimized for a specific length. I think we ought to adjust our APIs around this: 1) The accumulator state of the hash functions should be exposed, so one can accumulate values into the hash state, without reducing the internal state to a single 32/64 bit variable. 2) For callers that know the length of data, we should use a static inline hash function, rather than an external function call. This should include special cased inline functions for adding 32/64bit of data to the hash state.
Perhaps with a bit of logic to *not* use the inline version if the hashed input is long (and thus the call overhead doesn't matter). Something like if (__builtin_constant_p(len) && len < 128) /* call inline implementation */ else /* call out of line implementation, not worth the code bloat */ We know that hash functions should have the split into init/process data*/finish steps, as e.g. evidenced by pg_crc.h/pg_crc32.h. With something like that, you could write a function that lowercases characters inline without incurring unnecessary overhead. hash32_state hs; hash32_init(&hs); while (*name) { char ch = *name++; /* crappy lowercase for this situation */ ch |= 0x20; hash32_process_byte(&hs, ch); } return hash32_finish(&hs); Perhaps with some additional optimization for processing the input string in 32/64 bit quantities. Greetings, Andres Freund
On Thu, Nov 23, 2023 at 5:34 AM Andres Freund <andres@anarazel.de> wrote: > It's worth noting that the limited range of the input values means that > there's a lot of bias toward some bits being set ('a' to 'z' all start with > 0b011). We can take advantage of the limited range with a single additional instruction: After "ch |= 0x20", do "ch -= ('a' - 1)". That'll shrink letters and underscores to the range [1,31], which fits in 5 bits. (Other characters are much less common in a guc name). That increases randomness and allows 12 chars to be xor'd in before the first bits rotate around. > If, which I mildly doubt, we can't afford to call murmurhash32 for every > character, we could just call it for 32/5 input characters together. Or we > could just load up to 8 characters into an 64bit integer, can call > murmurhash64. I'll play around with this idea, as well. > The fact that string_hash() is slow due to the strlen(), which causes us to > process the input twice and which is optimized to also handle very long > strings which typically string_hash() doesn't encounter, seems problematic far > beyond this case. We use string_hash() in a *lot* of places, and that strlen() > does regularly show up in profiles. We should fix that. +1 > I think we ought to adjust our APIs around this: > 1) The accumulator state of the hash functions should be exposed, so one can > accumulate values into the hash state, without reducing the internal state > to a single 32/64 bit variable. If so, it might make sense to vendor a small, suitably licensed hash function that already has these APIs. While on the subject, it'd be good to have a clear separation between in-memory and on-disk usage, so we can make breaking changes in the former.
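A concrete sketch of that transform, bolted onto the existing rotate-and-XOR loop plus the murmur finalizer from v4 (illustrative only):

static uint32
guc_name_hash_5bit(const char *name)
{
    uint32      result = 0;

    while (*name)
    {
        unsigned char ch = (unsigned char) *name++;

        ch |= 0x20;             /* lowercase letters; '_' becomes 0x7F */
        ch -= 'a' - 1;          /* 'a'..'z' -> 1..26, folded '_' -> 31 */
        result = pg_rotate_left32(result, 5);
        result ^= (uint32) ch;
    }
    return murmurhash32(result);
}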
Attached is a rough start with Andres's earlier ideas, to get something concrete out there. I took a look around at other implementations a bit. Many modern hash functions use MUM-style hashing, which typically uses 128-bit arithmetic. Even if they already have an incremental interface and have a compatible license, it seems a bit too much work to adopt just for a couple string use cases. Might be useful elsewhere, but that's off topic. However, I did find a couple hash functions that are much simpler to adapt to a bytewise interface, pass SMHasher, and are decently fast on short inputs: - fast-hash, MIT licensed, and apparently has some use in software [1] - MX3, CC0 license (looking around, seems controversial for a code license, so didn't go further). [2] Seems to be a for-fun project, but the accompanying articles are very informative on how to develop these things. After whacking fast-hash around, it doesn't really resemble the original much, and if for some reason we went as far as switching out the mixing/final functions, it may as well be called completely original work. I thought it best to start with something whose mixing behavior passes SMHasher, and hopefully preserve that property. Note that the combining and final steps share most of their arithmetic operations. This may have been done on purpose to minimize binary size, but I didn't check. Also, it incorporates input length into the calculation. Since we don't know the length of C strings up front, I threw that out for now. It'd be possible to track the length as we go and incorporate something into the final step. The hard part is verifying it hasn't lost any quality. v5-0001 puts fast-hash as-is into a new header, named in a way to convey in-memory use e.g. hash tables. v5-0002 does the minimal to allow dynahash to use this for string_hash, inlined but still calling strlen. v5-0003 shows one way to do an incremental interface. It might be okay for simplehash with fixed length keys, but seems awkward for strings. v5-0004 shows a bytewise incremental interface, with implementations for dynahash (getting rid of strlen) and guc hash. [1] https://code.google.com/archive/p/fast-hash/ [2] https://github.com/jonmaiga/mx3
Attachment
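For readers who want the incremental shape spelled out, it is roughly the following; the mixing steps and constants are fast-hash's, but the function names and details here are a simplified sketch, not necessarily the attached patches:

typedef struct fasthash_state
{
    uint64      hash;           /* running 64-bit hash state */
} fasthash_state;

static inline uint64
fasthash_mix(uint64 h)
{
    h ^= h >> 23;
    h *= UINT64CONST(0x2127599bf4325c37);
    h ^= h >> 47;
    return h;
}

static inline void
fasthash_init(fasthash_state *hs, int len, uint64 seed)
{
    hs->hash = seed ^ (len * UINT64CONST(0x880355f21e6d1965));
}

/* accumulate up to 8 bytes of input into the running state */
static inline void
fasthash_accum(fasthash_state *hs, const char *k, int len)
{
    uint64      chunk = 0;

    Assert(len <= 8);
    memcpy(&chunk, k, len);
    hs->hash ^= fasthash_mix(chunk);
    hs->hash *= UINT64CONST(0x880355f21e6d1965);
}

static inline uint64
fasthash_final64(fasthash_state *hs)
{
    return fasthash_mix(hs->hash);
}

static inline uint32
fasthash_final32(fasthash_state *hs)
{
    /* reduce the same way upstream fasthash32 does */
    uint64      h = fasthash_final64(hs);

    return (uint32) (h - (h >> 32));
}

A bytewise variant for C strings then just accumulates characters (singly, or packed into chunks) until the terminator is reached.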
On 29/11/2023 15:31, John Naylor wrote: > However, I did find a couple hash functions that are much simpler to > adapt to a bytewise interface, pass SMHasher, and are decently fast on > short inputs: > > - fast-hash, MIT licensed, and apparently has some use in software [1] > - MX3, CC0 license (looking around, seems controversial for a code > license, so didn't go further). [2] Seems to be a for-fun project, but > the accompanying articles are very informative on how to develop these > things. > > After wacking fast-hash around, it doesn't really resemble the > original much, and if for some reason we went as far as switching out > the mixing/final functions, it may as well be called completely > original work. I thought it best to start with something whose mixing > behavior passes SMHasher, and hopefully preserve that property. I didn't understand what you meant by the above. Did you wack around fast-hash, or who did? Who switched mixing/final functions; compared to what? The version you have in the patch matches the implementation in smhasher, did you mean that the smhasher author changed it compared to the original? In any case, +1 on the implementation you had in the patch at a quick glance. Let's also replace the partial murmurhash implementations we have in hashfn.h with this. It's a very similar algorithm, and we don't need two. -- Heikki Linnakangas Neon (https://neon.tech)
On Wed, Nov 29, 2023 at 9:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I didn't understand what you meant by the above. Did you wack around > fast-hash, or who did? I turned it into an init/accum/final style (shouldn't affect the result), and took out the input length from the calculation (will affect the result and I'll look into putting it back some other way). > Who switched mixing/final functions; compared to > what? Sorry for the confusion. I didn't change those, I was speaking hypothetically. > In any case, +1 on the implementation you had in the patch at a quick > glance. > > Let's also replace the partial murmurhash implementations we have in > hashfn.h with this. It's a very similar algorithm, and we don't need two. Thanks for taking a look! For small fixed-sized values, it's common to special-case a murmur-style finalizer regardless of the algorithm for longer inputs. Syscache combines multiple hashes for multiple keys, so it's probably worth it to avoid adding cycles there.
On Wed, Nov 29, 2023 at 8:31 PM John Naylor <johncnaylorls@gmail.com> wrote: > > Attached is a rough start with Andres's earlier ideas, to get > something concrete out there. While looking at the assembly out of curiosity, I found a couple bugs in the split API that I've fixed locally. I think the path forward is: - performance measurements with both byte-at-a-time and word-at-a-time, once I make sure they're fixed - based on the above decide which one is best for guc_name_hash - clean up hash function implementation - test with a new guc_name_compare (using what we learned from my guc_name_eq) and see how well we do with keeping dynahash vs. simplehash Separately, for string_hash: - run SMHasher and see about reincorporating length in the calculation. v5 should be a clear improvement in collision behavior over the current guc_name_hash, but we need to make sure it's at least as good as hash_bytes, and ideally not lose anything compared to standard fast_hash.
On Wed, 2023-11-29 at 20:31 +0700, John Naylor wrote: > v5-0001 puts fash-hash as-is into a new header, named in a way to > convey in-memory use e.g. hash tables. > > v5-0002 does the minimal to allow dynash to use this for string_hash, > inlined but still calling strlen. > > v5-0003 shows one way to do a incremental interface. It might be okay > for simplehash with fixed length keys, but seems awkward for strings. > > v5-0004 shows a bytewise incremental interface, with implementations > for dynahash (getting rid of strlen) and guc hash. I'm trying to follow the distinctions you're making between dynahash and simplehash -- are you saying it's easier to do incremental hashing with dynahash, and if so, why? If I understood what Andres was saying, the exposed hash state would be useful for writing a hash function like guc_name_hash(). But whether we use simplehash or dynahash is a separate question, right? Also, while the |= 0x20 is a nice trick for lowercasing, did we decide that it's better than my approach in patch 0004 here: https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.camel@j-davis.com which optimizes exact hits (most GUC names are already folded) before trying case folding? Regards, Jeff Davis
On Mon, Dec 4, 2023 at 4:16 AM Jeff Davis <pgsql@j-davis.com> wrote: > I'm trying to follow the distinctions you're making between dynahash > and simplehash -- are you saying it's easier to do incremental hashing > with dynahash, and if so, why? That's a good thing to clear up. This thread has taken simplehash as a starting point from the very beginning. It initially showed no improvement, and then we identified problems with the hashing and equality computations. The latter seem like independently commitable improvements, so I'm curious if they help on their own, even if we still need to switch to simplehash as a last step to meet your performance goals. > If I understood what Andres was saying, the exposed hash state would be > useful for writing a hash function like guc_name_hash(). From my point of view, it would at least be useful for C-strings, where we don't have the length available up front. Aside from that, we have multiple places that compute full 32-bit hashes on multiple individual values, and then combine them with various ad-hoc ways. It could be worth exploring whether an incremental interface would be better in those places on a case-by-case basis. (If Andres had something else in mind, I'll let him address that.) > But whether we > use simplehash or dynahash is a separate question, right? Right, the table implementation should treat the hash function as a black box. Think of the incremental API as lower-level building blocks for building hash functions. > Also, while the |= 0x20 is a nice trick for lowercasing, did we decide > that it's better than my approach in patch 0004 here: > > https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.camel@j-davis.com > > which optimizes exact hits (most GUC names are already folded) before > trying case folding? Note there were two aspects there: hashing and equality. I demonstrated in https://www.postgresql.org/message-id/CANWCAZbQ30O9j-bEZ_1zVCyKPpSjwbE4u19cSDDBJ%3DTYrHvPig%40mail.gmail.com ... in v4-0003 that the equality function can be optimized for already-folded names (and in fact measured almost equally) using way, way, way less code.
On Mon, 2023-12-04 at 12:12 +0700, John Naylor wrote: > That's a good thing to clear up. This thread has taken simplehash as > a > starting point from the very beginning. It initially showed no > improvement, and then we identified problems with the hashing and > equality computations. The latter seem like independently commitable > improvements, so I'm curious if they help on their own, even if we > still need to switch to simplehash as a last step to meet your > performance goals. There's already a patch to use simplehash, and the API is a bit cleaner, and there's a minor performance improvement. It seems fairly non-controversial -- should I just proceed with that patch? > > If I understood what Andres was saying, the exposed hash state > > would be > > useful for writing a hash function like guc_name_hash(). > > From my point of view, it would at least be useful for C-strings, > where we don't have the length available up front. That's good news. By the way, is there any reason that we would need hash_bytes(s, strlen(s)) == cstring_hash(s)? > > Also, while the |= 0x20 is a nice trick for lowercasing, did we > > decide > > that it's better than my approach in patch 0004 here: > > > > https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.camel@j-davis.com > > > > which optimizes exact hits (most GUC names are already folded) > > before > > trying case folding? > > Note there were two aspects there: hashing and equality. I > demonstrated in > > https://www.postgresql.org/message-id/CANWCAZbQ30O9j-bEZ_1zVCyKPpSjwbE4u19cSDDBJ%3DTYrHvPig%40mail.gmail.com > > ... in v4-0003 that the equality function can be optimized for > already-folded names (and in fact measured almost equally) using way, > way, way less code. Thinking in terms of API layers, there are two approaches: (a) make the hash and equality functions aware of the case-insensitivity, as we currently do; or (b) make it the caller's responsibility to do case folding, and the hash and equality functions are based on exact equality. Each approach has its own optimization techniques. In (a), we can use the |= 0x20 trick, and for equality do a memcmp() check first. In (b), the caller can first try lookup of the key in whatever form is provided, and only if that fails, case-fold it and try again. As a tangential point, we may eventually want to provide a more internationalized definition of "case insensitive" for GUC names. That would be slightly easier with (b) than with (a), but we can cross that bridge if and when we come to it. It seems you are moving toward (a) whereas my patches moved toward (b). I am fine with either approach but I wanted to clarify which approach we are using. In the abstract, I kind of like approach (b) because we don't need to be as special/clever with the hash functions. We would still want the faster hash for C-strings, but that's general and helps all callers. But you're right that it's more code, and that's not great. Regards, Jeff Davis
On Tue, Dec 5, 2023 at 1:57 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Mon, 2023-12-04 at 12:12 +0700, John Naylor wrote: > There's already a patch to use simplehash, and the API is a bit > cleaner, and there's a minor performance improvement. It seems fairly > non-controversial -- should I just proceed with that patch? I won't object if you want to commit that piece now, but I hesitate to call it a performance improvement on its own. - The runtime measurements I saw reported were well within the noise level. - The memory usage starts out better, but with more entries is worse. > > From my point of view, it would at least be useful for C-strings, > > where we don't have the length available up front. > > That's good news. > > By the way, is there any reason that we would need hash_bytes(s, > strlen(s)) == cstring_hash(s)? "git grep cstring_hash" found nothing, so not sure what you're asking. > Each approach has its own optimization techniques. In (a), we can use > the |= 0x20 trick, and for equality do a memcmp() check first. I will assume you are referring to semantics, but on the odd chance readers take this to mean the actual C library call, that wouldn't be an optimization, that'd be a pessimization. > As a tangential point, we may eventually want to provide a more > internationalized definition of "case insensitive" for GUC names. That > would be slightly easier with (b) than with (a), but we can cross that > bridge if and when we come to it. The risk/reward ratio seems pretty bad. > It seems you are moving toward (a) whereas my patches moved toward (b). > I am fine with either approach but I wanted to clarify which approach > we are using. I will make my case: > In the abstract, I kind of like approach (b) because we don't need to > be as special/clever with the hash functions. In the abstract, I consider (b) to be a layering violation. As a consequence, the cleverness in (b) is not confined to one or two places, but is smeared over a whole bunch of places. I find it hard to follow. Concretely, it also adds another pointer to the element struct. That's not good for a linear open-addressing array, which simplehash has. Further, remember the equality function is important as well. In v3, it was "strcmp(a,b)==0", which is a holdover from the dynahash API. One of the advantages of the simplehash API is that we can 1) use an equality function, which should be slightly cheaper than a full comparison function, and 2) we have the option to inline it. (It doesn't make sense in turn, to jump to a shared lib page and invoke an indirect function call.) Once we've done that, it's already "special", so it's not a stretch to make it do what we want to begin with. If a nicer API is important, why not use it?
On Wed, 2023-12-06 at 07:39 +0700, John Naylor wrote: > "git grep cstring_hash" found nothing, so not sure what you're > asking. Sorry, I meant string_hash(). Your v5-0002 changes the way hashing works for cstrings, and that means it's no longer equivalent to hash_bytes with strlen. That's probably fine, but someone might assume that they are equivalent. > > In the abstract, I consider (b) to be a layering violation. As a > consequence, the cleverness in (b) is not confined to one or two > places, but is smeared over a whole bunch of places. I find it hard > to > follow. OK. I am fine with (a). Regards, Jeff Davis
On Wed, Dec 6, 2023 at 11:48 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2023-12-06 at 07:39 +0700, John Naylor wrote: > > "git grep cstring_hash" found nothing, so not sure what you're > > asking. > > Sorry, I meant string_hash(). Your v5-0002 changes the way hashing > works for cstrings, and that means it's no longer equivalent to > hash_bytes with strlen. That's probably fine, but someone might assume > that they are equivalent. That's a good point. It might be best to leave string_hash where it is and remove the comment that it's the default. Then the new function (I like the name cstring_hash) can live in dynahash.c where it's obvious what "default" means.
On Wed, 2023-11-29 at 20:31 +0700, John Naylor wrote: > Attached is a rough start with Andres's earlier ideas, to get > something concrete out there. The implementation of string hash in 0004 forgot to increment 'buf'. I tested using the new hash function APIs for my search path cache, and there's a significant speedup for cases not benefiting from a86c61c9ee. It's enough that we almost don't need a86c61c9ee. So a definite +1 to the new APIs. Regards, Jeff Davis
I committed 867dd2dc87, which means my use case for a fast GUC hash table (quickly setting proconfigs) is now solved. Andres mentioned that it could still be useful to reduce overhead in a few other places: https://postgr.es/m/20231117220830.t6sb7di6h6am4ep5@awork3.anarazel.de How should we evaluate GUC hash table performance optimizations? Just microbenchmarks, or are there end-to-end tests where the costs are showing up? (As I said in another email, I think the hash function APIs justify themselves regardless of improvements to the GUC hash table.) On Wed, 2023-12-06 at 07:39 +0700, John Naylor wrote: > > There's already a patch to use simplehash, and the API is a bit > > cleaner, and there's a minor performance improvement. It seems > > fairly > > non-controversial -- should I just proceed with that patch? > > I won't object if you want to commit that piece now, but I hesitate > to > call it a performance improvement on its own. > > - The runtime measurements I saw reported were well within the noise > level. > - The memory usage starts out better, but with more entries is worse. I suppose I'll wait until there's a reason to commit it, then. Regards, Jeff Davis
On Sat, Dec 9, 2023 at 3:32 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2023-11-29 at 20:31 +0700, John Naylor wrote: > > Attached is a rough start with Andres's earlier ideas, to get > > something concrete out there. > > The implementation of string hash in 0004 forgot to increment 'buf'. Yeah, that was one of the bugs I mentioned. In v6, I fixed it so we get the right answer. 0001 pure copy of fasthash upstream 0002 keeps the originals for validation, and then re-implements them using the new incremental interfaces 0003 adds UINT64CONST. After writing this I saw that murmur64 didn't have UINT64CONST (and obviously no buildfarm member complained), so probably not needed. 0004 Assert that the original and incrementalized versions give the same answer. This requires the length to be known up front. 0005 Demo with pgstat_hash_hash_key, which currently runs 3 finalizers joined with hash_combine. Might shave a few cycles. 0006 Add bytewise interface for C strings. 0007 Use it in guc_name_hash 0008 Teach guc_name_cmp to case fold lazily I'll test these two and see if there's a detectable difference. Then each of these: 0009 Jeff's conversion to simplehash 0010 Use an inline equality function for guc name hash 0011/12 An experiment to push case-folding down inside fasthash. It's not great looking, but I'm curious if it makes a difference. 0013 Get rid of strlen in dynahash with default string hashing. I'll hold on to this and start a new thread, because it's off-topic and has some open questions. I haven't tested yet, but I want to see what CI thinks. > I tested using the new hash function APIs for my search path cache, and > there's a significant speedup for cases not benefiting from a86c61c9ee. > It's enough that we almost don't need a86c61c9ee. So a definite +1 to > the new APIs. Do you have a new test?
Attachment
- v6-0001-Vendor-fasthash.patch
- v6-0003-Add-UINT64CONST-not-sure-when-we-actually-need-th.patch
- v6-0004-Assert-that-the-incremental-fasthash-variants-giv.patch
- v6-0002-Rewrite-fasthash-functions-using-a-homegrown-incr.patch
- v6-0005-Demonstrate-fasthash32-with-pgstat_hash_hash_key.patch
- v6-0007-Use-bytewise-fasthash-in-guc_name_hash.patch
- v6-0006-Add-bytewise-interface.patch
- v6-0009-Convert-GUC-hashtable-to-use-simplehash.patch
- v6-0008-Casefold-lazily-in-guc_name_compare.patch
- v6-0010-Use-inline-equality-function-for-guc-hash.patch
- v6-0013-PoC-Get-rid-of-strlen-calls-when-using-HASH_STRIN.patch
- v6-0011-Add-abiliy-to-case-fold-with-chunk-8-byte-interfa.patch
- v6-0012-Use-chunk-interface-for-guc_name_hash.patch
On Sat, 2023-12-09 at 18:52 +0700, John Naylor wrote: > > I tested using the new hash function APIs for my search path cache, > > and > > there's a significant speedup for cases not benefiting from > > a86c61c9ee. > > It's enough that we almost don't need a86c61c9ee. So a definite +1 > > to > > the new APIs. > > Do you have a new test? Still using the same basic test here: https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com What I did is: a. add your v5 patches b. disable optimization in a86c61c9ee c. add attached patch to use new hash APIs I got a slowdown between (a) and (b), and then (c) closed the gap about halfway. It started to get close to test noise at that point -- I could get some better numbers out of it if it's helpful. Also, what I'm doing in the attached patch is using part of the key as the seed. Is that a good idea or should the seed be zero or come from somewhere else? Regards, Jeff Davis
Attachment
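For concreteness, using part of the key as the seed looks something like the sketch below for the search path cache (field and function names are assumptions based on this discussion, not necessarily the attached patch; fasthash_accum_cstring() stands for the bytewise C-string accumulator from the earlier patches):

typedef struct SearchPathCacheKey
{
    const char *searchPath;
    Oid         roleid;
} SearchPathCacheKey;

static inline uint32
spcachekey_hash(SearchPathCacheKey key)
{
    fasthash_state hs;

    /* length is unknown up front, so only the role OID seeds the state */
    fasthash_init(&hs, 0, (uint64) key.roleid);
    fasthash_accum_cstring(&hs, key.searchPath);
    return fasthash_final32(&hs);
}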
On Sun, Dec 10, 2023 at 2:18 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Sat, 2023-12-09 at 18:52 +0700, John Naylor wrote: > > > I tested using the new hash function APIs for my search path cache, > > > and > > > there's a significant speedup for cases not benefiting from > > > a86c61c9ee. > > > It's enough that we almost don't need a86c61c9ee. So a definite +1 > > > to > > > the new APIs. Interesting, thanks for testing! SearchPathCache is a better starting point than dynahash for removing strlen calls anyway -- it's more localized, uses simplehash, and we can test it with at-hand tests. > > Do you have a new test? > > Still using the same basic test here: > > https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com > > What I did is: > > a. add your v5 patches > b. disable optimization in a86c61c9ee > c. add attached patch to use new hash APIs Of course, the CF bot doesn't know this, so it crashed and burned before I had a chance to check how v6 did. I'm attaching v7 which just improves commit messages for reviewing, and gets rid of git whitespace errors. My local branch of master is still at 457428d9e99b6 from Dec 4. That's before both a86c61c9ee (Optimize SearchPathCache by saving the last entry.) and 867dd2dc87 (Cache opaque handle for GUC option to avoid repeasted lookups.). My plan was to keep testing against Dec. 4, but like you I'm not sure if there is a better GUC test to do now. > I got a slowdown between (a) and (b), and then (c) closed the gap about > halfway. It started to get close to test noise at that point -- I could > get some better numbers out of it if it's helpful. We can also try (c) with using the "chunked" interface. Also note your patch may no longer apply on top of v6 or v7. > Also, what I'm doing in the attached path is using part of the key as > the seed. Is that a good idea or should the seed be zero or come from > somewhere else? I think whether to use part of the key as a seed is a judgment call. See this part in resowner.c: /* * Most resource kinds store a pointer in 'value', and pointers are unique * all on their own. But some resources store plain integers (Files and * Buffers as of this writing), so we want to incorporate the 'kind' in * the hash too, otherwise those resources will collide a lot. But * because there are only a few resource kinds like that - and only a few * resource kinds to begin with - we don't need to work too hard to mix * 'kind' into the hash. Just add it with hash_combine(), it perturbs the * result enough for our purposes. */ #if SIZEOF_DATUM == 8 return hash_combine64(murmurhash64((uint64) value), (uint64) kind); Given these comments, I'd feel free to use the "kind" as the seed if I were writing this with fasthash. The caller-provided seed can probably be zero unless we have a good reason to, like the above, but with the incremental interface there is an issue: hs->hash = seed ^ (len * UINT64CONST(0x880355f21e6d1965)); Passing length 0 will wipe out the internal seed here, and that can't be good. 1) We could by convention pass "1" as the length for strings. That could be a macro like #define FH_UNKNOWN_LENGTH 1 ...and maybe Assert(len != 0 || seed != 0) Or 2) we could detect zero and force it to be one, but it's best if the compiler can always constant-fold that branch. Future work may invalidate that assumption.
Attachment
- v7-0011-Add-abiliy-to-case-fold-with-chunk-8-byte-interfa.patch
- v7-0012-Use-chunk-interface-for-guc_name_hash.patch
- v7-0013-PoC-Get-rid-of-strlen-calls-when-using-HASH_STRIN.patch
- v7-0009-Convert-GUC-hashtable-to-use-simplehash.patch
- v7-0010-Use-inline-equality-function-for-guc-hash.patch
- v7-0006-Add-bytewise-interface.patch
- v7-0005-Demonstrate-fasthash32-with-pgstat_hash_hash_key.patch
- v7-0008-Casefold-lazily-in-guc_name_compare.patch
- v7-0007-Use-bytewise-fasthash-in-guc_name_hash.patch
- v7-0004-Assert-that-the-incremental-fasthash-variants-giv.patch
- v7-0003-Add-UINT64CONST-not-sure-when-we-actually-need-th.patch
- v7-0002-Rewrite-fasthash-functions-using-a-homegrown-incr.patch
- v7-0001-Vendor-fasthash.patch
I wrote: > On Sun, Dec 10, 2023 at 2:18 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > > On Sat, 2023-12-09 at 18:52 +0700, John Naylor wrote: > > > > I tested using the new hash function APIs for my search path cache, > > > > and > > > > there's a significant speedup for cases not benefiting from > > > > a86c61c9ee. > > > > It's enough that we almost don't need a86c61c9ee. So a definite +1 > > > > to > > > > the new APIs. > > Interesting, thanks for testing! SearchPathCache is a better starting > point than dynahash for removing strlen calls anyway -- it's more > localized, uses simplehash, and we can test it with at-hand tests. Since I had to fix a misalignment in the original to keep ubsan from crashing CI anyway (v8-0005), I thought I'd take the experiment with search path cache and put the temporary validation of the hash function output in there (v8-0004). I had to finagle a bit to get the bytewise interface to give the same answer as the original, but that's okay: The bytewise interface is intended for when we don't know the length up front (and therefore the internal seed can't be tweaked with the length), but it's nice to make sure nothing's broken. There is also a chunkwise version for search path cache. That might be a little faster. Perf testing can be done as is, because I put the validation in assert builds only. I've left out the GUC stuff for now, just want to get CI green again.
Attachment
On Sun, Dec 10, 2023 at 2:18 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Sat, 2023-12-09 at 18:52 +0700, John Naylor wrote: > > > I tested using the new hash function APIs for my search path cache, > > > and > > > there's a significant speedup for cases not benefiting from > > > a86c61c9ee. > > > It's enough that we almost don't need a86c61c9ee. So a definite +1 > > > to > > > the new APIs. > > > > Do you have a new test? > > Still using the same basic test here: > > https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com > > What I did is: > > a. add your v5 patches > b. disable optimization in a86c61c9ee > c. add attached patch to use new hash APIs > > I got a slowdown between (a) and (b), and then (c) closed the gap about > halfway. It started to get close to test noise at that point -- I could > get some better numbers out of it if it's helpful. I tried my variant of the same test [1] (but only 20 seconds per run), which uses pgbench to take the average of a few dozen runs, and doesn't use table I/O (when doing that, it's best to pre-warm the buffers to reduce noise). pgbench -n -T 20 -f bench.sql -M prepared (done three times and take the median, with turbo off) * master at 457428d9e99b6b from Dec 4: latency average = 571.413 ms * v8 (bytewise hash): latency average = 588.942 ms This regression is a bit surprising, since there is no strlen call, and it uses roleid as a seed without a round of mixing (not sure if we should do that, but just trying to verify results). * v8 with chunked interface: latency average = 555.688 ms This starts to improve things for me. * v8 with chunked, and return lower 32 bits of full 64-bit hash: latency average = 556.324 ms This is within the noise level. There doesn't seem to be much downside of using a couple cycles for fasthash's 32-bit reduction. * revert back to master from Dec 4 and then cherry pick a86c61c9ee (save last entry of SearchPathCache) latency average = 545.747 ms So chunked incremental hashing gets within ~2% of that, which is nice. It seems we should use that when removing strlen, when convenient. Updated next steps: * Investigate whether/how to incorporate final length into the calculation when we don't have the length up front. * Add some desperately needed explanatory comments. * Use this in some existing cases where it makes sense. * Get back to GUC hash and dynahash. [1] https://www.postgresql.org/message-id/CANWCAZY7Cr-GdUhrCLoR4%2BJGLChTb0pQxq9ZPi1KTLs%2B_KDFqg%40mail.gmail.com
I wrote: > > * v8 with chunked interface: > latency average = 555.688 ms > > This starts to improve things for me. > > * v8 with chunked, and return lower 32 bits of full 64-bit hash: > latency average = 556.324 ms > > This is within the noise level. There doesn't seem to be much downside > of using a couple cycles for fasthash's 32-bit reduction. > > * revert back to master from Dec 4 and then cherry pick a86c61c9ee > (save last entry of SearchPathCache) > latency average = 545.747 ms > > So chunked incremental hashing gets within ~2% of that, which is nice. > It seems we should use that when removing strlen, when convenient. > > Updated next steps: > * Investigate whether/how to incorporate final length into the > calculation when we don't have the length up front. > * Add some desperately needed explanatory comments. > * Use this in some existing cases where it makes sense. > * Get back to GUC hash and dynahash. For #1 here, I cloned SMHasher and was dismayed at the complete lack of documentation, but after some poking around, found how to run the tests, using the 32-bit hash to save time. It turns out that the input length is important. I've attached two files of results -- "nolen" means stop using the initial length to tweak the internal seed. As you can see, there are 8 failures. "pluslen" means I then incorporated the length within the finalizer. This *does* pass SMHasher, so that's good. (of course this way can't produce the same hash as when we know the length up front, but that's not important). The attached shows how that would work, further whacking around and testing with Jeff's prototype for the search path cache hash table. I'll work on code comments and get it polished.
Attachment
- fasthash32_nolen.txt
- fasthash32_pluslen.txt
- v9-0004-Assert-that-incremental-fasthash-variants-give-th.patch
- v9-0006-Add-optional-tweak-to-finalizer.patch
- v9-0002-Rewrite-fasthash-functions-using-a-homegrown-incr.patch
- v9-0003-Fix-alignment-issue-in-the-original-fastash.patch
- v9-0005-Remove-ULL.patch
- v9-0001-Vendor-fasthash.patch
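To make the "pluslen" variant above concrete, a rough sketch of a finalizer that accepts a tweak, so the accumulated length can be folded in at the end when it wasn't known up front. This only illustrates the shape, not the v9 code itself; the shift/multiply constants are the ones from the published fasthash mix step:

#include <stdint.h>

static inline uint64_t
example_final64(uint64_t hash, uint64_t tweak)
{
    hash ^= tweak;              /* e.g. the number of bytes accumulated */
    hash ^= hash >> 23;
    hash *= UINT64_C(0x2127599bf4325c37);
    hash ^= hash >> 47;
    return hash;
}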
I wrote: > Updated next steps: > * Add some desperately needed explanatory comments. There is a draft of this in v10-0001. I also removed the validation scaffolding and ran pgindent. This could use some review and/or bikeshedding, in particular on the name hashfn_unstable.h. I also considered *_volatile.h or *_inmemory.h, but nothing stands out as more clear. > * Use this in some existing cases where it makes sense. For now just two: v10-0002 is Jeff's change to the search path cache, but with the chunked interface that I found to be faster. v10-0003 is a copy of something buried in an earlier version: use in pgstat. Looks nicer, but not yet tested.
Attachment
On Mon, 2023-12-18 at 13:39 +0700, John Naylor wrote: > For now just two: > v10-0002 is Jeff's change to the search path cache, but with the > chunked interface that I found to be faster. Did you consider specializing for the case of an aligned pointer? If it's a string (c string or byte string) it's almost always going to be aligned, right? I hacked up a patch (attached). I lost track of which benchmark we're using to test the performance, but when I test in a loop it seems substantially faster. It reads past the NUL byte, but only to the next alignment boundary, which I think is OK (though I think I'd need to fix the patch for when maxalign < 8). Regards, Jeff Davis
Attachment
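The idea reads roughly like the sketch below (illustration only, not the attached patch): with an 8-byte-aligned string we can load whole 64-bit words, and the load containing the terminator stops at the next alignment boundary, so it can never cross into an unmapped page (even though reading past the terminator is outside what ISO C strictly promises).

#include <stdint.h>
#include <string.h>

/* returns the offset of the 8-byte word that contains the NUL terminator */
static size_t
example_scan_aligned(const char *str)      /* assumed 8-byte aligned */
{
    const char *p = str;
    uint64_t    chunk;

    for (;;)
    {
        memcpy(&chunk, p, sizeof(chunk));  /* one aligned 8-byte load */

        /* classic zero-byte test: nonzero iff some byte of chunk is zero */
        if ((chunk - UINT64_C(0x0101010101010101)) &
            ~chunk & UINT64_C(0x8080808080808080))
            break;

        /* ...accumulate the full chunk into the hash state here... */
        p += sizeof(chunk);
    }
    return p - str;
}

What to do with the final, partial word is the interesting part; the chunk-size discussion below is about exactly that.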
On Tue, Dec 19, 2023 at 2:32 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Mon, 2023-12-18 at 13:39 +0700, John Naylor wrote: > > For now just two: > > v10-0002 is Jeff's change to the search path cache, but with the > > chunked interface that I found to be faster. > > Did you consider specializing for the case of an aligned pointer? If > it's a string (c string or byte string) it's almost always going to be > aligned, right? That wasn't the next place I thought to look (that would be the strcmp call), but something like this could be worthwhile. If we went this far, I'd like to get more use out of it than one call site. I think a few other places have as their hash key a string along with other values, so maybe we can pass an initialized hash state for strings separately from combining in the other values. Dynahash will still need to deal with truncation, so would need duplicate coding, but I'm guessing with that truncation check it makes an optimization like you propose even more worthwhile. > I hacked up a patch (attached). I lost track of which benchmark we're > using to test the performance, but when I test in a loop it seems > substantially faster. That's interesting. Note that there is no need for a new fasthash_accum64(), since we can do fasthash_accum(&hs, buf, FH_SIZEOF_ACCUM); ...and the compiler should elide the switch statement. > It reads past the NUL byte, but only to the next alignment boundary, > which I think is OK (though I think I'd need to fix the patch for when > maxalign < 8). Seems like it, on both counts.
On Tue, 2023-12-19 at 16:23 +0700, John Naylor wrote: > That wasn't the next place I thought to look (that would be the > strcmp > call), but something like this could be worthwhile. The reason I looked here is that the inner while statement (to find the chunk size) looked out of place and possibly slow, and there's a bitwise trick we can use instead. My original test case is a bit too "macro" of a benchmark at this point, so I'm not sure it's a good guide for these individual micro-optimizations. Regards, Jeff Davis
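For anyone curious, the kind of bitwise trick in question looks like this (a sketch, not the posted patch): once the word containing the terminator has been found, the number of bytes before the NUL falls out of the same zero-byte mask, with no per-byte loop.

#include <stdint.h>

/* "chunk" must be known to contain a zero byte; little-endian layout assumed */
static inline int
example_bytes_before_nul(uint64_t chunk)
{
    uint64_t    mask = (chunk - UINT64_C(0x0101010101010101)) &
        ~chunk & UINT64_C(0x8080808080808080);

    /*
     * The lowest set bit is the high bit of the first zero byte, so its bit
     * position divided by 8 is that byte's index.  (In tree code a portable
     * wrapper such as pg_rightmost_one_pos64() would stand in for the
     * builtin.)
     */
    return __builtin_ctzll(mask) / 8;
}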
On Wed, Dec 20, 2023 at 3:23 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2023-12-19 at 16:23 +0700, John Naylor wrote: > > That wasn't the next place I thought to look (that would be the > > strcmp > > call), but something like this could be worthwhile. > > The reason I looked here is that the inner while statement (to find the > chunk size) looked out of place and possibly slow, and there's a > bitwise trick we can use instead. There are other bit tricks we can use. In v11-0005, just for fun, I translated a couple more into C from https://github.com/openbsd/src/blob/master/lib/libc/arch/amd64/string/strlen.S
Attachment
On Wed, Dec 20, 2023 at 1:48 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 3:23 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > > The reason I looked here is that the inner while statement (to find the > > chunk size) looked out of place and possibly slow, and there's a > > bitwise trick we can use instead. > > There are other bit tricks we can use. In v11-0005 Just for fun, I > translated a couple more into C from > > https://github.com/openbsd/src/blob/master/lib/libc/arch/amd64/string/strlen.S I wanted to see if this gets us anything so ran a couple microbenchmarks. 0001-0003 are same as earlier 0004 takes Jeff's idea and adds in an optimization from NetBSD's strlen (I said OpenBSD earlier, but it goes back further). I added stub code to simulate big-endian when requested at compile time, but a later patch removes it. Since it benched well, I made the extra effort to generalize it for other callers. After adding to the hash state, it returns the length so the caller can pass it to the finalizer. 0005 is the benchmark (not for commit) -- I took the parser keyword list and added enough padding to make every string aligned when the whole thing is copied to an alloc'd area. Each of the bench_*.sql files named below are just running the similarly-named function, all with the same argument, e.g. "select * from bench_pgstat_hash_fh(100000);", so not attached. Strings: -- strlen + hash_bytes pgbench -n -T 20 -f bench_hash_bytes.sql -M prepared | grep latency latency average = 1036.732 ms -- word-at-a-time hashing, with bytewise lookahead pgbench -n -T 20 -f bench_cstr_unaligned.sql -M prepared | grep latency latency average = 664.632 ms -- word-at-a-time for both hashing and lookahead (Jeff's aligned coding plus a technique from NetBSD strlen) pgbench -n -T 20 -f bench_cstr_aligned.sql -M prepared | grep latency latency average = 436.701 ms So, the fully optimized aligned case is worth it if it's convenient. 0006 adds a byteswap for big-endian so we can reuse little endian coding for the lookahead. 0007 - I also wanted to put numbers to 0003 (pgstat hash). While the motivation for that was cleanup, I had a hunch it would shave cycles and take up less binary space. It does on both accounts: -- 3x murmur + hash_combine pgbench -n -T 20 -f bench_pgstat_orig.sql -M prepared | grep latency latency average = 333.540 ms -- fasthash32 (simple call, no state setup and final needed for a single value) pgbench -n -T 20 -f bench_pgstat_fh.sql -M prepared | grep latency latency average = 277.591 ms 0008 - We can optimize the tail load when it's 4 bytes -- to save loads, shifts, and OR's. My compiler can't figure this out for the pgstat hash, with its fixed 4-byte tail. It's pretty simple and should help other cases: pgbench -n -T 20 -f bench_pgstat_fh.sql -M prepared | grep latency latency average = 226.113 ms
Attachment
- v11-0005-Add-benchmark-for-hashing-C-strings.patch
- v11-0004-Add-optimized-string-hashing-to-hashfn_unstable..patch
- v11-0007-Add-bench-for-pgstat.patch
- v11-0008-Optimize-loading-tail-when-4-bytes.patch
- v11-0006-Try-simply-byte-swapping-on-BE-machines-and-then.patch
- v11-0001-Add-inlineable-incremental-hash-functions-for-in.patch
- v11-0003-Use-fasthash-for-the-search-path-cache.patch
- v11-0002-Use-fasthash-for-pgstat_hash_hash_key.patch
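The 0008 idea is essentially the following (a sketch based on the description above, not the patch itself): when the remaining input is known to be exactly four bytes, load it as one uint32 instead of assembling it from four byte loads, shifts, and ORs.

#include <stdint.h>
#include <string.h>

static inline uint64_t
example_load_tail4(const char *k)
{
    uint32_t    lower;

    memcpy(&lower, k, sizeof(lower));  /* a single 4-byte load */
    return lower;                      /* zero-extends into the 64-bit accumulator */
}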
On Tue, Dec 26, 2023 at 4:01 PM John Naylor <johncnaylorls@gmail.com> wrote: > > 0001-0003 are same as earlier > 0004 takes Jeff's idea and adds in an optimization from NetBSD's > strlen (I said OpenBSD earlier, but it goes back further). I added > stub code to simulate big-endian when requested at compile time, but a > later patch removes it. Since it benched well, I made the extra effort > to generalize it for other callers. After adding to the hash state, it > returns the length so the caller can pass it to the finalizer. > 0005 is the benchmark (not for commit) -- I took the parser keyword > list and added enough padding to make every string aligned when the > whole thing is copied to an alloc'd area. > > Each of the bench_*.sql files named below are just running the > similarly-named function, all with the same argument, e.g. "select * > from bench_pgstat_hash_fh(100000);", so not attached. > > Strings: > > -- strlen + hash_bytes > pgbench -n -T 20 -f bench_hash_bytes.sql -M prepared | grep latency > latency average = 1036.732 ms > > -- word-at-a-time hashing, with bytewise lookahead > pgbench -n -T 20 -f bench_cstr_unaligned.sql -M prepared | grep latency > latency average = 664.632 ms > > -- word-at-a-time for both hashing and lookahead (Jeff's aligned > coding plus a technique from NetBSD strlen) > pgbench -n -T 20 -f bench_cstr_aligned.sql -M prepared | grep latency > latency average = 436.701 ms > > So, the fully optimized aligned case is worth it if it's convenient. > > 0006 adds a byteswap for big-endian so we can reuse little endian > coding for the lookahead. > > 0007 - I also wanted to put numbers to 0003 (pgstat hash). While the > motivation for that was cleanup, I had a hunch it would shave cycles > and take up less binary space. It does on both accounts: > > -- 3x murmur + hash_combine > pgbench -n -T 20 -f bench_pgstat_orig.sql -M prepared | grep latency > latency average = 333.540 ms > > -- fasthash32 (simple call, no state setup and final needed for a single value) > pgbench -n -T 20 -f bench_pgstat_fh.sql -M prepared | grep latency > latency average = 277.591 ms > > 0008 - We can optimize the tail load when it's 4 bytes -- to save > loads, shifts, and OR's. My compiler can't figure this out for the > pgstat hash, with its fixed 4-byte tail. It's pretty simple and should > help other cases: > > pgbench -n -T 20 -f bench_pgstat_fh.sql -M prepared | grep latency > latency average = 226.113 ms --- /dev/null +++ b/contrib/bench_hash/bench_hash.c @@ -0,0 +1,103 @@ +/*------------------------------------------------------------------------- + * + * bench_hash.c + * + * Copyright (c) 2023, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/test/modules/bench_hash/bench_hash.c + * + *------------------------------------------------------------------------- + */ you added this module to contrib module (root/contrib), your intention (i guess) is to add in root/src/test/modules. later I saw "0005 is the benchmark (not for commit)". --- /dev/null +++ b/src/include/common/hashfn_unstable.h @@ -0,0 +1,213 @@ +/* +Building blocks for creating fast inlineable hash functions. The +unstable designation is in contrast to hashfn.h, which cannot break +compatibility because hashes can be writen to disk and so must produce +the same hashes between versions. + + * + * Portions Copyright (c) 2018-2023, PostgreSQL Global Development Group + * + * src/include/common/hashfn_unstable.c + */ + here should be "src/include/common/hashfn_unstable.h". 
typo: `writen` In pgbench, I use --no-vacuum --time=20 -M prepared My local computer is slow, but here are the test results: select * from bench_cstring_hash_aligned(100000); 7318.893 ms select * from bench_cstring_hash_unaligned(100000); 10383.173 ms select * from bench_pgstat_hash(100000); 4474.989 ms select * from bench_pgstat_hash_fh(100000); 9192.245 ms select * from bench_string_hash(100000); 2048.008 ms
On Tue, Jan 2, 2024 at 6:56 AM jian he <jian.universality@gmail.com> wrote: > > My local computer is slow. but here is the test results: > > select * from bench_cstring_hash_aligned(100000); 7318.893 ms > select * from bench_cstring_hash_unaligned(100000); 10383.173 ms > select * from bench_pgstat_hash(100000); 4474.989 ms > select * from bench_pgstat_hash_fh(100000); 9192.245 ms > select * from bench_string_hash(100000); 2048.008 ms This presents a 2x to 5x slowdown, so I'm skeptical this is typical -- what kind of platform is it? For starters, what CPU and compiler?
On Wed, Jan 3, 2024 at 10:12 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 2, 2024 at 6:56 AM jian he <jian.universality@gmail.com> wrote: > > > > My local computer is slow. but here is the test results: > > > > select * from bench_cstring_hash_aligned(100000); 7318.893 ms > > select * from bench_cstring_hash_unaligned(100000); 10383.173 ms > > select * from bench_pgstat_hash(100000); 4474.989 ms > > select * from bench_pgstat_hash_fh(100000); 9192.245 ms > > select * from bench_string_hash(100000); 2048.008 ms > > This presents a 2x to 5x slowdown, so I'm skeptical this is typical -- > what kind of platform is. For starters, what CPU and compiler? I still cannot git apply your patch cleanly. in http://cfbot.cputube.org/ i cannot find your patch. ( so, it might be that I test based on incomplete information). but only hashfn_unstable.h influences bench_hash/bench_hash.c. so I attached the whole patch that I had git applied, that is the changes i applied for the following tests. how I test using pgbench: pgbench --no-vacuum --time=20 --file /home/jian/tmp/bench_cstring_hash_aligned.sql -M prepared | grep latency The following is tested with another machine, also listed machine spec below. I tested 3 times, the results is very similar as following: select * from bench_cstring_hash_aligned(100000); 4705.686 ms select * from bench_cstring_hash_unaligned(100000); 6835.753 ms select * from bench_pgstat_hash(100000); 2678.978 ms select * from bench_pgstat_hash_fh(100000); 6199.017 ms select * from bench_string_hash(100000); 847.699 ms src6=# select version(); version -------------------------------------------------------------------- PostgreSQL 17devel on x86_64-linux, compiled by gcc-11.4.0, 64-bit (1 row) jian@jian:~/Desktop/pg_src/src6/postgres$ gcc --version gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
lscpu: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 46 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 20 On-line CPU(s) list: 0-19 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i5-14600K CPU family: 6 Model: 183 Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 1 Stepping: 1 CPU max MHz: 5300.0000 CPU min MHz: 800.0000 BogoMIPS: 6988.80 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp l m constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vm x smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetc h cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hw p_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_l br ibt flush_l1d arch_capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 544 KiB (14 instances) L1i: 704 KiB (14 instances) L2: 20 MiB (8 instances) L3: 24 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-19 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Not affected Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Srbds: Not affected Tsx async abort: Not affected jian@jian:~/Desktop/pg_src/src6/postgres$ git log commit bbbf8cd54a05ad5c92e79c96133f219e80fad77c (HEAD -> master) Author: jian he <jian.universality@gmail.com> Date: Thu Jan 4 10:32:39 2024 +0800 bench_hash contrib module commit c5385929593dd8499cfb5d85ac322e8ee1819fd4 Author: Peter Eisentraut <peter@eisentraut.org> Date: Fri Dec 29 18:01:53 2023 +0100 Make all Perl warnings fatal
Attachment
On Thu, Jan 4, 2024 at 10:01 AM jian he <jian.universality@gmail.com> wrote: > > I still cannot git apply your patch cleanly. in I don't know why you're using that -- the git apply man page even says "Use git-am(1) to create commits from patches generated by git-format-patch(1) and/or received by email." Or, if that fails, use "patch". > http://cfbot.cputube.org/ i cannot find your patch. > ( so, it might be that I test based on incomplete information). > but only hashfn_unstable.h influences bench_hash/bench_hash.c. > > so I attached the whole patch that I had git applied, that is the > changes i applied for the following tests. Well, aside from the added text-editor detritus, it looks like this has everything except v11-0008, without which I still get improvement for the pgstat hash. > Model name: Intel(R) Core(TM) i5-14600K > The following is tested with another machine, also listed machine spec below. > I tested 3 times, the results is very similar as following: > select * from bench_cstring_hash_aligned(100000); 4705.686 ms > select * from bench_cstring_hash_unaligned(100000); 6835.753 ms > select * from bench_pgstat_hash(100000); 2678.978 ms > select * from bench_pgstat_hash_fh(100000); 6199.017 ms > select * from bench_string_hash(100000); 847.699 ms I was fully prepared to believe something like 32-bit Arm would have difficulty with 64-bit shifts/multiplies etc., but this makes no sense at all. In this test, on my machine, HEAD's pgstat_hash is 3x faster than HEAD's "strlen + hash_bytes", but for you it's 3x slower. To improve reproducibility, I've added the .sql files and a bench script to v13. I invite you to run bench_hash.sh and see if that changes anything. v13 also - adds an assert that aligned and unaligned C string calculations give the same result - properly mixes roleid in the namespace hash, since it's now convenient to do so (0005 is an alternate method) - removes the broken makefile from the benchmark (not for commit anyway)
Attachment
- v13-0005-WIP-a-safer-way-to-accumulate-a-single-struct-me.patch
- v13-0006-Add-benchmarks-for-hashing.patch
- v13-0004-Use-fasthash-for-the-search-path-cache.patch
- v13-0002-Use-fasthash-for-pgstat_hash_hash_key.patch
- v13-0003-Add-optimized-string-hashing-to-hashfn_unstable..patch
- v13-0001-Add-inlineable-incremental-hash-functions-for-in.patch
On Fri, Jan 5, 2024 at 6:54 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Jan 4, 2024 at 10:01 AM jian he <jian.universality@gmail.com> wrote: > > > > I still cannot git apply your patch cleanly. in > > I don't know why you're using that -- the git apply man page even says > > "Use git-am(1) to create commits from patches generated by > git-format-patch(1) and/or received by email." > > Or, if that fails, use "patch". > > > http://cfbot.cputube.org/ i cannot find your patch. > > ( so, it might be that I test based on incomplete information). > > but only hashfn_unstable.h influences bench_hash/bench_hash.c. > > > > so I attached the whole patch that I had git applied, that is the > > changes i applied for the following tests. > > Well, aside from the added text-editor detritus, it looks like this > has everything except v11-0008, without which I still get improvement > for the pgstat hash. > > > Model name: Intel(R) Core(TM) i5-14600K > > > The following is tested with another machine, also listed machine spec below. > > I tested 3 times, the results is very similar as following: > > select * from bench_cstring_hash_aligned(100000); 4705.686 ms > > select * from bench_cstring_hash_unaligned(100000); 6835.753 ms > > select * from bench_pgstat_hash(100000); 2678.978 ms > > select * from bench_pgstat_hash_fh(100000); 6199.017 ms > > select * from bench_string_hash(100000); 847.699 ms > > I was fully prepared to believe something like 32-bit Arm would have > difficulty with 64-bit shifts/multiplies etc., but this makes no sense > at all. In this test, on my machine, HEAD's pgstat_hash is 3x faster > than HEAD's "strlen + hash_bytes", but for you it's 3x slower. To > improve reproducibility, I've added the .sql files and a bench script > to v13. I invite you to run bench_hash.sh and see if that changes > anything. git apply has a verbose option. also personally I based on vscode editor, the color to view the changes. jian@jian:~/Desktop/pg_src/src4/postgres$ git apply $PATCHES/v13-0006-Add-benchmarks-for-hashing.patch /home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:81: indent with spaces. if (/^PG_KEYWORD\("(\w+)"/) /home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:82: indent with spaces. { /home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:87: indent with spaces. } /home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:89: trailing whitespace. /home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:92: trailing whitespace. warning: squelched 11 whitespace errors warning: 16 lines add whitespace errors. 
jian@jian:~/Desktop/pg_src/src4/postgres$ bash runbench.sh select * from bench_string_hash(100000); latency average = 875.482 ms select * from bench_cstring_hash_unaligned(100000); latency average = 6539.231 ms select * from bench_cstring_hash_aligned(100000); latency average = 4401.278 ms select * from bench_pgstat_hash(100000); latency average = 2679.732 ms select * from bench_pgstat_hash_fh(100000); latency average = 5711.012 ms jian@jian:~/Desktop/pg_src/src4/postgres$ bash runbench.sh select * from bench_string_hash(100000); latency average = 874.261 ms select * from bench_cstring_hash_unaligned(100000); latency average = 6538.874 ms select * from bench_cstring_hash_aligned(100000); latency average = 4400.546 ms select * from bench_pgstat_hash(100000); latency average = 2682.013 ms select * from bench_pgstat_hash_fh(100000); latency average = 5709.815 ms meson: meson setup ${BUILD} \ -Dprefix=${PG_PREFIX} \ -Dpgport=5459 \ -Dplperl=enabled \ -Dplpython=enabled \ -Dssl=openssl \ -Dldap=enabled \ -Dlibxml=enabled \ -Dlibxslt=enabled \ -Duuid=e2fs \ -Dzstd=enabled \ -Dlz4=enabled \ -Dsystemd=enabled \ -Dcassert=true \ -Db_coverage=true \ -Dicu=enabled \ -Dbuildtype=debug \ -Dwerror=true \ -Dc_args='-Wunused-variable -Wuninitialized -Werror=maybe-uninitialized -Wreturn-type -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES -DREALLOCATE_BITMAPSETS -DRAW_EXPRESSION_COVERAGE_TEST -fno-omit-frame-pointer' \ -Ddocs_pdf=disabled \ -Dllvm=disabled \ -Ddocs_pdf=disabled
On Fri, Jan 5, 2024 at 6:58 PM jian he <jian.universality@gmail.com> wrote: > -Dcassert=true \ > -Dbuildtype=debug \ These probably don't matter much for this test, but these should be off for any performance testing. > -DWRITE_READ_PARSE_PLAN_TREES > -DCOPY_PARSE_PLAN_TREES > -DREALLOCATE_BITMAPSETS > -DRAW_EXPRESSION_COVERAGE_TEST I'd guess it was one of these, which should likewise be off as well.
On Sat, Jan 6, 2024 at 9:04 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 6:58 PM jian he <jian.universality@gmail.com> wrote: > > -Dcassert=true \ > > > -Dbuildtype=debug \ > > These probably don't matter much for this test, but these should be > off for any performance testing. > > > -DWRITE_READ_PARSE_PLAN_TREES > > -DCOPY_PARSE_PLAN_TREES > > -DREALLOCATE_BITMAPSETS > > -DRAW_EXPRESSION_COVERAGE_TEST > > I'd guess it was was of these, which should likewise be off as well. Thanks for pointing it out. meson setup ${BUILD} \ -Dprefix=${PG_PREFIX} \ -Dpgport=5459 \ -Dplperl=enabled \ -Dplpython=enabled \ -Dssl=openssl \ -Dldap=enabled \ -Dlibxml=enabled \ -Dlibxslt=enabled \ -Duuid=e2fs \ -Dzstd=enabled \ -Dlz4=enabled \ -Dsystemd=enabled \ -Dicu=enabled \ -Dbuildtype=release \ -Ddocs_pdf=disabled \ -Dllvm=disabled \ -Ddocs_pdf=disabled now the results: jian@jian:~/Desktop/pg_src/src4/postgres$ bash /home/jian/Desktop/pg_src/src4/postgres/runbench.sh select * from bench_string_hash(100000); latency average = 145.021 ms select * from bench_cstring_hash_unaligned(100000); latency average = 100.829 ms select * from bench_cstring_hash_aligned(100000); latency average = 100.606 ms select * from bench_pgstat_hash(100000); latency average = 96.140 ms select * from bench_pgstat_hash_fh(100000); latency average = 62.784 ms jian@jian:~/Desktop/pg_src/src4/postgres$ bash /home/jian/Desktop/pg_src/src4/postgres/runbench.sh select * from bench_string_hash(100000); latency average = 147.782 ms select * from bench_cstring_hash_unaligned(100000); latency average = 101.179 ms select * from bench_cstring_hash_aligned(100000); latency average = 101.219 ms select * from bench_pgstat_hash(100000); latency average = 96.357 ms select * from bench_pgstat_hash_fh(100000); latency average = 62.902 ms
On Sat, Jan 6, 2024 at 9:01 AM jian he <jian.universality@gmail.com> wrote: > > latency average = 147.782 ms > select * from bench_cstring_hash_unaligned(100000); > latency average = 101.179 ms > select * from bench_cstring_hash_aligned(100000); > latency average = 101.219 ms Thanks for testing again! This looks closer to my results. It doesn't show improvement for the aligned case, but it's not worse, either. There is still some polishing to be done, mostly on comments/examples, but I think it's mostly there. I'll return to it by next week.
Hi John, On Mon, Jan 8, 2024 at 10:44 AM John Naylor <johncnaylorls@gmail.com> wrote: > > On Sat, Jan 6, 2024 at 9:01 AM jian he <jian.universality@gmail.com> wrote: > > > > latency average = 147.782 ms > > select * from bench_cstring_hash_unaligned(100000); > > latency average = 101.179 ms > > select * from bench_cstring_hash_aligned(100000); > > latency average = 101.219 ms > > Thanks for testing again! This looks closer to my results. It doesn't > show improvement for the aligned case, but it's not worse, either. > > There is still some polishing to be done, mostly on comments/examples, > but I think it's mostly there. I'll return to it by next week. > > + * Portions Copyright (c) 2018-2023, PostgreSQL Global Development Group A kind reminder, it's already 2024 :) I'm also curious why the 2018, is there any convention for that? -- Regards Junwang Zhao
On Mon, Jan 8, 2024 at 2:24 PM Junwang Zhao <zhjwpku@gmail.com> wrote: > > + * Portions Copyright (c) 2018-2023, PostgreSQL Global Development Group > > A kind reminder, it's already 2024 :) > > I'm also curious why the 2018, is there any convention for that? The convention I followed was "blind copy-paste", but the first year is supposed to be when the file entered the repo. Thanks, will fix.
I spent some time rewriting the comments and making a couple of other cosmetic changes, and squashed the result into two patches: the second one has the optimized string hashing. They each still have just one demo use case. It looks pretty close to committable, but I'll leave this up for a few days in case anyone wants to have another look. After this first step is out of the way, we can look into using this more widely, including dynahash and the GUC hash.
Attachment
On 17/01/2024 09:15, John Naylor wrote: > /* > * hashfn_unstable.h > * > * Building blocks for creating fast inlineable hash functions. The > * unstable designation is in contrast to hashfn.h, which cannot break > * compatibility because hashes can be written to disk and so must produce > * the same hashes between versions. > * > * The functions in this file are not guaranteed to be stable between > * versions, and may differ by hardware platform. These paragraphs sound a bit awkward. It kind of buries the lede, the "these functions are not guaranteed to be stable" part, to the bottom. Maybe something like: " Building blocks for creating fast inlineable hash functions. The functions in this file are not guaranteed to be stable between versions, and may differ by hardware platform. Hence they must not be used in indexes or other on-disk structures. See hashfn.h if you need stability. " typo: licencse Other than that, LGTM. -- Heikki Linnakangas Neon (https://neon.tech)
On Wed, Jan 17, 2024 at 9:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Maybe something like: > > " > Building blocks for creating fast inlineable hash functions. The > functions in this file are not guaranteed to be stable between versions, > and may differ by hardware platform. Hence they must not be used in > indexes or other on-disk structures. See hashfn.h if you need stability. > " > > typo: licencse > > Other than that, LGTM. Pushed that way, thanks! After fixing another typo in big endian builds, an s390x member reported green, so I think that aspect is working now. I'll come back to follow-up topics shortly.
On 19/01/2024 09:27, John Naylor wrote: > Pushed that way, thanks! After fixing another typo in big endian > builds, an s390x member reported green, so I think that aspect is > working now. I'll come back to follow-up topics shortly. Thanks! I started to look at how to use this, and I have some questions. I'd like to replace this murmurhash ussage in resowner.c with this: > /* > * Most resource kinds store a pointer in 'value', and pointers are unique > * all on their own. But some resources store plain integers (Files and > * Buffers as of this writing), so we want to incorporate the 'kind' in > * the hash too, otherwise those resources will collide a lot. But > * because there are only a few resource kinds like that - and only a few > * resource kinds to begin with - we don't need to work too hard to mix > * 'kind' into the hash. Just add it with hash_combine(), it perturbs the > * result enough for our purposes. > */ > #if SIZEOF_DATUM == 8 > return hash_combine64(murmurhash64((uint64) value), (uint64) kind); > #else > return hash_combine(murmurhash32((uint32) value), (uint32) kind); > #endif The straightforward replacement would be: fasthash_state hs; fasthash_init(&hs, sizeof(Datum), 0); fasthash_accum(&hs, (char *) &kind, sizeof(ResourceOwnerDesc *)); fasthash_accum(&hs, (char *) &value, sizeof(Datum)); return fasthash_final32(&hs, 0); But I wonder if it would be OK to abuse the 'seed' and 'tweak' parameters to the init and final functions instead, like this: fasthash_state hs; fasthash_init(&hs, sizeof(Datum), (uint64) kind); return fasthash_final32(&hs, (uint64) value); I couldn't find any guidance on what properties the 'seed' and 'tweak' have, compared to just accumulating the values with accum. Anyone know? -- Heikki Linnakangas Neon (https://neon.tech)
On Fri, 2024-01-19 at 14:27 +0700, John Naylor wrote: > Pushed that way, thanks! Thank you. One post-commit question on 0aba255440: why do haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How does byteswapping affect whether a zero byte exists or not? Regards, Jeff Davis
On Fri, 2024-01-19 at 13:38 -0800, Jeff Davis wrote: > One post-commit question on 0aba255440: why do > haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How > does byteswapping affect whether a zero byte exists or not? I missed that it was used later when finding the rightmost one position. The placement of the comment was slightly confusing. Is: haszero64(pg_bswap64(chunk)) == pg_bswap64(haszero64(chunk)) ? If so, perhaps we can do the byte swapping outside of the loop, which might save a few cycles on longer strings and would be more readable. Regards, Jeff Davis
On Fri, Jan 19, 2024 at 11:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Thanks! I started to look at how to use this, and I have some questions. > I'd like to replace this murmurhash ussage in resowner.c with this: > > > /* > > * Most resource kinds store a pointer in 'value', and pointers are unique > > * all on their own. But some resources store plain integers (Files and > > * Buffers as of this writing), so we want to incorporate the 'kind' in > > * the hash too, otherwise those resources will collide a lot. But > > * because there are only a few resource kinds like that - and only a few > > * resource kinds to begin with - we don't need to work too hard to mix > > * 'kind' into the hash. Just add it with hash_combine(), it perturbs the > > * result enough for our purposes. > > */ > > #if SIZEOF_DATUM == 8 > > return hash_combine64(murmurhash64((uint64) value), (uint64) kind); > > #else > > return hash_combine(murmurhash32((uint32) value), (uint32) kind); > > #endif > > The straightforward replacement would be: > > fasthash_state hs; > > fasthash_init(&hs, sizeof(Datum), 0); > fasthash_accum(&hs, (char *) &kind, sizeof(ResourceOwnerDesc *)); > fasthash_accum(&hs, (char *) &value, sizeof(Datum)); > return fasthash_final32(&hs, 0); That would give the fullest mixing possible, more than currently. > But I wonder if it would be OK to abuse the 'seed' and 'tweak' > parameters to the init and final functions instead, like this: > > fasthash_state hs; > > fasthash_init(&hs, sizeof(Datum), (uint64) kind); > return fasthash_final32(&hs, (uint64) value); This would go in the other direction, and sacrifice some quality for speed. The fasthash finalizer is pretty short -- XMX, where X is "right shift and XOR" and M is "multiply". In looking at some other hash functions, it seems that's often done only if the input has already had some mixing. The Murmur finalizer has the shape XMXMX, and that seems to be the preferred way to get good mixing on a single register-sized value. For that reason, hash functions whose main loop is designed for long inputs will often skip that for small inputs and just go straight to a Murmur-style finalizer. Fasthash doesn't do that, so for a small input it ends up doing XMXM then XMX, which is a little more expensive. > I couldn't find any guidance on what properties the 'seed' and 'tweak' > have, compared to just accumulating the values with accum. Anyone know? In Postgres, I only know of one use of a seed parameter, to create two independent hash functions from hash_bytes_uint32_extended(), in brin-bloom indexes. I think that's the more typical use for a seed. Since there was no guidance with the existing hash functions, and it's a widespread concept, I didn't feel the need to put any here. We could change that. I modeled the finalizer tweak on one of the finalizers in xxHash that also used it only for the input length. Length is used as a tiebreaker where otherwise it will often not collide anyway, so it seems that's how we should think about using it elsewhere. There is a comment above fasthash_final64 mentioning that the tweak is used for length when that is not known ahead of time, but it might be good to generalize that, and maybe put it somewhere more prominent. With that in mind, I'm not sure "value" is a good fit for the tweak. "kind" is sort of in the middle because IIUC it doesn't mattter at all for pointer values, but it's important for other kinds, which would commonly collide. 
If I were to change from murmur64, I'd probably go in between the two extremes mentioned earlier, and mix "value" normally and pass "kind" as the seed: fasthash_state hs; fasthash_init(&hs, sizeof(Datum), kind); fasthash_accum(&hs, (char *) &value, sizeof(Datum)); return fasthash_final32(&hs, 0);
On Sat, Jan 20, 2024 at 7:13 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Fri, 2024-01-19 at 13:38 -0800, Jeff Davis wrote: > > One post-commit question on 0aba255440: why do > > haszero64(pg_bswap64(chunk)) rather than just haszero64(chunk)? How > > does byteswapping affect whether a zero byte exists or not? > > I missed that it was used later when finding the rightmost one > position. > > The placement of the comment was slightly confusing. Is: > > haszero64(pg_bswap64(chunk)) == pg_bswap64(haszero64(chunk)) > > ? If so, perhaps we can do the byte swapping outside of the loop, which > might save a few cycles on longer strings and would be more readable. The above identity is not true for this haszero64 macro. I phrased it as "The rest of the bytes are indeterminate", but that's not very clear. It can only be true if it set bytes for all and only those bytes where the input had zeros. In the NetBSD strlen source, there is a comment telling of a way to do this: ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f)) https://github.com/NetBSD/src/blob/trunk/common/lib/libc/arch/x86_64/string/strlen.S (They don't actually use it of course, since x86_64 is little-endian) From the commentary there, it sounds like 1 or 2 more instructions. One unmentioned assumption I had was that the byte swap would be a single instruction on all platforms where we care about performance (*). If that's not the case, we could switch to the above macro for big-endian machines. It'd be less readable since we'd then need an additional #ifdef for counting leading, rather than trailing zeros (that would avoid byte-swapping entirely). Either way, I'm afraid big-endian is stuck doing a bit of extra work somewhere. That work could be amortized by doing a quick check in the loop and afterwards completely redoing the zero check (or a bytewise check same as the unaligned path), but that would penalize short strings. (*) 32-bit platforms don't take this path, but mamba's build failed because the previously-misspelled symbol was still in the source file. We could also #ifdef around the whole aligned-path function, although it's redundant. I hope this makes it more clear. Maybe the comment could use some work.
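To put the two variants side by side (a sketch of the tradeoff described above, not the committed code): the cheap test only guarantees that the lowest marked bit corresponds to the first zero byte, while the NetBSD-style expression marks all and only the zero bytes, which is what would allow a big-endian build to count leading zeros instead of byte-swapping first.

#include <stdint.h>

/*
 * Cheap test: nonzero iff v contains a zero byte.  The high bit of the
 * first (least significant) zero byte is reliably set; bits above that are
 * indeterminate, so the result can only be searched from the low end.
 */
#define haszero64_cheap(v) \
    (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/*
 * Exact test: sets the high bit of all and only the zero bytes, at the cost
 * of an extra operation or two, so the result could be searched from either
 * end.
 */
#define haszero64_exact(v) \
    (~((((v) & UINT64_C(0x7f7f7f7f7f7f7f7f)) + UINT64_C(0x7f7f7f7f7f7f7f7f)) | \
       ((v) | UINT64_C(0x7f7f7f7f7f7f7f7f))))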
On Sat, 2024-01-20 at 13:48 +0700, John Naylor wrote: > The above identity is not true for this haszero64 macro. I see. > I hope this makes it more clear. Maybe the comment could use some > work. Yes, thank you. I don't think we need to change the algorithm. After having stepped away from this work for a couple weeks and returning to it, I think the comments and/or naming could be more clear. We first use the result of haszero64() as a boolean to break out of the loop, but then later use it in a more interesting way to count the number of remaining bytes. Perhaps you can take the comment out of the loop and just describe the algorithm we're using, and make a note that we have to byteswap first. "Indeterminate" could be explained briefly as well. These are minor comments. Regards, Jeff Davis
I wrote: > fasthash_init(&hs, sizeof(Datum), kind); > fasthash_accum(&hs, (char *) &value, sizeof(Datum)); > return fasthash_final32(&hs, 0); It occurred to me that it's strange to have two places that length can be passed. That was a side effect of the original, which used length to both know how many bytes to read, and to modify the internal seed. With the incremental API, it doesn't make sense to pass the length (or a dummy macro) up front -- with a compile-time fixed length, it can't possibly break a tie, so it's just noise. 0001 removes the length from initialization in the incremental interface. The standalone functions use length directly the same as before, but after initialization. Thoughts? Also, the fasthash_accum call is a bit verbose, because it's often used in a loop with varlen input. For register-sized values, I think it's simpler to say this, as done in the search path cache, so maybe a comment to that effect would be helpful: hs.accum = value; fasthash_combine(&hs); I noticed that we already have a more recent, stronger 64-bit mixer than murmur64: splitmix64, in pg_prng.c. We could put that, as well as a better 4-byte mixer [1] in hashfn_unstable.h, for in-memory use. Maybe with names like "hash_4bytes" etc. so it's not tied to a specific implementation. I see one simplehash case that can use it, even if the resowner hash table gets rid of it. 0002 and 0003 use fasthash for dynahash and GUC hash, respectively. These cannot use the existing cstring hashing directly because of truncation and case-folding, respectively. (Some simplehash uses can, but that can come later) On Sun, Jan 21, 2024 at 8:06 AM Jeff Davis <pgsql@j-davis.com> wrote: > > After having stepped away from this work for a couple weeks and > returning to it, I think the comments and/or naming could be more > clear. We first use the result of haszero64() as a boolean to break out > of the loop, but then later use it in a more interesting way to count > the number of remaining bytes. > > Perhaps you can take the comment out of the loop and just describe the > algorithm we're using, and make a note that we have to byteswap first. > "Indeterminate" could be explained briefly as well. v15-0004 is a stab at that. As an idea, it also renames zero_bytes_le to zero_byte_low to reflect the effect better. There might be some other comment edits needed to explain usage, so I plan to hold on to this for later. Let me know what you think. [1] Examples of both in https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp
Attachment
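As an illustration of the case-folding point (a rough sketch, not the 0003 patch; it assumes the two-argument fasthash_init proposed in 0001 above, and the exact signatures may differ): each byte has to be folded before it is mixed in, which is why the generic NUL-terminated-string path can't simply be reused for GUC names.

static uint32
example_guc_name_hash(const char *name)
{
    fasthash_state hs;

    fasthash_init(&hs, 0);

    while (*name)
    {
        char        ch = *name++;

        /* fold ASCII upper case the same way guc_name_compare() does */
        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';

        fasthash_accum(&hs, &ch, 1);
    }
    return fasthash_final32(&hs, 0);
}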
On Mon, 2024-01-22 at 09:03 +0700, John Naylor wrote: > v15-0004 is a stab at that. As an idea, it also renames zero_bytes_le > to zero_byte_low to reflect the effect better. There might be some > other comment edits needed to explain usage, so I plan to hold on to > this for later. Let me know what you think. 0004 looks good to me. No urgency so feel free to hold it until a convenient time. Regards, Jeff Davis
On Sun, 21 Jan 2024 at 03:06, Jeff Davis <pgsql@j-davis.com> wrote: > Yes, thank you. I don't think we need to change the algorithm. Jumping in here at a random point just to share my findings from poking around this on and off. I am concentrating here on cstring hashing as that is the most complicated one. One thing that caught my eye in testing was that the unaligned cstring code was unexpectedly faster for short strings (3-18B uniform distribution). Looking into it, the cause was fasthash_accum() called in the final iteration. In the unaligned case the compiler (clang-15) unrolled the inner loop which allowed it to jump directly into the correct place in the switch. In the aligned case clang decided to use a data dependent jump which then mispredicts all of the time. But given that we know the data length and we have it in a register already, it's easy enough to just mask out data past the end with a shift. See patch 1. Performance benefit is about 1.5x, measured on a small test harness that just hashes and finalizes an array of strings, with a data dependency between consecutive hashes (next address depends on the previous hash output). Unaligned case can actually take advantage of the same trick as the aligned case, it just has to shuffle the data from two consecutive words before applying the combine function. Patch 2 implements this. It makes the unaligned case almost as fast as the aligned one, both on short and long strings. 10% benefit on short strings, 50% on long ones. Not sure if the second one is worth the extra code. A different approach would be to use the simple word at a time hashing for the unaligned case too and handle word accesses that straddle a page boundary as a special case. Obviously this only makes sense for platforms that support unaligned access. On x86 unaligned access within a cache line is basically free, and across cache lines is only slightly more expensive. On benchmarks calling the aligned code on unaligned strings only has a 5% penalty on long strings, short ones are indistinguishable. I also took a look at using SIMD for implementing the hash using the same aligned access + shuffle trick. The good news is that the shuffling works well enough that neither it nor checking for string end are the longest chain. The bad news is that the data load, alignment, zero finding and masking form a big dependency chain on the first iteration. Mixing and finalization is even worse, fasthash uses a 64bit imul instruction that has a 3 cycle latency, the iteration to iteration chain is imul + xor, for 4 cycles or 2 B/cycle (in practice a bit less due to ALU port contention). In SIMD registers there is no 64bit multiply, and 32 bit multiply has a terrible 10 cycle latency on Intel. AES instructions are an interesting option, but it seems that 2 are needed for good enough mixing, at 4 cycles each, we again end up at 2B/cycle. Finalization needs another 3 AES instructions, a shuffle and a xor fold to pass SMHasher, for 17 cycles. The mix latency issue could be worked around by doing more mixing in parallel, potentially up to 8x faster, but this does not help short strings at all and would make the code way bigger. SIMD code does use fewer instructions so it interleaves better with nearby code that is not dependent on it, not sure if that matters anywhere. 
The short version is that for very long (4k+) strings the attached SIMD code is 35% faster, for short strings it is 35% slower, and this is very much x86-64-v3 only and would need a fallback when AVX and AES-NI are not available. Basically a dead end for the use cases this hash function is used for. Regards, Ants Aasma
Attachment
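The masking idea reads roughly like this (a sketch based on the description, not the attached patch; little-endian, with the count of bytes before the NUL already in hand from the zero-byte check):

#include <stdint.h>

static inline uint64_t
example_mask_tail(uint64_t chunk, int remaining)   /* 0 <= remaining < 8 */
{
    if (remaining == 0)
        return 0;

    /* keep only the low "remaining" bytes; the NUL and anything after it go away */
    return chunk & (~UINT64_C(0) >> (8 * (8 - remaining)));
}

The attraction is that fasthash_accum()'s byte-count switch drops out of the dependency chain for the final word.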
On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > But given that we know the data length and we have it in a register > already, it's easy enough to just mask out data past the end with a > shift. See patch 1. Performance benefit is about 1.5x Measured on a > small test harness that just hashes and finalizes an array of strings, > with a data dependency between consecutive hashes (next address > depends on the previous hash output). Interesting work! I've taken this idea and (I'm guessing, haven't tested) improved it by re-using an intermediate step for the conditional, simplifying the creation of the mask, and moving the bitscan out of the longest dependency chain. Since you didn't attach the test harness, would you like to run this and see how it fares? (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan to test myself as well, but since your test tries to model true latency, I'm more interested in that one. > Not sure if the second one is worth the extra code. I'd say it's not worth optimizing the case we think won't be taken anyway. I also like having a simple path to assert against.
Attachment
On Tue, 30 Jan 2024 at 12:04, John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > > But given that we know the data length and we have it in a register > > already, it's easy enough to just mask out data past the end with a > > shift. See patch 1. Performance benefit is about 1.5x Measured on a > > small test harness that just hashes and finalizes an array of strings, > > with a data dependency between consecutive hashes (next address > > depends on the previous hash output). > > Interesting work! I've taken this idea and (I'm guessing, haven't > tested) improved it by re-using an intermediate step for the > conditional, simplifying the creation of the mask, and moving the > bitscan out of the longest dependency chain. Since you didn't attach > the test harness, would you like to run this and see how it fares? > (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan > to test myself as well, but since your test tries to model true > latency, I'm more interested in that one. It didn't calculate the same result because the if (mask) condition was incorrect. Changed it to if (chunk & 0xFF) and removed the right shift from the mask. It seems to be half a nanosecond faster, but as I don't have a machine set up for microbenchmarking it's quite close to measurement noise. I didn't post the harness as it's currently so messy to be near useless to others. But if you'd like to play around, I can tidy it up a bit and post it. > > Not sure if the second one is worth the extra code. > > I'd say it's not worth optimizing the case we think won't be taken > anyway. I also like having a simple path to assert against. Agreed. As an addendum, I couldn't resist trying out using 256bit vectors with two parallel AES hashes running, unaligned loads with special casing page boundary straddling loads. Requires -march=x86-64-v3 -maes. About 20% faster than fasthash on short strings, 2.2x faster on 4k strings. Right now requires 4 bytes alignment (uses vpmaskmovd), but could be made to work with any alignment. Regards, Ants Aasma
Attachment
On Tue, Jan 30, 2024 at 7:51 PM Ants Aasma <ants.aasma@cybertec.at> wrote: > > It didn't calculate the same result because the if (mask) condition > was incorrect. Changed it to if (chunk & 0xFF) and removed the right > shift from the mask. Yes, you're quite right. > It seems to be half a nanosecond faster, but as I > don't have a machine set up for microbenchmarking it's quite close to > measurement noise. With my "throughput-ish" test, they look good: pgbench -n -T 20 -f bench_cstr_aligned.sql -M prepared | grep latency master: latency average = 490.722 ms (Ants Aasma's) v17 0001: latency average = 385.263 ms v17 0001+0002: latency average = 339.506 ms > I didn't post the harness as it's currently so messy to be near > useless to others. But if you'd like to play around, I can tidy it up > a bit and post it. I'd be curious, thanks.
Attachment
I wrote: > > It occurred to me that it's strange to have two places that length can > be passed. That was a side effect of the original, which used length > to both know how many bytes to read, and to modify the internal seed. > With the incremental API, it doesn't make sense to pass the length (or > a dummy macro) up front -- with a compile-time fixed length, it can't > possibly break a tie, so it's just noise. This was a wart, so pushed removing initial length from the incremental API. On Mon, Jan 22, 2024 at 11:16 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Mon, 2024-01-22 at 09:03 +0700, John Naylor wrote: > > v15-0004 is a stab at that. As an idea, it also renames zero_bytes_le > > to zero_byte_low to reflect the effect better. There might be some > > other comment edits needed to explain usage, so I plan to hold on to > > this for later. Let me know what you think. > > 0004 looks good to me. No urgency so feel free to hold it until a > convenient time. Thanks for looking, I pushed this along with an expanded explanation of usage. > 0002 and 0003 use fasthash for dynahash and GUC hash, respectively. > These cannot use the existing cstring hashing directly because of > truncation and case-folding, respectively. (Some simplehash uses can, > but that can come later) I've re-attached these as well as a cleaned-up version of the tail optimization. For the CF entry, the GUC hash function in this form might only be necessary if we went ahead with simple hash. We don't yet have a new benchmark to show if that's still worthwhile after 867dd2dc87 improved the one upthread. For dynahash, one tricky part seems to be the comment about the default and when it was an assertion error. I've tried to reword this, but maybe needs work. When that's in shape, I'll incorporate removing other strlen calls.
Attachment
On 22.01.24 03:03, John Naylor wrote: > I wrote: >> fasthash_init(&hs, sizeof(Datum), kind); >> fasthash_accum(&hs, (char *) &value, sizeof(Datum)); >> return fasthash_final32(&hs, 0); > It occurred to me that it's strange to have two places that length can > be passed. That was a side effect of the original, which used length > to both know how many bytes to read, and to modify the internal seed. > With the incremental API, it doesn't make sense to pass the length (or > a dummy macro) up front -- with a compile-time fixed length, it can't > possibly break a tie, so it's just noise. > > 0001 removes the length from initialization in the incremental > interface. The standalone functions use length directly the same as > before, but after initialization. Thoughts? Unrelated related issue: src/include/common/hashfn_unstable.h currently causes warnings from cpluspluscheck: /tmp/cirrus-ci-build/src/include/common/hashfn_unstable.h: In function ‘int fasthash_accum_cstring_unaligned(fasthash_state*, const char*)’: /tmp/cirrus-ci-build/src/include/common/hashfn_unstable.h:201:20: warning: comparison of integer expressions of different signedness: ‘int’ and ‘long unsigned int’ [-Wsign-compare] 201 | while (chunk_len < FH_SIZEOF_ACCUM && str[chunk_len] != '\0') | ^ and a few more like that. I think it would be better to declare various int variables and arguments as size_t instead. Even if you don't actually need the larger range, it would make it more self-documenting.
On Wed, Feb 7, 2024 at 10:41 PM Peter Eisentraut <peter@eisentraut.org> wrote: > > /tmp/cirrus-ci-build/src/include/common/hashfn_unstable.h: In function > ‘int fasthash_accum_cstring_unaligned(fasthash_state*, const char*)’: > /tmp/cirrus-ci-build/src/include/common/hashfn_unstable.h:201:20: > warning: comparison of integer expressions of different signedness: > ‘int’ and ‘long unsigned int’ [-Wsign-compare] > 201 | while (chunk_len < FH_SIZEOF_ACCUM && str[chunk_len] != '\0') > | ^ > > and a few more like that. > > I think it would be better to declare various int variables and > arguments as size_t instead. Even if you don't actually need the larger > range, it would make it more self-documenting. Thanks for the report! I can reproduce and have pushed that change.
On Tue, Jan 30, 2024 at 5:04 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > > But given that we know the data length and we have it in a register > > already, it's easy enough to just mask out data past the end with a > > shift. See patch 1. Performance benefit is about 1.5x Measured on a > > small test harness that just hashes and finalizes an array of strings, > > with a data dependency between consecutive hashes (next address > > depends on the previous hash output). > > Interesting work! I've taken this idea and (I'm guessing, haven't > tested) improved it by re-using an intermediate step for the > conditional, simplifying the creation of the mask, and moving the > bitscan out of the longest dependency chain. This needed a rebase, and is now 0001. I plan to push this soon. I also went and looked at the simplehash instances and found a few that would be easy to switch over. Rather than try to figure out which could benefit from shaving cycles, I changed all the string hashes, and one more, in 0002 so they can act as examples. 0003 uses fasthash for resowner, as suggested by Heikki upthread. Now murmur64 has no callers, but it (or similar *) could be used in pg_dump/common.c for hashing CatalogId (8 bytes). Commit 42a1de3013 added a new use for string_hash, but I can't tell from a quick glance whether it uses the truncation, so I'm going to take a closer look before re-attaching the proposed dynahash change again. * some examples here: https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp
Attachment
On Tue, Mar 5, 2024 at 5:30 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 5:04 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > > > But given that we know the data length and we have it in a register > > > already, it's easy enough to just mask out data past the end with a > > > shift. See patch 1. Performance benefit is about 1.5x Measured on a > > > small test harness that just hashes and finalizes an array of strings, > > > with a data dependency between consecutive hashes (next address > > > depends on the previous hash output). > > > > Interesting work! I've taken this idea and (I'm guessing, haven't > > tested) improved it by re-using an intermediate step for the > > conditional, simplifying the creation of the mask, and moving the > > bitscan out of the longest dependency chain. > > This needed a rebase, and is now 0001. I plan to push this soon. I held off on this because CI was failing, but it wasn't because of this. > I also went and looked at the simplehash instances and found a few > that would be easy to switch over. Rather than try to figure out which > could benefit from shaving cycles, I changed all the string hashes, > and one more, in 0002 so they can act as examples. This was the culprit. The search path cache didn't trigger this when it went in, but it seems for frontend a read past the end of malloc fails under -fsanitize=address. By the same token, I'm guessing the only reason this didn't fail for backend is because almost all strings you'd want to use as a hash key won't use a malloc'd external block. I found that adding __attribute__((no_sanitize_address)) to fasthash_accum_cstring_aligned() passes CI. While this kind of exception is warned against (for good reason), I think it's fine here given that glibc and NetBSD, and probably others, do something similar for optimized strlen(). Before I write the proper macro for that, are there any objections? Better ideas? > Commit 42a1de3013 added a new use for string_hash, but I can't tell > from a quick glance whether it uses the truncation, so I'm going to > take a closer look before re-attaching the proposed dynahash change > again. After looking, I think the thing to do here is create a hashfn_unstable.c file for global functions:
- hash_string() to replace all those duplicate definitions of hash_string_pointer() in all the frontend code
- hash_string_with_limit() for dynahash and dshash
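[A rough, self-contained illustration of the over-read behind the sanitizer complaint, not PostgreSQL code: a word-at-a-time scan loads whole 8-byte words from an aligned pointer, so the load containing the NUL terminator can extend past the end of a malloc'd buffer, though never past the aligned word it starts in. Names are made up; the real helper here is fasthash_accum_cstring_aligned().]

#include <stdint.h>
#include <stddef.h>

/* classic zero-byte detection: nonzero if some byte of v is 0x00 */
#define HASZERO64(v) \
    (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

static size_t
words_until_nul(const char *str)    /* str assumed 8-byte aligned */
{
    const uint64_t *p = (const uint64_t *) str;
    size_t      nwords = 1;

    /*
     * Each iteration loads a full 8-byte word; the load that finally
     * contains the terminator may touch bytes beyond the allocation, which
     * AddressSanitizer flags for heap memory even though the access stays
     * within one aligned word.
     */
    while (!HASZERO64(*p))
    {
        p++;
        nwords++;
    }
    return nwords;      /* includes the word holding the terminator */
}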
On Wed, 2024-03-20 at 14:26 +0700, John Naylor wrote: > This was the culprit. The search path cache didn't trigger this when > it went in, but it seems for frontend a read past the end of malloc > fails -fsantize=address. By the same token, I'm guessing the only > reason this didn't fail for backend is because almost all strings > you'd want to use as a hash key won't use a malloc'd external block. > > I found that adding __attribute__((no_sanitize_address)) to > fasthash_accum_cstring_aligned() passes CI. While this kind of > exception is warned against (for good reason), I think it's fine here > given that glibc and NetBSD, and probably others, do something > similar > for optimized strlen(). Before I write the proper macro for that, are > there any objections? Better ideas? It appears that the spelling no_sanitize_address is deprecated in clang[1] in favor of 'no_sanitize("address")'. It doesn't appear to be deprecated in gcc[2]. Aside from that, +1. Regards, Jeff Davis [1] https://clang.llvm.org/docs/AddressSanitizer.html#disabling-instrumentation-with-attribute-no-sanitize-address [2] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
On Wed, Mar 20, 2024 at 11:01 PM Jeff Davis <pgsql@j-davis.com> wrote: > > > I found that adding __attribute__((no_sanitize_address)) to > > fasthash_accum_cstring_aligned() passes CI. While this kind of > > exception is warned against (for good reason), I think it's fine here > > given that glibc and NetBSD, and probably others, do something > > similar > > for optimized strlen(). Before I write the proper macro for that, are > > there any objections? Better ideas? > > It appears that the spelling no_sanitize_address is deprecated in > clang[1] in favor of 'no_sanitize("address")'. It doesn't appear to be > deprecated in gcc[2]. Thanks for the pointers! In v20-0001, I've drafted checking that spelling first, since pg_attribute_no_sanitize_alignment has a similar version check. Then it checks for no_sanitize_address using __has_attribute, which goes back to gcc 5. That's plenty for the buildfarm and CI, and I'm not sure it's worth expending additional effort to cover more cases. (A similar attribute exists for MSVC in case it comes up.) v21-0003 adds a new file hashfn_unstable.c for convenience functions and converts all the duplicate frontend uses of hash_string_pointer. This will be where a similar hash_string_with_len will live for dynahash/dshash, which I tested some time ago. I haven't decided whether to merge that earlier work here or keep it in a separate patch, but regardless of how 0003 ends up I'd like to push 0001/0002 shortly.
Attachment
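[A sketch of the shape such a macro could take. The draft described above checks the clang spelling first via a version check, like pg_attribute_no_sanitize_alignment; the feature tests below are simplified for illustration, and the committed form may differ.]

#ifndef __has_attribute
#define __has_attribute(x) 0    /* for compilers without __has_attribute */
#endif

#if __has_attribute(no_sanitize)
#define pg_attribute_no_sanitize_address() __attribute__((no_sanitize("address")))
#elif __has_attribute(no_sanitize_address)
#define pg_attribute_no_sanitize_address() __attribute__((no_sanitize_address))
#else
#define pg_attribute_no_sanitize_address()
#endif

The attribute would then be applied to fasthash_accum_cstring_aligned(), the function that intentionally reads whole words past the terminator.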
On Wed, 2024-03-27 at 13:44 +0700, John Naylor wrote: > Thanks for the pointers! In v20-0001, I've drafted checking that > spelling first, since pg_attribute_no_sanitize_alignment has a > similar > version check. Then it checks for no_sanitize_address using > __has_attribute, which goes back to gcc 5. That's plenty for the > buildfarm and CI, and I'm not sure it's worth expending additional > effort to cover more cases. (A similar attribute exists for MSVC in > case it comes up.) 0001 looks good to me, thank you. > v21-0003 adds a new file hashfn_unstable.c for convenience functions > and converts all the duplicate frontend uses of hash_string_pointer. Why not make hash_string() inline, too? I'm fine with it either way, I'm just curious why you went to the trouble to create a new .c file so it didn't have to be inlined. Regards, Jeff Davis
On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis <pgsql@j-davis.com> wrote: > > > v21-0003 adds a new file hashfn_unstable.c for convenience functions > > and converts all the duplicate frontend uses of hash_string_pointer. > > Why not make hash_string() inline, too? I'm fine with it either way, > I'm just curious why you went to the trouble to create a new .c file so > it didn't have to be inlined. Yeah, it's a bit strange looking in isolation, and I'm not sure I'll go that route. When I was thinking of this, I also had dynahash and dshash in mind, which do indirect calls, even if the function is defined in the same file. That would still work with an inline definition in the header, just duplicated in the different translation units. Maybe that's not worth worrying about, since I imagine use cases with indirect calls will remain rare.
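[For reference, an inline hash_string() along the lines being discussed might look like this, built from the incremental API already shown in this thread: fasthash_init without a length, fasthash_accum_cstring returning the length as a side effect, and the length used as the finalizer tweak. A sketch only; the committed version may differ in detail.]

static inline uint32
hash_string(const char *s)
{
    fasthash_state hs;
    size_t      s_len;

    fasthash_init(&hs, 0);

    /* the string length falls out of the hashing pass */
    s_len = fasthash_accum_cstring(&hs, s);

    /* use the length to perturb the final mix */
    return fasthash_final32(&hs, s_len);
}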
On Tue, Mar 5, 2024 at 5:30 PM John Naylor <johncnaylorls@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 5:04 PM John Naylor <johncnaylorls@gmail.com> wrote: > > > > On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > > > But given that we know the data length and we have it in a register > > > already, it's easy enough to just mask out data past the end with a > > > shift. See patch 1. Performance benefit is about 1.5x Measured on a > > > small test harness that just hashes and finalizes an array of strings, > > > with a data dependency between consecutive hashes (next address > > > depends on the previous hash output). > > > > Interesting work! I've taken this idea and (I'm guessing, haven't > > tested) improved it by re-using an intermediate step for the > > conditional, simplifying the creation of the mask, and moving the > > bitscan out of the longest dependency chain. > > This needed a rebase, and is now 0001. I plan to push this soon. I pushed but had to revert -- my version (and I believe both) failed to keep the invariant that the aligned and unaligned paths must produce the same hash. It's clear to me how to fix, but I've injured my strong hand and won't be typing much for a couple of days. I'll prioritize the removal of strlen calls for v17, since the optimization can wait and there is also a valgrind issue I haven't looked into.
On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aasma@cybertec.at> wrote: > > But given that we know the data length and we have it in a register > already, it's easy enough to just mask out data past the end with a > shift. See patch 1. Performance benefit is about 1.5x Measured on a > small test harness that just hashes and finalizes an array of strings, > with a data dependency between consecutive hashes (next address > depends on the previous hash output). I pushed this with a couple of cosmetic adjustments, after fixing the endianness issue. I'm not sure why valgrind is fine with this way of forming the (little-endian) mask while the other ways I tried raised errors. In addition to "zero_byte_low | (zero_byte_low - 1)", I tried "~zero_byte_low & (zero_byte_low - 1)" and "zero_byte_low ^ (zero_byte_low - 1)" to no avail. On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis <pgsql@j-davis.com> wrote: > 0001 looks good to me, thank you. > > > v21-0003 adds a new file hashfn_unstable.c for convenience functions > > and converts all the duplicate frontend uses of hash_string_pointer. > > Why not make hash_string() inline, too? I'm fine with it either way, > I'm just curious why you went to the trouble to create a new .c file so > it didn't have to be inlined. Thanks for looking! I pushed these, with hash_string() inlined. I've attached (not reindented, for clarity) an update of something mentioned a few times already -- removing strlen calls for dynahash and dshash string keys. I'm not quite sure how to update the comments that deprecate calling string_hash directly. This patch goes further and semi-deprecates calling it at all, so those comments seem a bit awkward now.
Attachment
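[To spell out the masking referenced above, a sketch with illustrative names rather than the committed code: after the zero-byte test, the lowest set bit of the result marks the first NUL on a little-endian machine, and smearing it downward with "x | (x - 1)" produces a mask that keeps only the valid bytes.]

#include <stdint.h>

#define HASZERO64(v) \
    (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

static inline uint64_t
mask_past_first_nul(uint64_t chunk)     /* chunk known to contain a NUL byte,
                                         * in little-endian byte order */
{
    uint64_t    zero_byte_low = HASZERO64(chunk);
    uint64_t    mask;

    /*
     * The lowest set bit of zero_byte_low is the high bit of the first zero
     * byte (bits for higher bytes may be spurious due to borrow propagation,
     * but only the lowest one matters here). Smearing it downward covers
     * every bit up to and including that position.
     */
    mask = zero_byte_low | (zero_byte_low - 1);

    /* bytes after the NUL become zero; the NUL byte itself is already zero */
    return chunk & mask;
}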
Hi! Found that https://github.com/postgres/postgres/commit/0aba2554409ee3251d7558567edd114d8ed36dcc produces a valgrind error in initdb. Steps to reproduce:

CPPFLAGS="-DUSE_VALGRIND -Og" ./configure --enable-debug --enable-tap-tests --enable-cassert --with-icu
make
...
valgrind --quiet --exit-on-first-error=yes --error-exitcode=1 --leak-check=no --time-stamp=yes \
  --gen-suppressions=all --trace-children=yes <path-to>/initdb -k -D <path-to>/data

This gives an error:

running bootstrap script ... ok
performing post-bootstrap initialization ...
==00:00:00:01.856 967784== Conditional jump or move depends on uninitialised value(s)
==00:00:00:01.856 967784==    at 0x2F41F4: fasthash_accum (hashfn_unstable.h:136)
==00:00:00:01.856 967784==    by 0x2F41F4: fasthash_accum_cstring_aligned (hashfn_unstable.h:247)
==00:00:00:01.856 967784==    by 0x2F41F4: fasthash_accum_cstring (hashfn_unstable.h:271)
==00:00:00:01.856 967784==    by 0x2F41F4: spcachekey_hash (namespace.c:268)
==00:00:00:01.856 967784==    by 0x2F479F: nsphash_lookup (simplehash.h:836)
==00:00:00:01.856 967784==    by 0x2F479F: spcache_insert (namespace.c:379)
==00:00:00:01.856 967784==    by 0x2F533C: cachedNamespacePath (namespace.c:4236)
==00:00:00:01.856 967784==    by 0x2F5425: recomputeNamespacePath (namespace.c:4294)
==00:00:00:01.856 967784==    by 0x2F5516: RelnameGetRelid (namespace.c:875)
==00:00:00:01.856 967784==    by 0x2F6CD5: RangeVarGetRelidExtended (namespace.c:524)
==00:00:00:01.856 967784==    by 0x2DD1C7: objectNamesToOids (aclchk.c:701)
==00:00:00:01.856 967784==    by 0x2E2A9D: ExecuteGrantStmt (aclchk.c:441)
==00:00:00:01.856 967784==    by 0x61FF62: ProcessUtilitySlow (utility.c:1816)
==00:00:00:01.856 967784==    by 0x61E948: standard_ProcessUtility (utility.c:973)
==00:00:00:01.856 967784==    by 0x61EC1A: ProcessUtility (utility.c:530)
==00:00:00:01.856 967784==    by 0x61C059: PortalRunUtility (pquery.c:1158)
==00:00:00:01.856 967784== {
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:fasthash_accum
   fun:fasthash_accum_cstring_aligned
   fun:fasthash_accum_cstring
   fun:spcachekey_hash
   fun:nsphash_lookup
   fun:spcache_insert
   fun:cachedNamespacePath
   fun:recomputeNamespacePath
   fun:RelnameGetRelid
   fun:RangeVarGetRelidExtended
   fun:objectNamesToOids
   fun:ExecuteGrantStmt
   fun:ProcessUtilitySlow
   fun:standard_ProcessUtility
   fun:ProcessUtility
   fun:PortalRunUtility
}
==00:00:00:01.856 967784==
==00:00:00:01.856 967784== Exit program on first error (--exit-on-first-error=yes)
child process exited with exit code 1

The current master at b7493e1 also has this error.

With the best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Dec 19, 2024 at 7:10 AM Anton A. Melnikov <a.melnikov@postgrespro.ru> wrote: > Found that https://github.com/postgres/postgres/commit/0aba2554409ee3251d7558567edd114d8ed36dcc > produces a valgrind error in initdb. What architecture and valgrind version is this? We've been bitten before by different results on Arm vs x86. The offending code is not even my preferred way to handle the last word of the string (see f4ad0021af), so if the current way is still not valgrind-clean, I wonder if we should give up and add an exception, since we know any garbage bits are masked off. -- John Naylor Amazon Web Services
I wrote: > The offending code is not even my preferred way to handle the last > word of the string (see f4ad0021af), so if the current way is still > not valgrind-clean, I wonder if we should give up and add an > exception, since we know any garbage bits are masked off.

That would actually be a maintenance headache because the function is inlined, but here's a better idea: we already have a fallback path for when the string is not suitably aligned, or in 32-bit builds. We could just use that under Valgrind:

static inline size_t
fasthash_accum_cstring(fasthash_state *hs, const char *str)
{
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P >= 8 && !defined(USE_VALGRIND)

Any objections?

--
John Naylor
Amazon Web Services
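[In context, the guarded dispatch might look roughly like this. Only the #if condition comes from the snippet above; the alignment test and surrounding details are illustrative assumptions.]

static inline size_t
fasthash_accum_cstring(fasthash_state *hs, const char *str)
{
#if SIZEOF_VOID_P >= 8 && !defined(USE_VALGRIND)
    /* word-at-a-time fast path, skipped entirely under Valgrind */
    if (PointerIsAligned(str, uint64))
        return fasthash_accum_cstring_aligned(hs, str);
#endif
    /* byte-at-a-time path: never reads past the terminator */
    return fasthash_accum_cstring_unaligned(hs, str);
}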