Re: Hash index build performance tweak from sorting - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Hash index build performance tweak from sorting |
Date | |
Msg-id | CAApHDvpiPzg8-gGAHasNDHmBsXF5+_gd+8RYyXNv6YcSo-6cpQ@mail.gmail.com Whole thread Raw |
In response to | Re: Hash index build performance tweak from sorting (Simon Riggs <simon.riggs@enterprisedb.com>) |
Responses |
Re: Hash index build performance tweak from sorting
|
List | pgsql-hackers |
On Wed, 16 Nov 2022 at 17:33, Simon Riggs <simon.riggs@enterprisedb.com> wrote: > > Thanks for the review, apologies for the delay in acting upon your comments. > > My tests show the sorted and random tests are BOTH 4.6% faster with > the v3 changes using 5-test avg, but you'll be pleased to know your > kit is about 15.5% faster than mine, comparing absolute execution > times. Thanks for the updated patch. I started to look at this again and I'm starting to think that the HashInsertState struct is the wrong approach for passing along the sorted flag to _hash_doinsert(). The reason I think this is that in hashbuild() when setting buildstate.spool to NULL, you're also making the decision about what to set the sorted flag to. However, in reality, we already know what we should be passing *every* time we call _hash_doinsert(). The only place where we can pass the sorted option as true is in _h_indexbuild() when we're doing the sorted version of the index build. Trying to make that decision any sooner seems error-prone. I understand you have made HashInsertState so that we don't need to add new parameters should we ever need to pass something else along, but I'm just thinking that if we ever need to add more, then we should just reconsider this in the future. I think for today, the better option is just to add the bool sorted as a parameter to _hash_doinsert() and pass it as true in the single case where it's valid to do so. That seems less likely that we'll inherit some options from some other place after some future modification and end up passing sorted as true when it should be false. Another reason I didn't like the HashInsertState idea is that in the v3 patch there's an HashInsertState in both HashBuildState and HSpool. Because in the normal insert path (hashinsert), we've neither a HashBuildState nor an HSpool, you're having to fake up a HashInsertStateData to pass something along to _hash_doinsert() in hashinsert(). When we're building an index, in the non-sorted index build case, you're always passing the HashInsertStateData from the HashBuildState, but when we're doing the sorted index build the one from HSpool is passed. In other words, in each of the 3 calls to _hash_doinsert(), the HashInsertStateData comes from a different place. Now, I do see that you've coded hashbuild() so both versions of the HashInsertState point to the same HashInsertStateData, but I find it unacceptable programming that in _h_spoolinit() the code palloc's the memory for the HSpool and you're setting the istate field to the HashInsertStateData that's on the stack. That just seems like a great way to end up having istate pointing to junk should the HSpool ever live beyond the hashbuild() call. If we really don't want HSpool to live beyond hashbuild(), then it too should be a local variable to hashbuild() instead of being palloc'ed in _h_spoolinit(). _h_spoolinit() could just be passed a pointer to the HSpool to populate. After getting rid of the HashInsertState code and just adding bool sorted to _hash_doinsert() and _hash_pgaddtup(), the resulting patch is much more simple: v3: src/backend/access/hash/hash.c | 19 ++++++++++++++++--- src/backend/access/hash/hashinsert.c | 40 ++++++++++++++++++++++++++++++++++------ src/backend/access/hash/hashsort.c | 8 ++++++-- src/include/access/hash.h | 14 +++++++++++--- 4 files changed, 67 insertions(+), 14 deletions(-) v4: src/backend/access/hash/hash.c | 4 ++-- src/backend/access/hash/hashinsert.c | 40 ++++++++++++++++++++++++++++-------- src/backend/access/hash/hashsort.c | 3 ++- src/include/access/hash.h | 6 ++++-- 4 files changed, 40 insertions(+), 13 deletions(-) and v4 includes 7 extra lines in hashinsert.c for the Assert() I mentioned in my previous email plus a bunch of extra comments. I'd rather see this solved like v4 is doing it. David
Attachment
pgsql-hackers by date: