Thread: Re: Reduce TupleHashEntryData struct size by half
On 18/11/2024 02:01, Jeff Davis wrote:
> TupleHashEntryData is the size of a hash bucket, and it's currently 24
> bytes. The size of the structure is important for HashAgg and other
> callers that can build up large hashtables of small tuples. Waste in
> this structure is even worse when the hash table is sparse, because the
> space is consumed whether the bucket is occupied or not.
>
> The attached patch series brings it down to 12. There are several
> unappealing hacks, so ideally we'd find something cleaner, but halving
> the bucket size seems quite desirable if it can be done in an
> acceptable way.

Makes sense.

> 0001: This patch makes the structure private, so that it's easier to
> control the way the structure is accessed.

Seems pretty uncontroversial. You removed the typedef for struct
TupleHashEntryData, which is a bit unusual for our usual source style.
Was there a reason for that?

> 0002: Removes the "additional" pointer, instead storing it right after
> the tuple, which is already stored in a separate chunk. Hack: this
> crams the "additional" pointer after the MinimalTuple in the same chunk
> of memory to avoid adding additional palloc()s.

Hmm, it would seem more straightforward to store it in the beginning,
i.e. have something like this:

struct
{
    void *additional;
    MinimalTupleData mtup;
};

Come to think of it, how important is it that we use MinimalTuple here
at all? Some other representation could be faster to deal with in
TupleHashTableMatch() anyway.

> 0003: simplehash: allow the caller to decide which entries are empty vs
> in-use rather than requiring a separate "status" field. This may limit
> other possible status values in the future (i.e. adding on to the
> enum), but I'm not sure what those other status values might be.

+1. I've wanted to have this in the past.

> 0004: Removes the "status" field from TupleHashEntryData, using
> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> additional "special" pointer value to mean "input slot" now that NULL
> means "empty".

+1

> 0005: Pack TupleHashEntryData. IIUC, this is fine even on machines that
> can't do unaligned access, so long as we are accessing the fields
> through the struct, and not taking the address of individual members.

Seems OK. I wonder if the compiler understands that the elements are
still 4-byte aligned, or if it forces byte-per-byte access? Playing
with godbolt a little, it seems like GCC at least understands it, but
clang does not. On architectures with non-strict alignment, it doesn't
matter as a simple load/store instruction is the fastest option anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
> Seems pretty uncontroversial. You removed the typedef for struct
> TupleHashEntryData, which is a bit unusual for our usual source style.
> Was there a reason for that?

If it's private to a file and I don't intend to use it a lot, I do it
to cut down typedefs.list bloat. I'll go ahead and add the typedef
back to match the style, though.

> Hmm, it would seem more straightforward to store it in the beginning,
> i.e. have something like this:
>
> struct
> {
>     void *additional;
>     MinimalTupleData mtup;
> };

That was my first approach, but it requires an additional memcpy,
because ExecCopySlotMinimalTuple() does its own palloc. I used
repalloc() because it will often have space at the end of the chunk
anyway, and not need to memcpy(). Maybe that's not significant, but it
did seem detectable in some perf tests.

But perhaps we can go further and get rid of the "additional" pointer
and inline the pergroup data and the grouping key tuple into the same
palloc chunk? That would cut out a palloc and the 8 wasted bytes on
the pointer.

> Come to think of it, how important is it that we use MinimalTuple
> here at all? Some other representation could be faster to deal with
> in TupleHashTableMatch() anyway.

What did you have in mind? That sounds like a good idea orthogonal to
reducing the bucket size.

Alternatively, MinimalTuple is not very "minimal", and perhaps we can
just make it better.

> > 0004: Removes the "status" field from TupleHashEntryData, using
> > firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> > additional "special" pointer value to mean "input slot" now that
> > NULL means "empty".
>
> +1

For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
what I did:

    const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;

or something like:

    static MinimalTupleData dummy = {0};
    const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;

?

> On architectures with non-strict alignment, it doesn't matter as a
> simple load/store instruction is the fastest option anyway.

My intuition is that the cost of dereferencing that pointer (to memory
which is not expected to be in cache) is going to be way higher than
the cost of a couple extra instructions to do the unaligned access.

Regards,
	Jeff Davis
On 18/11/2024 22:22, Jeff Davis wrote:
> On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
>> Hmm, it would seem more straightforward to store it in the beginning,
>> i.e. have something like this:
>>
>> struct
>> {
>>     void *additional;
>>     MinimalTupleData mtup;
>> };
>
> That was my first approach, but it requires an additional memcpy,
> because ExecCopySlotMinimalTuple() does its own palloc. I used
> repalloc() because it will often have space at the end of the chunk
> anyway, and not need to memcpy(). Maybe that's not significant but it
> did seem detectable in some perf tests.
>
> But perhaps we can go further and get rid of the "additional" pointer
> and inline the pergroup data and the grouping key tuple into the same
> palloc chunk? That would cut out a palloc and the 8 wasted bytes on
> the pointer.

Sounds like a good idea. Needs some changes to the TupleTableSlotOps
interface to avoid the memcpy, I presume.

>> Come to think of it, how important is it that we use MinimalTuple
>> here at all? Some other representation could be faster to deal with
>> in TupleHashTableMatch() anyway.
>
> What did you have in mind? That sounds like a good idea orthogonal to
> reducing the bucket size.

Queries that have only a small number of groups might benefit from
storing a plain Datum/isnull array instead of a MinimalTuple. That
would take more memory when you have a lot of groups though.

> Alternatively, MinimalTuple is not very "minimal", and perhaps we can
> just make it better.

Yeah. It tries to be compatible with HeapTuple, but perhaps we should
give up on that and pack it more tightly instead.

>>> 0004: Removes the "status" field from TupleHashEntryData, using
>>> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
>>> additional "special" pointer value to mean "input slot" now that
>>> NULL means "empty".
>>
>> +1
>
> For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
> what I did:
>
>     const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
>
> or something like:
>
>     static MinimalTupleData dummy = {0};
>     const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;
>
> ?

I think I'd do "(MinimalTuple) 0x1" myself, but no strong opinion.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, 13 Feb 2025 at 14:01, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached a new patchset that doesn't change the API for
> ExecCopySlotMinimalTuple(). Instead, I'm using
> ExecFetchSlotMinimalTuple(), which avoids the extra memcpy if the slot
> is TTSOpsMinimalTuple, which is what HashAgg uses. I separated out the
> change to use ExecFetchSlotMinimalTuple() into 0004 to illustrate the
> performance impact.

Some review comments:

* my_log2() takes a "long" parameter type but transitionSpace is a
"Size". These types aren't the same width on all platforms we support.
Maybe pg_nextpower2_size_t() is a better option.

* Should the following have MAXALIGN(tupleSize) on the right side of
the expression?

    tupleChunkSize = tupleSize

I understand this was missing before, but bump.c does consume at least
MAXALIGN(<req_size>) in BumpAlloc().

* while (16 * maxBlockSize > work_mem * 1024L)

The 1024L predates the change made in 041e8b95. "1024L" needs to be
replaced with "(Size) 1024". Maybe you could just replace the while
loop and the subsequent "if" check with:

    /*
     * Like CreateWorkExprContext(), use smaller sizings for smaller
     * work_mem, to avoid large jumps in memory usage.
     */
    maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);

    /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
    maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);

    /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
    maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

I believe that gives the same result without the looping.

* In hash_create_memory(), can you get rid of the minContextSize and
initBlockSize variables?

* Is it worth an Assert() check that additionalsize > 0?

 * The amount of space available is the additionalsize requested in the call
 * to BuildTupleHashTable(). If additionalsize was specified as zero, no
 * additional space is available and this function should not be called.
 */
static inline void *
TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry)
{
    return (char *) entry->firstTuple - hashtable->additionalsize;
}

Benchmarks:

I was also experimenting with the performance of this using the
following test case:

create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;
vacuum freeze analyze c;

Query: select a,count(*) from c group by a;

Average TPS over 10x 10-second runs with -M prepared:

master:                  3653.9853988
v7-0001:                 3741.815979
v7-0001+0002:            3737.4313064
v7-0001+0002+0003:       3834.6271706
v7-0001+0002+0003+0004:  3912.1158887

I also tried out changing hash_agg_check_limits() so that it no longer
calls MemoryContextMemAllocated() and instead uses ->mem_allocated
directly, and with all the other patches got:

v7-0001+0002+0003+0004+extra: 4049.0732381

We probably shouldn't do exactly that, as it'd be better not to access
that internal field from outside the memory context code. A static
inline function in memutils.h that handles the non-recursive callers
might be nice.

I've attached my results of running your test in graph form. I also
see a small regression for these small scale tests. I'm wondering if
it's worth mocking up some code to see what the performance would be
like without the additional memcpy() that's new to
LookupTupleHashEntry_internal(). How about hacking something up that
adds an additionalsize field to TupleDesc and then set that field to
your additional size and have heap_form_minimal_tuple() allocate that
much extra memory?

David
Attachment
On Fri, 2025-02-28 at 17:09 +1300, David Rowley wrote:
> * my_log2() takes a "long" parameter type but transitionSpace is a
> "Size". These types aren't the same width on all platforms we
> support. Maybe pg_nextpower2_size_t() is a better option.

Done.

> * Should the following have MAXALIGN(tupleSize) on the right side of
> the expression?

Done.

> Maybe you could just replace the while loop and the subsequent "if"
> check with:

Done.

> * In hash_create_memory(), can you get rid of the minContextSize and
> initBlockSize variables?

Done.

> * Is it worth an Assert() check that additionalsize > 0?

One caller happens to call it unconditionally. It seems better to
return NULL if additionalsize == 0.

> create table c (a int not null);
> insert into c select a from generate_Series(1,1000) a;
> vacuum freeze analyze c;

The schema you posted is the narrower table, and your numbers better
match the wide table you posted before. Was there a mixup?

> master:                  3653.9853988
> v7-0001:                 3741.815979
> v7-0001+0002:            3737.4313064
> v7-0001+0002+0003:       3834.6271706
> v7-0001+0002+0003+0004:  3912.1158887
>
> I also tried out changing hash_agg_check_limits() so that it no
> longer calls MemoryContextMemAllocated() and instead uses
> ->mem_allocated directly, and with all the other patches got:
>
> v7-0001+0002+0003+0004+extra: 4049.0732381

Great!

> We probably shouldn't do exactly that, as it'd be better not to
> access that internal field from outside the memory context code. A
> static inline function in memutils.h that handles the non-recursive
> callers might be nice.

Both the metacxt as well as the context used for byref transition
values can have child contexts, so we should keep the recursion. I
just inlined MemoryContextMemAllocated() and
MemoryContextTraverseNext().

> I've attached my results of running your test in graph form.

Thank you!

My results (with wide tables):

                    GROUP BY    EXCEPT
master:                 2151      1732
entire v8 series:       2054      1740

In some of the patches in the middle of the series, I ran into some
hard-to-explain regressions, so consider my results preliminary. I may
need to profile and figure out what's going on. I didn't see any
overall regression. But the series overall seems about even, while the
memory consumption is ~35% lower for the example I posted in the first
message in the thread.

> How about hacking something up that adds an additionalsize field to
> TupleDesc and then set that field to your additional size and have
> heap_form_minimal_tuple() allocate that much extra memory?

I assume we wouldn't want to actually add a field to TupleDescData,
right?

When I reworked the ExecCopySlotMinimalTupleExtra() API to place the
extra memory before the tuple, it worked out to be a bit cleaner with
fewer special cases, so I'm fine with that API now.

Regards,
	Jeff Davis
Attachments:
- v8-0002-Create-accessor-functions-for-TupleHashEntry.patch
- v8-0003-Inline-TupleHashEntryGetTuple-and-.GetAdditional.patch
- v8-0004-Remove-additional-pointer-from-TupleHashEntryData.patch
- v8-0005-Add-ExecCopySlotMinimalTupleExtra.patch
- v8-0006-Use-ExecCopySlotMinimalTupleExtra-to-avoid-a-memc.patch
- v8-0007-Inline-MemoryContextMemAllocated.patch
- v8-0001-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch
On Tue, 2025-03-04 at 17:28 -0800, Jeff Davis wrote:
> My results (with wide tables):
>
>                     GROUP BY    EXCEPT
> master:                 2151      1732
> entire v8 series:       2054      1740

I'm not sure what I did with the EXCEPT test above; perhaps I had
filled the test table twice by mistake or something? New numbers
below.

Data:

create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;

create table big(i int, j int);
insert into big select g, 1 from generate_series(1,10000000) g;

create table wide (t text not null);
insert into wide select repeat(md5(a::text),10)
  from generate_Series(1,1000) a;

create table wide_copy (t text not null);
insert into wide_copy select * from wide;

vacuum freeze analyze;

Results:

            c(tps)  wide(tps)  except(tps)  big(ms)  big(MiB)
91ecb5e0bc    4768       2091         2317     3853       768
master        4942       2063         2323     3831       768
0001          4932       2038         2361     3712       616
0002          4601       2034         2365     3753       616
0003          4850       2040         2418     3749       616
0004          4744       2065         2282     3665       488
0005          4630       1994         2307     3665       488
0006          4809       2063         2394     3646       488
0007          4752       2070         2479     3659       488

Note: c.sql, wide.sql, and except.sql are measured in tps, higher is
better. big.sql is measured in ms and MiB, lower is better.

For some reason I'm getting a decline of about 3% in the c.sql test
that seems to be associated with the accessor functions, even when
inlined. I'm also not seeing as much benefit from the inlining of
MemoryContextMemAllocated(). But these mixed test results are minor
compared with the memory savings of 35% and the more consistently
improved performance of 5% on the larger test (which is also
integers), so I plan to commit it.

Regards,
	Jeff Davis
Attachments:
- v9-0001-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch
- v9-0002-Create-accessor-functions-for-TupleHashEntry.patch
- v9-0003-Inline-TupleHashEntryGetTuple-and-.GetAdditional.patch
- v9-0004-Remove-additional-pointer-from-TupleHashEntryData.patch
- v9-0005-Add-ExecCopySlotMinimalTupleExtra.patch
- v9-0006-Use-ExecCopySlotMinimalTupleExtra-to-avoid-a-memc.patch
- v9-0007-Inline-MemoryContextMemAllocated.patch
- c.sql
- except.sql
- wide.sql
On Sat, 2025-03-22 at 09:39 -0700, Jeff Davis wrote:
> For some reason I'm getting a decline of about 3% in the c.sql test
> that seems to be associated with the accessor functions, even when
> inlined. I'm also not seeing as much benefit from the inlining of
> MemoryContextMemAllocated(). But these mixed test results are minor
> compared with the memory savings of 35% and the more consistently
> improved performance of 5% on the larger test (which is also
> integers), so I plan to commit it.

Committed, except for v9-0007. (Note that some commits were combined;
they were only separated originally for performance testing.)

I attached a new version of v9-0007, now v10-0001, that uses
recurse=false for two of the memory contexts. I didn't see a major
speedup, but posting here anyway.

Regards,
	Jeff Davis