Thread: Re: Incorrect CHUNKHDRSZ in nodeAgg.c
David Rowley <dgrowleyml@gmail.com> writes:
> While reading nodeAgg.c, I noticed code that uses CHUNKHDRSZ to help
> figure out how much memory a tuple uses so the code knows when to
> spill to disk. CHUNKHDRSZ is currently set to 16 bytes, which was
> fine when that code was added, but it's a few years out-of-date since
> c6e0fe1f2 in 2022.

> The attached adjusts the 16 to sizeof(MemoryChunk), which is normally
> 8, but 16 in assert builds.

Yeah, this is more formally correct ...

> The memory accounting should be more accurate with the patch, however,
> that accounting doesn't account for blocks that the chunks are on. For
> that reason, I'm thinking of not backpatching this as it will make
> hash aggregates use a bit more memory than unpatched before spilling,
> albeit, for most cases, closer to the hash_mem limit, which is when
> the spilling should be happening.

Agreed that back-patching isn't appropriate.

I thought for a bit about whether we shouldn't try to account for
palloc power-of-2-block-size overhead here.  That omission would
typically be a far larger error than the one you are fixing.  However,
given that the inputs to hash_agg_entry_size are only estimates,
I'm not sure that we can hope to do better than the current behavior.

Should tuple hash tables be using a different memory context type
that doesn't impose that power-of-2 overhead?  It's only useful
when we expect a fair amount of pfree-and-recycle behavior, but
I think we don't have much retail entry removal in tuple hash
tables.  Could we use a generation or even bump context?

			regards, tom lane
On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I thought for a bit about whether we shouldn't try to account for
> palloc power-of-2-block-size overhead here.  That omission would
> typically be a far larger error than the one you are fixing.  However,
> given that the inputs to hash_agg_entry_size are only estimates,
> I'm not sure that we can hope to do better than the current behavior.

Likely the most correct way would be to use GetMemoryChunkSpace(),
though there would be some additional overhead to doing so, which is
something to consider. I looked at [1] and didn't see any mention of
using that function for this purpose.

> Should tuple hash tables be using a different memory context type
> that doesn't impose that power-of-2 overhead?  It's only useful
> when we expect a fair amount of pfree-and-recycle behavior, but
> I think we don't have much retail entry removal in tuple hash
> tables.  Could we use a generation or even bump context?

Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
table. If sizeof(TupleHashEntryData) were a power of two, there'd be
no wastage, as the hash table always has a power-of-two bucket count
and the product of two powers of two is itself a power of two.
Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
way to shrink it to 16 bytes.

I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context. I'd need to check if
we ever pfree() those minimal tuples or if we just throw the entire
batch of tuples away with a context reset. Since these minimal tuples
only contain the GROUP BY columns, for most cases they should be very
narrow tuples indeed, so not having a MemoryChunk header would save
quite a bit of memory in many cases. I think that's basically
1083f94da plus 6ed83d5fa for nodeAgg.c.
David

[1] https://www.postgresql.org/message-id/flat/20200325220936.il3ni2fj2j2b45y5%40alap3.anarazel.de#78592829b675371eadf592a99897bcb3
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I thought for a bit about whether we shouldn't try to account for
>> palloc power-of-2-block-size overhead here.  That omission would
>> typically be a far larger error than the one you are fixing.  However,
>> given that the inputs to hash_agg_entry_size are only estimates,
>> I'm not sure that we can hope to do better than the current behavior.

> Likely the most correct way would be to use GetMemoryChunkSpace(), but
> there might be some additional overhead to consider there.

Nah, you've got the wrong mental model.  hash_agg_entry_size is trying
to predict the average hash entry size in advance of seeing any actual
data, so that we can estimate how many entries will fit in work_mem.
By the time we can use GetMemoryChunkSpace on an actual entry, it's
too late for that.

>> Could we use a generation or even bump context?

> Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
> table.

Meh.  I guess we'd have to keep that structure in a context separate
from the tuples.  Might not be worth the trouble.

> I think what would be more interesting is seeing if we can store the
> TupleHashEntryData.firstTuple in a bump context.

Are you saying the same as above, or something different?

			regards, tom lane
On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I think what would be more interesting is seeing if we can store the
> > TupleHashEntryData.firstTuple in a bump context.
>
> Are you saying the same as above, or something different?

I thought you only meant storing the hash buckets in a bump context. I
didn't realise you meant the tuples too. Seems we were talking about
the same thing.

David
On Thu, Jan 2, 2025 at 7:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
> table. If sizeof(TupleHashEntryData) were a power-of-two, then there'd
> be no wastage as the hash table always has a power-of-two bucket count
> and two powers-of-two multiplied are always a power-of-two value.
> Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
> way to shrink it to 16 bytes.

FYI, there is a proposal for that at
https://www.postgresql.org/message-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com

--
John Naylor
Amazon Web Services
On Thu, 2025-01-02 at 13:38 +1300, David Rowley wrote:
> On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > I think what would be more interesting is seeing if we can store
> > > the TupleHashEntryData.firstTuple in a bump context.
> >
> > Are you saying the same as above, or something different?
>
> I thought you only meant store the hash buckets in a bump context. I
> didn't realise you meant the tuples too. Seems we were talking about
> the same thing.

There are several things to keep track of:

1. The bucket array, which is already in its own context (called the
metacxt). AllocSet is probably fine here. I agree that there is some
room for improvement, but remember that if the bucket array gets to any
interesting size, it will be allocated in its own block.

2. The grouping keys (called firstTuple), which are in the tablecxt.
These could benefit from the Bump allocator.

3. The pergroup states, which are also stored in the tablecxt, and
these can also benefit from the Bump allocator.

4. If the transition type is by-ref, the transition value.

If we separate out 4, we can use the Bump allocator for 2 & 3.

Regards,
	Jeff Davis
On Sat, 2025-01-04 at 09:24 +0700, John Naylor wrote:
> FYI, there is a proposal for that at
> https://www.postgresql.org/message-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com

I had intended to commit some of those patches soon, so if someone
sees a conflict with the ideas in this thread, please let me know.

Regards,
	Jeff Davis