Thread: Re: Incorrect CHUNKHDRSZ in nodeAgg.c

Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> While reading nodeAgg.c, I noticed code that uses CHUNKHDRSZ to help
> figure out how much memory a tuple uses so the code knows when to
> spill to disk.  CHUNKHDRSZ is currently set to 16 bytes, which was
> fine when that code was added, but it's a few years out-of-date since
> c6e0fe1f2 in 2022.

> The attached adjusts the 16 to sizeof(MemoryChunk), which is normally
> 8, but 16 in assert builds.

Yeah, this is more formally correct ...

> The memory accounting should be more accurate with the patch, however,
> that accounting doesn't account for blocks that the chunks are on. For
> that reason, I'm thinking of not backpatching this as it will make
> hash aggregates use a bit more memory than unpatched before spilling,
> albeit, for most cases, closer to the hash_mem limit, which is when
> the spilling should be happening.

Agreed that back-patching isn't appropriate.

I thought for a bit about whether we shouldn't try to account for
palloc power-of-2-block-size overhead here.  That omission would
typically be a far larger error than the one you are fixing.  However,
given that the inputs to hash_agg_entry_size are only estimates,
I'm not sure that we can hope to do better than the current behavior.

Should tuple hash tables be using a different memory context type
that doesn't impose that power-of-2 overhead?  It's only useful
when we expect a fair amount of pfree-and-recycle behavior, but
I think we don't have much retail entry removal in tuple hash
tables.  Could we use a generation or even bump context?

            regards, tom lane



Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
David Rowley
Date:
On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I thought for a bit about whether we shouldn't try to account for
> palloc power-of-2-block-size overhead here.  That omission would
> typically be a far larger error than the one you are fixing.  However,
> given that the inputs to hash_agg_entry_size are only estimates,
> I'm not sure that we can hope to do better than the current behavior.

Likely the most correct way would be to use GetMemoryChunkSpace(), but
there might be some additional overhead to consider there. I looked at
[1] and didn't see any mention of using that function for this
purpose. There likely is a small overhead to doing so, which is
something to consider.

> Should tuple hash tables be using a different memory context type
> that doesn't impose that power-of-2 overhead?  It's only useful
> when we expect a fair amount of pfree-and-recycle behavior, but
> I think we don't have much retail entry removal in tuple hash
> tables.  Could we use a generation or even bump context?

Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
table. If sizeof(TupleHashEntryData) were a power-of-two, then there'd
be no wastage as the hash table always has a power-of-two bucket count
and two powers-of-two multiplied are always a power-of-two value.
Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
way to shrink it to 16 bytes.

I think what would be more interesting is seeing if we can store the
TupleHashEntryData.firstTuple in a bump context. I'd need to check if
we ever pfree() those minimal tuples or if we just throw the entire
batch of tuples away with a context reset. Since these minimal tuples
only contain the GROUP BY columns, for most cases they should be very
narrow tuples indeed, so not having a MemoryChunk header would save
quite a bit of memory in many cases. I think that's basically
1083f94da plus 6ed83d5fa for nodeAgg.c.

David

[1]
https://www.postgresql.org/message-id/flat/20200325220936.il3ni2fj2j2b45y5%40alap3.anarazel.de#78592829b675371eadf592a99897bcb3



Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 2 Jan 2025 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I thought for a bit about whether we shouldn't try to account for
>> palloc power-of-2-block-size overhead here.  That omission would
>> typically be a far larger error than the one you are fixing.  However,
>> given that the inputs to hash_agg_entry_size are only estimates,
>> I'm not sure that we can hope to do better than the current behavior.

> Likely the most correct way would be to use GetMemoryChunkSpace(), but
> there might be some additional overhead to consider there.

Nah, you've got the wrong mental model.  hash_agg_entry_size is
trying to predict the average hash entry size in advance of seeing
any actual data, so that we can estimate how many entries will fit
in work_mem.  By the time we can use GetMemoryChunkSpace on an
actual entry, it's too late for that.

>> Could we use a generation or even bump context?

> Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
> table.

Meh.  I guess we'd have to keep that structure in a context separate
from the tuples.  Might not be worth the trouble.

> I think what would be more interesting is seeing if we can store the
> TupleHashEntryData.firstTuple in a bump context.

Are you saying the same as above, or something different?

            regards, tom lane



Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
David Rowley
Date:
On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I think what would be more interesting is seeing if we can store the
> > TupleHashEntryData.firstTuple in a bump context.
>
> Are you saying the same as above, or something different?

I thought you only meant store the hash buckets in a bump context. I
didn't realise you meant the tuples too. Seems we were talking about
the same thing.

David



Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
John Naylor
Date:
On Thu, Jan 2, 2025 at 7:24 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> Bump wouldn't work due to the SH_FREE() in SH_GROW() when resizing the
> table. If sizeof(TupleHashEntryData) were a power-of-two, then there'd
> be no wastage as the hash table always has a power-of-two bucket count
> and two powers-of-two multiplied are always a power-of-two value.
> Unfortunately, TupleHashEntryData is 24 bytes and I don't see any easy
> way to shrink it to 16 bytes.

FYI, there is a proposal for that at
https://www.postgresql.org/message-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com

--
John Naylor
Amazon Web Services



Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
Jeff Davis
Date:
On Thu, 2025-01-02 at 13:38 +1300, David Rowley wrote:
> On Thu, 2 Jan 2025 at 13:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > I think what would be more interesting is seeing if we can store
> > > the
> > > TupleHashEntryData.firstTuple in a bump context.
> >
> > Are you saying the same as above, or something different?
>
> I thought you only meant store the hash buckets in a bump context. I
> didn't realise you meant the tuples too. Seems we were talking about
> the same thing.

There are several things to keep track of:

1. The bucket array, which is already in its own context (called the
metacxt). AllocSet is probably fine here. I agree that there is some
room for improvement, but remember that if the bucket array gets to any
interesting size, it will be allocated in its own block.

2. The grouping keys (called firstTuple), which are in the tablecxt.
These could benefit from the Bump allocator.

3. The pergroup states, which are also stored in the tablecxt, and
these can also benefit from the Bump allocator.

4. If the transition type is by-ref, the transition value.

If we separate out 4, we can use the Bump allocator for 2 & 3.

Regards,
    Jeff Davis




Re: Incorrect CHUNKHDRSZ in nodeAgg.c

From
Jeff Davis
Date:
On Sat, 2025-01-04 at 09:24 +0700, John Naylor wrote:
> FYI, there is a proposal for that at
> https://www.postgresql.org/message-id/817d244237878cebdff0bc363718feaf49a1ea7d.camel@j-davis.com

I had intended to commit some of those patches soon, so if someone sees
a conflict with the ideas in this thread, please let me know.

Regards,
    Jeff Davis