Re: HashAgg degenerate case

From Jeff Davis
Subject Re: HashAgg degenerate case
Msg-id 8c60e698428962f33fdfbfab2206eedba09029fc.camel@j-davis.com
In response to Re: HashAgg degenerate case  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: HashAgg degenerate case
List pgsql-bugs
On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> I can think of two approaches to solve it:

Another thought: the point of spilling is to avoid the additional
memory that adding a new group would consume.

If the bucket array is 89.8% full, adding a new group doesn't require
any new buckets to be allocated, so we only have the firstTuple and
pergroup data to worry about. If that causes the memory limit to be
exceeded, it's the perfect time to switch to spill mode.
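
For reference, the existing check in nodeAgg.c looks roughly like the
sketch below (simplified and from memory; exact field names and
conditions may differ between versions):

    static void
    hash_agg_check_limits(AggState *aggstate)
    {
        Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                     true);
        Size    hashkey_mem = MemoryContextMemAllocated(
                    aggstate->hashcontext->ecxt_per_tuple_memory, true);

        /*
         * Once bucket metadata plus firstTuple/pergroup data exceed the
         * limit (or we pass the group-count limit), stop creating new
         * groups and start spilling.
         */
        if (aggstate->hash_ngroups_current > 0 &&
            (meta_mem + hashkey_mem > aggstate->hash_mem_limit ||
             aggstate->hash_ngroups_current > aggstate->hash_ngroups_limit))
            hash_agg_enter_spill_mode(aggstate);
    }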

But if the bucket array is 89.9% full, then adding a new group will
cause the bucket array to double. If that causes the memory limit to be
exceeded, then we can switch to spill mode, but it's wasteful to do so
because (a) we won't be using most of those new buckets; (b) the new
buckets will crowd out space for subsequent batches and even fewer of
the buckets will be used; and (c) the memory accounting can be off by
quite a bit.
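
To put rough, purely illustrative numbers on it (the per-bucket entry
size is approximate):

    1M buckets * ~24 bytes  = ~24 MB bucket array before doubling
    2M buckets * ~24 bytes  = ~48 MB after doubling
    memory the new group actually needs (firstTuple + pergroup):
    typically well under a kilobyte

So crossing the threshold can add tens of megabytes of mostly-empty
buckets in a single step, right at the moment we're deciding whether
to spill.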

What if we add a check: if the metacxt is using more than 40% of the
memory limit, and adding a new group would reach the grow_threshold,
then enter spill mode immediately? To make this work, I think we either
need to use a tuplehash_lookup() followed by a tuplehash_insert() (two
lookups for each new group), or we would need a new API into simplehash
like tuplehash_insert_without_growing() that would return NULL instead
of growing. This approach might not be backportable, but it might be a
good approach for 18+.
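
To illustrate the shape of that check, here's a rough sketch of the
lookup-then-insert variant (hypothetical code, not a patch; the helper
name and the way the 40% threshold is expressed are made up):

    /*
     * Hypothetical helper: true if the next insert would grow the bucket
     * array while bucket metadata already dominates the memory budget.
     */
    static bool
    hash_agg_would_preempt_growth(AggState *aggstate,
                                  TupleHashTable hashtable)
    {
        Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                     true);
        bool    would_grow;

        /* simplehash grows when members reaches grow_threshold on insert */
        would_grow = (hashtable->hashtab->members >=
                      hashtable->hashtab->grow_threshold);

        return would_grow && meta_mem > aggstate->hash_mem_limit * 0.4;
    }

A caller would do a tuplehash_lookup() first; on a miss, if the check
above returns true, it would enter spill mode and route the tuple to a
spill file instead of calling tuplehash_insert(). The proposed
tuplehash_insert_without_growing() would fold the same decision into
simplehash itself and avoid the double probe.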

Regards,
    Jeff Davis



