Re: HashAgg degenerate case - Mailing list pgsql-bugs

From Andres Freund
Subject Re: HashAgg degenerate case
Date
Msg-id 3btw7dgj56k5samfzyrxnqdoiocbiycp44osza3iahzn3jqdro@ecxairlbdzqx
Whole thread Raw
In response to Re: HashAgg degenerate case  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-bugs
Hi,

On 2024-11-08 12:40:39 -0800, Jeff Davis wrote:
> On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> > I can think of two approaches to solve it:
> 
> Another thought: the point of spilling is to avoid using additional
> memory by adding a group.
> 
> If the bucket array is 89.8% full, adding a new group doesn't require
> new buckets need to be allocated, so we only have the firstTuple and
> pergroup data to worry about. If that causes the memory limit to be
> exceeded, it's the perfect time to switch to spill mode.

> But if the bucket array is 89.9% full, then adding a new group will
> cause the bucket array to double. If that causes the memory limit to be
> exceeded, then we can switch to spill mode, but it's wasteful to do so
> because (a) we won't be using most of those new buckets; (b) the new
> buckets will crowd out space for subsequent batches and even fewer of
> the buckets will be used; and (c) the memory accounting can be off by
> quite a bit.

> What if we have a check where, if the metacxt is using more than 40% of
> the memory, and if adding a new group would reach the grow_threshold,
> then enter spill mode immediately?

That assumes the bucket array is a significant portion of the memory usage -
but that's not a given, right? We also might need to grow to keep the number
of conflict chains at a manageable level.


> To make this work, I think we either need to use a tuplehash_lookup()
> followed by a tuplehash_insert() (two lookups for each new group), or we
> would need a new API into simplehash like tuplehash_insert_without_growing()
> that would return NULL instead of growing.

The former seems entirely non-viable on performance grounds.  The latter might
be ok, but I'm vary on code complexity grounds.


What about simply checking whether the bucket array was a significant portion
of the memory usage at spill time and recreating the hashtable instead of
resetting IFF it was?


> This approach might not be backportable, but it might be a good approach for
> 18+.

I doubt any change around this ought to be backpatched. The likelihood of
regressions seems to high.

Greetings,

Andres Freund



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18700: pg_dump doesn't include definitions for collations nor types (e.g. enumeration)
Next
From: RECHTÉ Marc
Date:
Subject: Very long loop breaking logical replication walsender / walreceiver connection