Thread: Re: HashAgg degenerate case
Hi,

On 2024-11-05 16:59:56 -0800, Jeff Davis wrote:
> Fixing it seems fairly easy though: we just need to completely destroy
> the hash table each time and recreate it. Something close to the
> attached patch (rough).

That'll often be *way* slower though. Both because acquiring and
faulting-in memory is far from free and because it'd often lead to
starting to grow the hashtable from a small size again.

I think this patch would lead to way bigger regressions than the
occasionally too large hashtable does. I'm not saying that we shouldn't
do something about that, but I don't think it can be this.

Greetings,

Andres Freund
On Fri, 2024-11-08 at 11:41 -0500, Andres Freund wrote:
> That'll often be *way* slower though. Both because acquiring and
> faulting-in memory is far from free and because it'd often lead to
> starting to grow the hashtable from a small size again.

I assume this statement also applies to my more recent message?

https://www.postgresql.org/message-id/4e9bfa724b301b55a2a242ebebcddea62b2e1292.camel%40j-davis.com

> I think this patch would lead to way bigger regressions than the
> occasionally too large hashtable does. I'm not saying that we
> shouldn't do something about that, but I don't think it can be this.

The case I observed had a bucket array of ~8M buckets, which took about
200MB, while hash_mem_limit was only 128MB. I'm not quite sure how it
got into that state (probably related to the doubling behavior of the
bucket array), but once it did, it was incredibly slow because the
first tuple always triggered spilling to disk.

I can think of two approaches to solve it:

1. Detect the case where there is an obvious imbalance, like the
metacxt using 90+% of the memory limit before a batch even begins, and
then destroy and recreate the hash table.

2. Change the rules in hash_agg_check_limits() so that reasonable
progress can be made regardless of the size of metacxt. For instance,
don't count meta_mem as taking more than 75% of the limit, or always
permit a new group while we have fewer groups than 25% of
hash_ngroups_limit.

Thoughts?

Regards,
	Jeff Davis
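Concretely, option 2 might look something like the following in
nodeAgg.c's hash_agg_check_limits(). This is an untested sketch: the
75% cap and 25% floor are the placeholder factors mentioned above, and
the surrounding declarations and final test are paraphrased from memory
rather than copied from the tree.

static void
hash_agg_check_limits(AggState *aggstate)
{
    uint64  ngroups = aggstate->hash_ngroups_current;
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);
    Size    hashkey_mem = MemoryContextMemAllocated(
                    aggstate->hashcontext->ecxt_per_tuple_memory, true);
    Size    capped_meta;

    /*
     * (2b) Always permit a new group while we're below 25% of the group
     * count limit, regardless of what the memory accounting says.
     */
    if (ngroups < aggstate->hash_ngroups_limit / 4)
        return;

    /*
     * (2a) Don't count the bucket array (metacxt) as more than 75% of the
     * limit, so the group data always has some room before we spill.
     */
    capped_meta = Min(meta_mem, aggstate->hash_mem_limit * 3 / 4);

    /* Don't spill unless there's at least one group, so we make progress. */
    if (ngroups > 0 &&
        (capped_meta + hashkey_mem > aggstate->hash_mem_limit ||
         ngroups > aggstate->hash_ngroups_limit))
        hash_agg_enter_spill_mode(aggstate);
}

Either tweak alone would have been enough to keep the degenerate case
above making progress, since the oversized metacxt could no longer force
spilling on the first tuple of every batch.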
On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> I can think of two approaches to solve it:

Another thought: the point of spilling is to avoid using additional
memory by adding a group.

If the bucket array is 89.8% full, adding a new group doesn't require
any new buckets to be allocated, so we only have the firstTuple and
pergroup data to worry about. If that causes the memory limit to be
exceeded, it's the perfect time to switch to spill mode.

But if the bucket array is 89.9% full, then adding a new group will
cause the bucket array to double. If that causes the memory limit to
be exceeded, we can still switch to spill mode, but it's wasteful to
do so because (a) we won't be using most of those new buckets; (b) the
new buckets will crowd out space for subsequent batches, so even fewer
of the buckets will be used; and (c) the memory accounting can be off
by quite a bit.

What if we have a check where, if the metacxt is using more than 40%
of the memory and adding a new group would reach the grow_threshold,
we enter spill mode immediately?

To make this work, I think we either need to use a tuplehash_lookup()
followed by a tuplehash_insert() (two lookups for each new group), or
we would need a new API in simplehash, like
tuplehash_insert_without_growing(), that returns NULL instead of
growing.

This approach might not be backportable, but it might be a good
approach for 18+.

Regards,
	Jeff Davis
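For concreteness, the condition described above might look roughly like
this (a sketch only: the helper name is made up, 40% is the placeholder
threshold from the message, and it reaches into the simplehash-internal
members/grow_threshold fields, which is part of why a dedicated
simplehash API could be the cleaner route):

/*
 * Would the next tuplehash_insert() double the bucket array while the
 * metadata already dominates the memory budget?  If so, prefer spilling
 * the tuple over growing.
 */
static bool
hash_agg_spill_instead_of_growing(AggState *aggstate,
                                  TupleHashTable hashtable)
{
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);

    /* "more than 40% of the memory limit" is a placeholder factor */
    return meta_mem * 5 > aggstate->hash_mem_limit * 2 &&
        hashtable->hashtab->members >= hashtable->hashtab->grow_threshold;
}

Note that this only says we don't want to create a *new* group; tuples
that match an existing group are still fine to process, which is
exactly why the lookup/insert split, or an insert-without-growing
variant, would be needed to act on it.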
Hi,

On 2024-11-08 12:40:39 -0800, Jeff Davis wrote:
> On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> > I can think of two approaches to solve it:
>
> Another thought: the point of spilling is to avoid using additional
> memory by adding a group.
>
> If the bucket array is 89.8% full, adding a new group doesn't require
> any new buckets to be allocated, so we only have the firstTuple and
> pergroup data to worry about. If that causes the memory limit to be
> exceeded, it's the perfect time to switch to spill mode.
>
> But if the bucket array is 89.9% full, then adding a new group will
> cause the bucket array to double. If that causes the memory limit to
> be exceeded, we can still switch to spill mode, but it's wasteful to
> do so because (a) we won't be using most of those new buckets; (b)
> the new buckets will crowd out space for subsequent batches, so even
> fewer of the buckets will be used; and (c) the memory accounting can
> be off by quite a bit.
>
> What if we have a check where, if the metacxt is using more than 40%
> of the memory and adding a new group would reach the grow_threshold,
> we enter spill mode immediately?

That assumes the bucket array is a significant portion of the memory
usage - but that's not a given, right? We also might need to grow to
keep the number of conflict chains at a manageable level.

> To make this work, I think we either need to use a tuplehash_lookup()
> followed by a tuplehash_insert() (two lookups for each new group), or
> we would need a new API in simplehash, like
> tuplehash_insert_without_growing(), that returns NULL instead of
> growing.

The former seems entirely non-viable on performance grounds. The latter
might be ok, but I'm wary of it on code complexity grounds.

What about simply checking whether the bucket array was a significant
portion of the memory usage at spill time, and recreating the hashtable
instead of resetting it IFF it was?

> This approach might not be backportable, but it might be a good
> approach for 18+.

I doubt any change around this ought to be backpatched. The likelihood
of regressions seems too high.

Greetings,

Andres Freund
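A sketch of that last suggestion, assuming the decision is made in the
reset path between batches (the helper name and the 50% threshold are
made up, the placement is hand-waved, and the rebuild branch is
essentially the earlier destroy-and-recreate patch, just gated):

/* Does the bucket array (metacxt) hold a large share of the budget? */
static bool
hash_agg_metacxt_dominates(AggState *aggstate)
{
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);

    return meta_mem > aggstate->hash_mem_limit / 2; /* 50% is a placeholder */
}

/* ...and where the hash table is currently reset between batches: */
    if (hash_agg_metacxt_dominates(aggstate))
    {
        /*
         * Oversized bucket array: destroy and recreate the hash table, as
         * in the earlier patch, so the next batch starts right-sized.
         */
    }
    else
        ResetTupleHashTable(aggstate->perhash[setno].hashtable);

That way the expensive rebuild only happens in the degenerate case,
while the common case keeps the cheap reset.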