Thread: Re: HashAgg degenerate case
Hi,

On 2024-11-05 16:59:56 -0800, Jeff Davis wrote:
> Fixing it seems fairly easy though: we just need to completely destroy
> the hash table each time and recreate it. Something close to the
> attached patch (rough).

That'll often be *way* slower though. Both because acquiring and
faulting-in memory is far from free and because it'd often lead to
starting to grow the hashtable from a small size again.

I think this patch would lead to way bigger regressions than the
occasionally too large hashtable does. I'm not saying that we shouldn't
do something about that, but I don't think it can be this.

Greetings,

Andres Freund
On Fri, 2024-11-08 at 11:41 -0500, Andres Freund wrote:
> That'll often be *way* slower though. Both because acquiring and
> faulting-in memory is far from free and because it'd often lead to
> starting to grow the hashtable from a small size again.

I assume this statement also applies to my more recent message?

https://www.postgresql.org/message-id/4e9bfa724b301b55a2a242ebebcddea62b2e1292.camel%40j-davis.com

> I think this patch would lead to way bigger regressions than the
> occasionally too large hashtable does. I'm not saying that we
> shouldn't do something about that, but I don't think it can be this.

The case I observed had a bucket array of ~8M buckets, which took about
200MB, while hash_mem_limit was only 128MB. I'm not quite sure how it
got into that state (probably related to the doubling behavior of the
bucket array), but once it did, it was incredibly slow because the
first tuple always triggered spilling to disk.

I can think of two approaches to solve it:

1. Detect the case where there is an obvious imbalance, like the
metacxt using 90+% of the memory limit before a batch even begins, and
then destroy and recreate the hash table.

2. Change the rules in hash_agg_check_limits() so that reasonable
progress can be made regardless of the size of metacxt. For instance,
don't count meta_mem as taking more than 75% of the limit, or always
permit a new group while we have fewer groups than 25% of
hash_ngroups_limit.

Thoughts?

Regards,
	Jeff Davis
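Concretely, option 2 might look something like the following in
nodeAgg.c's hash_agg_check_limits(). This is an untested sketch: the
75% cap and 25% floor are the placeholder factors mentioned above, and
the surrounding declarations and final test are paraphrased from memory
rather than copied from the tree.

static void
hash_agg_check_limits(AggState *aggstate)
{
    uint64  ngroups = aggstate->hash_ngroups_current;
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);
    Size    hashkey_mem = MemoryContextMemAllocated(
                    aggstate->hashcontext->ecxt_per_tuple_memory, true);
    Size    capped_meta;

    /*
     * (2b) Always permit a new group while we're below 25% of the group
     * count limit, regardless of what the memory accounting says.
     */
    if (ngroups < aggstate->hash_ngroups_limit / 4)
        return;

    /*
     * (2a) Don't count the bucket array (metacxt) as more than 75% of the
     * limit, so the group data always has some room before we spill.
     */
    capped_meta = Min(meta_mem, aggstate->hash_mem_limit * 3 / 4);

    /* Don't spill unless there's at least one group, so we make progress. */
    if (ngroups > 0 &&
        (capped_meta + hashkey_mem > aggstate->hash_mem_limit ||
         ngroups > aggstate->hash_ngroups_limit))
        hash_agg_enter_spill_mode(aggstate);
}

Either tweak alone would have been enough to keep the degenerate case
above making progress, since the oversized metacxt could no longer force
spilling on the first tuple of every batch.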
On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> I can think of two approaches to solve it:

Another thought: the point of spilling is to avoid using additional
memory by adding a group.

If the bucket array is 89.8% full, adding a new group doesn't require
any new buckets to be allocated, so we only have the firstTuple and
pergroup data to worry about. If that causes the memory limit to be
exceeded, it's the perfect time to switch to spill mode.

But if the bucket array is 89.9% full, then adding a new group will
cause the bucket array to double. If that causes the memory limit to
be exceeded, we can still switch to spill mode, but it's wasteful to
do so because (a) we won't be using most of those new buckets; (b) the
new buckets will crowd out space for subsequent batches, so even fewer
of the buckets will be used; and (c) the memory accounting can be off
by quite a bit.

What if we have a check where, if the metacxt is using more than 40%
of the memory and adding a new group would reach the grow_threshold,
we enter spill mode immediately?

To make this work, I think we either need to use a tuplehash_lookup()
followed by a tuplehash_insert() (two lookups for each new group), or
we would need a new API in simplehash, like
tuplehash_insert_without_growing(), that returns NULL instead of
growing.

This approach might not be backportable, but it might be a good
approach for 18+.

Regards,
	Jeff Davis
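For concreteness, the condition described above might look roughly like
this (a sketch only: the helper name is made up, 40% is the placeholder
threshold from the message, and it reaches into the simplehash-internal
members/grow_threshold fields, which is part of why a dedicated
simplehash API could be the cleaner route):

/*
 * Would the next tuplehash_insert() double the bucket array while the
 * metadata already dominates the memory budget?  If so, prefer spilling
 * the tuple over growing.
 */
static bool
hash_agg_spill_instead_of_growing(AggState *aggstate,
                                  TupleHashTable hashtable)
{
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);

    /* "more than 40% of the memory limit" is a placeholder factor */
    return meta_mem * 5 > aggstate->hash_mem_limit * 2 &&
        hashtable->hashtab->members >= hashtable->hashtab->grow_threshold;
}

Note that this only says we don't want to create a *new* group; tuples
that match an existing group are still fine to process, which is
exactly why the lookup/insert split, or an insert-without-growing
variant, would be needed to act on it.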
Hi,

On 2024-11-08 12:40:39 -0800, Jeff Davis wrote:
> On Fri, 2024-11-08 at 10:48 -0800, Jeff Davis wrote:
> > I can think of two approaches to solve it:
>
> Another thought: the point of spilling is to avoid using additional
> memory by adding a group.
>
> If the bucket array is 89.8% full, adding a new group doesn't require
> any new buckets to be allocated, so we only have the firstTuple and
> pergroup data to worry about. If that causes the memory limit to be
> exceeded, it's the perfect time to switch to spill mode.
>
> But if the bucket array is 89.9% full, then adding a new group will
> cause the bucket array to double. If that causes the memory limit to
> be exceeded, we can still switch to spill mode, but it's wasteful to
> do so because (a) we won't be using most of those new buckets; (b)
> the new buckets will crowd out space for subsequent batches, so even
> fewer of the buckets will be used; and (c) the memory accounting can
> be off by quite a bit.
>
> What if we have a check where, if the metacxt is using more than 40%
> of the memory and adding a new group would reach the grow_threshold,
> we enter spill mode immediately?

That assumes the bucket array is a significant portion of the memory
usage - but that's not a given, right? We also might need to grow to
keep the number of conflict chains at a manageable level.

> To make this work, I think we either need to use a tuplehash_lookup()
> followed by a tuplehash_insert() (two lookups for each new group), or
> we would need a new API in simplehash, like
> tuplehash_insert_without_growing(), that returns NULL instead of
> growing.

The former seems entirely non-viable on performance grounds. The latter
might be ok, but I'm wary of it on code complexity grounds.

What about simply checking whether the bucket array was a significant
portion of the memory usage at spill time, and recreating the hashtable
instead of resetting it IFF it was?

> This approach might not be backportable, but it might be a good
> approach for 18+.

I doubt any change around this ought to be backpatched. The likelihood
of regressions seems too high.

Greetings,

Andres Freund
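A sketch of that last suggestion, assuming the decision is made in the
reset path between batches (the helper name and the 50% threshold are
made up, the placement is hand-waved, and the rebuild branch is
essentially the earlier destroy-and-recreate patch, just gated):

/* Does the bucket array (metacxt) hold a large share of the budget? */
static bool
hash_agg_metacxt_dominates(AggState *aggstate)
{
    Size    meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                 true);

    return meta_mem > aggstate->hash_mem_limit / 2; /* 50% is a placeholder */
}

/* ...and where the hash table is currently reset between batches: */
    if (hash_agg_metacxt_dominates(aggstate))
    {
        /*
         * Oversized bucket array: destroy and recreate the hash table, as
         * in the earlier patch, so the next batch starts right-sized.
         */
    }
    else
        ResetTupleHashTable(aggstate->perhash[setno].hashtable);

That way the expensive rebuild only happens in the degenerate case,
while the common case keeps the cheap reset.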