Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Default setting for enable_hashagg_disk |
Date | |
Msg-id | 20200625171756.GB12486@momjian.us Whole thread Raw |
In response to | Re: Default setting for enable_hashagg_disk (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Default setting for enable_hashagg_disk
Re: Default setting for enable_hashagg_disk |
List | pgsql-hackers |
On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote: > On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <bruce@momjian.us> wrote: > > I think my main point is that work_mem was not being honored for > > hash-agg before, but now that PG 13 can do it, we are again allowing > > work_mem not to apply in certain cases. I am wondering if our hard > > limit for work_mem is the issue, and we should make that more flexible > > for all uses. > > I mean, that's pretty much what we're talking about here, isn't it? It > seems like in your previous two replies you were opposed to separating > the plan-type limit from the execution-time limit, but that idea is > precisely a way of being more flexible (and extending it to other plan > nodes is a way of making it more flexible for more use cases). I think it is was Tom who was complaining about plan vs. execution time control. > As I think you know, if you have a system where the workload varies a > lot, you may sometimes be using 0 copies of work_mem and at other > times 1000 or more copies, so the value has to be chosen > conservatively as a percentage of system memory, else you start > swapping or the OOM killer gets involved. On the other hand, some plan > nodes get a lot less efficient when the amount of memory available > falls below some threshold, so you can't just set this to a tiny value > and forget about it. Because the first problem is so bad, most people > set the value relatively conservatively and just live with the > performance consequences. But this also means that they have memory > left over most of the time, so the idea of letting a node burst above > its work_mem allocation when something unexpected happens isn't crazy: > as long as only a few nodes do that here and there, rather than, say, > all the nodes doing it all at the same time, it's actually fine. If we > had a smarter system that could dole out more work_mem to nodes that > would really benefit from it and less to nodes where it isn't likely > to make much difference, that would be similar in spirit but even > better. I think the issue is that in PG 13 work_mem controls sorts and hashes with a new hard limit for hash aggregation: https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries. In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if was expected it to exceed work_mem, but if we assumed it would be less than work_mem and it was more, we exceeded work_mem allocation for that node. In PG 13, we "limit" memory to work_mem and spill to disk if we exceed it. We should really have always documented that hash agg could exceed work_mem for misestimation, and if we add a hash_agg work_mem misestimation bypass setting we should document this setting in work_mem as well. But then the question is why do we allow this bypass only for hash agg? Should work_mem have a settings for ORDER BY, merge join, hash join, and hash agg, e.g.: work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB' Yeah, crazy syntax, but you get the idea. I understand some nodes are more sensitive to disk spill than others, so shouldn't we be controlling this at the work_mem level, rather than for a specific node type like hash agg? We could allow for misestimation over allocation of hash agg work_mem by splitting up the hash agg values: work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB' but _avoiding_ hash agg if it is estimated to exceed work mem and spill to disk is not something to logically control at the work mem level, which leads so something like David Rowley suggested, but with different names: enable_hashagg = on | soft | avoid | off where 'on' and 'off' are the current PG 13 behavior, 'soft' means to treat work_mem as a soft limit and allow it to exceed work mem for misestimation, and 'avoid' means to avoid hash agg if it is estimated to exceed work mem. Both 'soft' and 'avoid' don't spill to disk. David's original terms of "trynospill" and "neverspill" were focused on spilling, not on its interaction with work_mem, and I found that confusing. Frankly, if it took me this long to get my head around this, I am unclear how many people will understand this tuning feature enough to actually use it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
pgsql-hackers by date: