Home > mailing lists

Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: Default setting for enable_hashagg_disk
Date	June 25, 2020 20:17:56
Msg-id	20200625171756.GB12486@momjian.us Whole thread Raw
In response to	Re: Default setting for enable_hashagg_disk (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Default setting for enable_hashagg_disk Re: Default setting for enable_hashagg_disk
List	pgsql-hackers

Tree view

On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:
> On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I think my main point is that work_mem was not being honored for
> > hash-agg before, but now that PG 13 can do it, we are again allowing
> > work_mem not to apply in certain cases.  I am wondering if our hard
> > limit for work_mem is the issue, and we should make that more flexible
> > for all uses.
> 
> I mean, that's pretty much what we're talking about here, isn't it? It
> seems like in your previous two replies you were opposed to separating
> the plan-type limit from the execution-time limit, but that idea is
> precisely a way of being more flexible (and extending it to other plan
> nodes is a way of making it more flexible for more use cases).

I think it is was Tom who was complaining about plan vs. execution time
control.

> As I think you know, if you have a system where the workload varies a
> lot, you may sometimes be using 0 copies of work_mem and at other
> times 1000 or more copies, so the value has to be chosen
> conservatively as a percentage of system memory, else you start
> swapping or the OOM killer gets involved. On the other hand, some plan
> nodes get a lot less efficient when the amount of memory available
> falls below some threshold, so you can't just set this to a tiny value
> and forget about it. Because the first problem is so bad, most people
> set the value relatively conservatively and just live with the
> performance consequences. But this also means that they have memory
> left over most of the time, so the idea of letting a node burst above
> its work_mem allocation when something unexpected happens isn't crazy:
> as long as only a few nodes do that here and there, rather than, say,
> all the nodes doing it all at the same time, it's actually fine. If we
> had a smarter system that could dole out more work_mem to nodes that
> would really benefit from it and less to nodes where it isn't likely
> to make much difference, that would be similar in spirit but even
> better.

I think the issue is that in PG 13 work_mem controls sorts and hashes
with a new hard limit for hash aggregation:

    https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

    Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
    tables are used in hash joins, hash-based aggregation, and hash-based
    processing of IN subqueries.

In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if
was expected it to exceed work_mem, but if we assumed it would be less
than work_mem and it was more, we exceeded work_mem allocation for that
node.  In PG 13, we "limit" memory to work_mem and spill to disk if we
exceed it.

We should really have always documented that hash agg could exceed
work_mem for misestimation, and if we add a hash_agg work_mem
misestimation bypass setting we should document this setting in work_mem
as well.

But then the question is why do we allow this bypass only for hash agg? 
Should work_mem have a settings for ORDER BY, merge join, hash join, and
hash agg, e.g.:

    work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'

Yeah, crazy syntax, but you get the idea.  I understand some nodes are
more sensitive to disk spill than others, so shouldn't we be controlling
this at the work_mem level, rather than for a specific node type like
hash agg?  We could allow for misestimation over allocation of hash agg
work_mem by splitting up the hash agg values:

    work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB'

but _avoiding_ hash agg if it is estimated to exceed work mem and spill
to disk is not something to logically control at the work mem level,
which leads so something like David Rowley suggested, but with different
names:

    enable_hashagg = on | soft | avoid | off

where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
treat work_mem as a soft limit and allow it to exceed work mem for
misestimation, and 'avoid' means to avoid hash agg if it is estimated to
exceed work mem.  Both 'soft' and 'avoid' don't spill to disk.

David's original terms of "trynospill" and "neverspill" were focused on
spilling, not on its interaction with work_mem, and I found that
confusing.

Frankly, if it took me this long to get my head around this, I am
unclear how many people will understand this tuning feature enough to
actually use it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

pgsql-hackers by date:

From: Jeff Davis
Date: 25 June 2020, 20:15:52
Subject: Re: Default setting for enable_hashagg_disk

From: Jehan-Guillaume de Rorthais
Date: 25 June 2020, 20:27:54
Subject: Re: [patch] demote

Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers

Previous

Next