Re: Default setting for enable_hashagg_disk - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Default setting for enable_hashagg_disk
Date
Msg-id CAH2-WzkqNc=qJ5OUhL3m=OETBx24gHWw6g1Uz52wJzdAzCTzvA@mail.gmail.com
Whole thread Raw
In response to Re: Default setting for enable_hashagg_disk  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Default setting for enable_hashagg_disk
Re: Default setting for enable_hashagg_disk
List pgsql-hackers
On Tue, Jul 14, 2020 at 12:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> - I thought the problem we were trying to solve here was that, in v12,
> if the planner thinks that your hashagg will fit in memory when really
> it doesn't, you will get good performance because we'll cheat; in v13,
> you'll get VERY bad performance because we won't.

That is the problem we started out with. I propose to solve a broader
problem that I believe mostly encompasses the original problem (it's
an "inventor's paradox" situation). Although the exact degree to which
it truly addresses the original problem will vary across
installations, I believe that it will go a very long way towards
cutting down on problems for users upgrading to Postgres 13 generally.

> - So, if hash_mem_multiplier affects both planning and execution, it
> doesn't really solve the problem. Neither does adjusting the existing
> work_mem setting. Imagine that you have two queries. The planner
> thinks Q1 will use 1GB of memory for a HashAgg but it will actually
> need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will
> actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick
> a HashAgg and perform terribly when it spills. Q2 will pick a
> GroupAggregate which will be OK but not great. If you plan with a 2GB
> memory limit, Q1 will pick a HashAgg and will not spill so now it will
> be in great shape. But Q2 will pick a HashAgg and then spill so it
> will stink. Oops.

Maybe I missed your point here. The problem is not so much that we'll
get HashAggs that spill -- there is nothing intrinsically wrong with
that. While it's true that the I/O pattern is not as sequential as a
similar group agg + sort, that doesn't seem like the really important
factor here. The really important factor is that in-memory HashAggs
can be blazingly fast relative to *any* alternative strategy -- be it
a HashAgg that spills, or a group aggregate + sort that doesn't spill,
whatever. We're mostly concerned about keeping the one available fast
strategy than we are about getting a new, generally slow strategy.

There will be no problems at all unless and until we're short on
memory, because you can just increase work_mem and everything works
out, regardless of the details. Obviously the general problems we
anticipate only crop up when increasing work_mem stops being a viable
DBA strategy.

By teaching the system to have at least a crude appreciation of the
value of memory when hashing vs when sorting, the system is often able
to give much more memory to Hash aggs (and hash joins). Increasing
hash_mem_multiplier (maybe while also decreasing work_mem) will be
beneficial when we take memory from things that don't really need so
much, like sorts (or even CTE tuplestores) -- we reduce the memory
pressure without paying a commensurate price in system throughput
(maybe even only a very small hit). As a bonus, everything going
faster may actually *reduce* the memory usage for the system as a
whole, even as individual queries use more memory.

Under this scheme, it may well not matter that you cannot cheat
(Postgres 12 style) anymore, because you'll be able to use the memory
that is available sensibly -- regardless of whether or not the group
estimates are very good (we have to have more than zero faith in the
estimates -- they can be bad without being terrible). Maybe no amount
of tuning can ever restore the desirable Postgres 12 performance
characteristics you came to rely on, but remaining "regressions" are
probably cases where the user was flying pretty close to the sun
OOM-wise all along. They may have been happy with Postgres 12, but at
a certain point that really is something that you have to view as a
fool's paradise, even if like me you happen to be a memory Keynesian.

Really big outliers tend to be rare and therefore something that the
user can afford to have go slower. It's the constant steady stream of
medium-sized hash aggs that we mostly need to worry about. To the
extent that that's true, hash_mem_multiplier addresses the problem on
the table.

> - An escape hatch that prevents spilling at execution time *does*
> solve this problem, but now suppose we add a Q3 which the planner
> thinks will use 512MB of memory but at execution time it will actually
> consume 512GB due to the row count estimate being 1024x off. So if you
> enable the escape hatch to get back to a situation where Q1 and Q2
> both perform acceptably, then Q3 makes your system OOM.

Right. Nothing stops these two things from being true at the same time.

> - If you were to instead introduce a GUC like what I proposed before,
> which allows the execution-time memory usage to exceed what was
> planned, but only by a certain margin, then you can set
> hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
> Now, no matter how you set work_mem, you're fine. Depending on the
> value you choose for work_mem, you may get group aggregates for some
> of the queries. If you set it large enough that you get hash
> aggregates, then Q1 and Q2 will avoid spilling (which works but is
> slow) because the overrun is less than 2x. Q3 will spill, so you won't
> OOM. Wahoo!

But we'll have to live with that kludge for a long time, and haven't
necessarily avoided any risk compared to the hash_mem_multiplier
alternative. I think that having a shadow memory limit for the
executor is pretty ugly.

I'm trying to come up with a setting that can sensibly be tuned at the
system level. Not an escape hatch, which seems worth avoiding.
Admittedly, this is not without its downsides.

> - If you hold my feet to the fire and ask me to choose between a
> Boolean escape hatch (rather than a multiplier-based one) and
> hash_mem_multiplier, gosh, I don't know. I guess the Boolean escape
> hatch? I mean it's a pretty bad solution, but at least if I have that
> I can get both Q1 and Q2 to perform well at the same time, and I guess
> I'm no worse off than I was in v12.

Fortunately you don't have to choose. Doing both together might make
sense, to cover any remaining user apps that still experience problems
after tuning hash_mem_multiplier. We can take a wait and see approach
to this, as Tom suggested recently.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"
Next
From: Bruce Momjian
Date:
Subject: Re: Postgres is not able to handle more than 4k tables!?