
From James Hunter
Subject Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date
Msg-id CAJVSvF6DG-KDdRwMZrK4pg1ehRAhgAGYZS2_=FhUYso+cLGsDw@mail.gmail.com
List pgsql-hackers

On Wed, Jan 22, 2025 at 1:13 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 1/10/25 19:00, James Hunter wrote:
> > ...
> > I wouldn’t change the existing planning logic (at least not in the
> > initial implementation). So the existing planning logic would choose
> > between different SQL operators, still on the assumption that every
> > operator that needs working memory will get work_mem [*
> > hash_mem_multiplier].
>
> All this seems generally feasible, but it hinges on whether the
> to-be-described algorithm can distribute the memory in a sensible way
> that doesn't affect the costing too much. If we plan a hash join with
> nbatch=1, and then come back and make it use nbatch=1024, maybe we
> wouldn't have used hash join at all. Not sure.

I see two problems:
(a) For a given query plan, minimizing spilling without exceeding
memory limits (and thus crashing);
(b) For a given amount of available memory, choosing the optimal plan.

Both problems exist today, and PostgreSQL offers the same tools to
address both: work_mem and hash_mem_multiplier. I argue that these
tools are inadequate for (a), but I think they work reasonably well
for (b).

I propose to solve problem (a), but not (b).

In your example, PostgreSQL plans a Hash Join with nbatch = 1 because
the planner's "nbytes" estimate for working memory is less than
(hash_mem_multiplier * work_mem). This implicitly assumes that the
PostgreSQL *instance* has at least that much memory available.

If it turns out that the instance doesn't have that much memory
available, then the Hash Join will crash. That's the current behavior.

It would be better if, instead, we used nbatch=1024 for the Hash Join,
so we *didn't* crash. (Note that your example implies that work_mem is
set to 1,024x available memory!) This is problem (a). But then, as you
point out, it might be *even better* if we gave up on Hash Join
altogether, and just went with Nested Loop. This is problem (b).
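
To make the nbatch blow-up concrete, here's a standalone sketch of the
doubling behavior (illustrative only -- the real logic is in
ExecChooseHashTableSize(), in nodeHash.c, and is more involved; the
8GB/8MB numbers are just picked to reproduce your 1,024x ratio):

#include <stdio.h>

/* Illustrative only: double nbatch until one batch's share of the
 * build side fits under the memory budget. */
static int
choose_nbatch(double build_bytes, double budget_bytes)
{
    int     nbatch = 1;

    while (build_bytes / nbatch > budget_bytes)
        nbatch *= 2;

    return nbatch;
}

int
main(void)
{
    double  build_bytes = 8.0 * 1024 * 1024 * 1024;  /* 8GB build side */
    double  work_mem = 8.0 * 1024 * 1024 * 1024;     /* GUC says 8GB */
    double  available = work_mem / 1024;             /* instance has 8MB */

    /* Planner's view: the hash table "fits", so it plans nbatch = 1. */
    printf("planned nbatch = %d\n", choose_nbatch(build_bytes, work_mem));

    /* Runtime reality: only 8MB to spare, so we'd need nbatch = 1024. */
    printf("actual  nbatch = %d\n", choose_nbatch(build_bytes, available));

    return 0;
}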

Today, "work_mem" can be set too high or too low. I argue that there's
no way to avoid one or the other, and:
1. If "work_mem" is set too high -- PostgreSQL currently crashes. With
my proposal, it (a) would not crash, but (b) would possibly execute a
sub-optimal plan.
2. If "work_mem" is set too low -- PostgreSQL currently (a) spills
unnecessarily, or (b) chooses a sub-optimal plan. With my proposal, it
would (a) not spill unnecessarily, but would (b) still execute a
sub-optimal plan (if chosen).

I am not proposing to solve (b), the generation of optimal plans, when
memory constraints are a priori unknown. I see that as a separate,
lower-priority problem -- one that would require multi-pass
compilation, which I would like to avoid.

> The fundamental issue seems to be not having any information about how
> much memory might be available to the operator. And in principle we
> can't have that during the bottom-up part of the planning, until after
> we construct the whole plan. Only at that point we know how many
> operators will need work_mem.

It's not just bottom-up vs. top-down: it's multi-pass. I think you
would need something beyond Oracle's Cost-Based Query Transformation
[1], where PostgreSQL generates all possible query path trees, up
front, to get an estimate for the total working-memory requested for
each tree. Then we could distribute "query_work_mem" to each tree, and
then compute the costs once we knew how much working memory each path
node on the tree would actually get.

The algorithm described in the previous paragraph is certainly
*possible*, but it's not remotely *feasible*, in general, because it
requires generating way too many states to cost. For one thing,
instead of costing path nodes individually, it has to cost entire
trees; and there are far more trees than there are tree nodes. For
another: today, PostgreSQL's optimizer goes out of its way not to
generate obviously-bad path nodes, but in the above algorithm, there's
no way to know whether a path is bad until after you've generated the
entire tree.

Anyway, to come up with something feasible, we'd have to apply
heuristics to prune trees, etc. And it's not clear to me that, after
all of that code complexity and run-time CPU cost, the result would be
any better than just leaving the optimizer as it is. "Better the devil
you know..." etc.

My proposal doesn't try to solve (b), instead relying on the customer
to provide a reasonable "work_mem" estimate, for use by the optimizer.
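
For (a), the post-planning distribution step could be as simple as
scaling each node's planner estimate to fit the budget. Here is a
minimal standalone sketch, assuming a proportional policy; the names
and structures are hypothetical, not a proposed PostgreSQL API:

#include <stdio.h>

/* Hypothetical: one entry per memory-consuming node in a finished plan. */
typedef struct PlanNodeMem
{
    const char *node;          /* e.g. "Hash Join", "Sort" */
    double      request_kb;    /* planner's working-memory estimate */
    double      grant_kb;      /* what the executor may actually use */
} PlanNodeMem;

/* Scale the nodes' requests so their total stays within query_work_mem. */
static void
distribute_query_work_mem(PlanNodeMem *nodes, int nnodes,
                          double query_work_mem_kb)
{
    double      total_request = 0.0;

    for (int i = 0; i < nnodes; i++)
        total_request += nodes[i].request_kb;

    for (int i = 0; i < nnodes; i++)
    {
        if (total_request <= query_work_mem_kb)
            nodes[i].grant_kb = nodes[i].request_kb;   /* everything fits */
        else
            nodes[i].grant_kb = query_work_mem_kb *
                (nodes[i].request_kb / total_request); /* scale down */
    }
}

int
main(void)
{
    PlanNodeMem nodes[] = {
        {"Hash Join", 8.0 * 1024 * 1024, 0.0},  /* 8GB requested */
        {"Sort", 64.0 * 1024, 0.0},             /* 64MB requested */
    };

    distribute_query_work_mem(nodes, 2, 1024.0 * 1024);  /* 1GB budget */

    for (int i = 0; i < 2; i++)
        printf("%-9s request %.0f kB -> grant %.0f kB\n",
               nodes[i].node, nodes[i].request_kb, nodes[i].grant_kb);

    return 0;
}

The point is only that the budget is enforced per node after planning,
so an over-optimistic work_mem makes a node spill (or run a sub-optimal
plan) instead of crashing the instance.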

> Could we get at least some initial estimate how many such operators the
> query *might* end up using? Maybe that'd be just enough a priori
> information to set the effective work_mem for the planning part to make
> this practical.

You could use the # of base relations as a proxy for the # of joins, but
I am not convinced this would improve the optimizer's decision.
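
If someone wanted to experiment with that anyway, the back-of-envelope
version would look something like this (the "roughly one memory-hungry
operator per base relation" guess is mine, purely for illustration):

#include <stdio.h>

/* Purely illustrative: derive an effective planning-time work_mem from
 * a crude operator-count guess based on the number of base relations. */
static double
planning_work_mem_kb(double query_work_mem_kb, int num_base_rels)
{
    int     guessed_operators = num_base_rels;   /* ~ (joins + a sort) */

    if (guessed_operators < 1)
        guessed_operators = 1;

    return query_work_mem_kb / guessed_operators;
}

int
main(void)
{
    /* 1GB query budget, 4 base relations => plan as if work_mem = 256MB */
    printf("effective work_mem = %.0f kB\n",
           planning_work_mem_kb(1024.0 * 1024, 4));
    return 0;
}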

> > This assumption might understate or overstate
> > the actual working memory we’ll give the operator, at runtime. If it
> > understates, the planner will be biased in favor of operators that
> > don’t use much working memory. If it overstates, the planner will be
> > biased in favor of operators that use too much working memory.
> >
>
> I'm not quite sure I understand this part. Could you elaborate?

I was just trying to express what you more clearly restated: if
work_mem is too high, vs. the actual memory available on the instance,
then the optimizer will choose Hash Join, even though the optimal
choice might be Nested Loop.

> > (We could add a feedback loop to the planner, or even something simple
> > like generating multiple paths, at different “work_mem” limits, but
> > everything I can think of here adds complexity without much potential
> > benefit. So I would defer any changes to the planner behavior until
> > later, if ever.)
> >
>
> What would be the feedback? I can imagine improving the estimate of how
> much memory a given operator needs during the bottom-up phase, but it
> doesn't quite help with knowing what will happen above the current node.

Something like the algorithm I sketched above, in this email, might
work. Of course, it would have to be modified with heuristics, etc.,
to reduce the state space to something manageable...

But my point is just that any feedback loop, or running the optimizer
at different "work_mem" limits, is a bad idea. Leaving the optimizer
as it is seems the least bad choice.

Thanks for your comments,
James

[1] https://dl.acm.org/doi/10.5555/1182635.1164215


