Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators - Mailing list pgsql-hackers

From: James Hunter
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date:
Msg-id: CAJVSvF6bDfS1_MoqH+RNHFXTo4ZNkcHLj+fgZ+C0RSL2sVeQzQ@mail.gmail.com
In response to: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

On Wed, Feb 26, 2025 at 4:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> My idea was that we'd attach work_mem to each Path node and Plan node
> at create time. For example, in create_agg_path() it could do:
>
>   pathnode->path.node_work_mem = work_mem;
>
> And then add to copy_generic_path_info():
>
>   dest->node_work_mem = src->node_work_mem;
>
> (and similarly for other nodes; at least those that care about
> work_mem)
>
> Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
> work_mem.

This is essentially what patches 2 and 3 do. Some comments:

First, there's no need to set the workmem_limit at Path time, since
it's not needed until the Plan is initialized into a PlanState. So I
set it on the Plan but not on the Path.
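
Conceptually, the new field just lives on the Plan node, next to the
other per-node execution parameters -- something like this (a
hand-written sketch, not the patch's literal declaration):

typedef struct Plan
{
    /* ... existing fields: costs, plan_rows, targetlist, lefttree, ... */

    int         workmem_limit; /* working-memory limit for this node,
                                * in kB; assigned after planning,
                                * before ExecInitNode() runs */
} Plan;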

Second, the logic that assigns a workmem_limit to the Agg node is a
bit more complicated than in your example, because the Agg could be
either a Hash or a Sort. If it's a Hash, it gets work_mem *
hash_mem_multiplier; and if it's a Sort, it gets either 0 or work_mem.

We can adjust the logic so that it gets work_mem instead of 0, by
pushing the complexity out of the original workmem_limit assignment
and into later code blocks -- e.g., in an extension -- but we still
need to decide whether the Agg is a Hash or a Sort. This is why Patch
2 does:

switch (agg->aggstrategy)
{
    case AGG_HASHED:
    case AGG_MIXED:

        /*
         * Because nodeAgg.c will combine all AGG_HASHED nodes into a
         * single phase, it's easier to store the hash working-memory
         * limit on the first AGG_{HASHED,MIXED} node, and set it to zero
         * for all subsequent AGG_HASHED nodes.
         */
        agg->plan.workmem_limit = is_first ?
            get_hash_memory_limit() / 1024 : 0;
        break;
    case AGG_SORTED:

        /*
         * Also store the sort-output working-memory limit on the first
         * AGG_SORTED node, and set it to zero for all subsequent
         * AGG_SORTED nodes.
         *
         * We'll need working-memory to hold the "sort_out" only if this
         * isn't the last Agg node (in which case there's no one to sort
         * our output).
         */
        agg->plan.workmem_limit = *is_first_sort && !is_last ?
            work_mem : 0;

        *is_first_sort = false;
        break;
    default:
        break;
}

Notice that the logic also sets the limit to 0 on certain Agg nodes --
this could be avoided, at the cost of added complexity later. The
added complexity arises because, for Hash Aggs, all hash tables share
the same overall workmem_limit, so any workmem_limit set on subsequent
hash Agg nodes would be ignored; setting such a limit would only
obscure the underlying logic.

> Similarly, we could track the node_mem_wanted, which would be the
> estimated amount of memory the node would use if unlimited memory were
> available. I believe that's already known (or a simple calculation) at
> costing time, and we can propagate it from the Path to the Plan the
> same way.

And this is all that Patch 3 does. As you point out, the estimate is
already known at costing time or, if not, it's a simple calculation.
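
For example, for a Sort, the "simple calculation" is essentially the
arithmetic cost_sort() already does: estimate the input size from the
row count and width, then convert to kB. Roughly (a hand-written
illustration with made-up names, not the patch's code):

static int
sort_path_workmem_estimate(Path *subpath)
{
    /* same formula as relation_byte_size(), which cost_sort() uses */
    double      input_bytes = subpath->rows *
        (MAXALIGN(subpath->pathtarget->width) +
         MAXALIGN(SizeofHeapTupleHeader));

    /* clamp to int kB, the same units as work_mem */
    return (int) Min(input_bytes / 1024.0 + 1.0, (double) INT_MAX);
}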

> (A variant of this approach could carry the values into the PlanState
> nodes as well, and the executor would use that value instead.)

That's not needed, though, and would violate existing PG conventions:
we don't copy anything from Plan to PlanState, because the assumption
is that the PlanState always has access to its corresponding Plan.
(The reason we copy from Path to Plan, I believe, is that we drop all
Paths to save memory, since we generally have many more Paths than
Plans.)
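
To illustrate the existing convention (using the workmem_limit field
assumed by this patch set): an executor node simply reaches its Plan
through its PlanState, so there's nothing to copy:

Agg        *aggnode = (Agg *) aggstate->ss.ps.plan;
int         limit_kb = aggnode->plan.workmem_limit;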

> Extensions like yours could have a GUC like ext.query_work_mem and use
> planner_hook to modify plan after standard_planner is done, walking the
> tree and adjusting each Plan node's node_work_mem to obey
> ext.query_work_mem. Another extension might hook in at path generation
> time with set_rel_pathlist_hook or set_join_pathlist_hook to create
> paths with lower node_work_mem settings that total up to
> ext.query_work_mem.

I don't set workmem_limit at Path time, because it's not needed
there; but I do set the workmem (estimate) at Path time, exactly so
that future optimizer hooks can use a Path's workmem (estimate) to
decide between different Paths.
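
To make the planner_hook scenario you describe concrete: once
workmem_limit lives on the Plan tree, an extension can rewrite the
limits with whatever policy it likes, before ExecInitNode() reads
them. A rough sketch (hypothetical names and GUC, just to show the
shape -- not real patch or extension code):

#include "postgres.h"
#include "nodes/plannodes.h"
#include "optimizer/planner.h"

static int  ext_query_work_mem = 262144;    /* would really be a GUC; kB */

static int
sum_workmem(Plan *plan)
{
    if (plan == NULL)
        return 0;
    return plan->workmem_limit +
        sum_workmem(plan->lefttree) + sum_workmem(plan->righttree);
}

static void
rescale_workmem(Plan *plan, double factor)
{
    if (plan == NULL)
        return;
    plan->workmem_limit = (int) (plan->workmem_limit * factor);
    rescale_workmem(plan->lefttree, factor);
    rescale_workmem(plan->righttree, factor);
    /* a real walker must also visit Append children, subplans, etc. */
}

static PlannedStmt *
ext_planner(Query *parse, const char *query_string,
            int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *stmt = standard_planner(parse, query_string,
                                         cursorOptions, boundParams);
    int         total = sum_workmem(stmt->planTree);

    if (total > ext_query_work_mem)
        rescale_workmem(stmt->planTree,
                        (double) ext_query_work_mem / total);
    return stmt;
}

(installed from _PG_init() via "planner_hook = ext_planner", as usual)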

Patch 3 sets workmem (estimate) on the Path and copies it to the Plan.
As you know, there's deliberately not a 1-1 correspondence between
Path and Plan (the way there is generally a 1-1 correspondence between
Plan and PlanState), so Patch 3 has to do some additional work to
propagate the workmem (estimate) from Path to Plan. You can see
existing examples of similar work inside file createplan.c. Creating a
Plan from a Path is not generally as simple as just copying the Path's
fields over; there are lots of special cases.
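
In the simple case -- a Plan node built directly from its Path --
propagating the estimate amounts to one more line in
copy_generic_path_info(), along the lines of your suggestion
(abridged; "workmem" is the assumed field name):

static void
copy_generic_path_info(Plan *dest, Path *src)
{
    dest->startup_cost = src->startup_cost;
    dest->total_cost = src->total_cost;
    dest->plan_rows = src->rows;
    dest->plan_width = src->pathtarget->width;
    dest->parallel_aware = src->parallel_aware;
    dest->parallel_safe = src->parallel_safe;

    dest->workmem = src->workmem;   /* new: working-memory estimate, kB */
}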

Although Patch 3 sets workmem (estimate) on the Plan, inside
createplan.c, Patch 2 doesn't set workmem_limit inside createplan.c.
An earlier draft of the patchset *did* set it there, but because of
all the special casing in createplan.c, this ended up becoming
difficult to understand and maintain.

> Either way, the node_work_mem settings would end up
> being enforced by the executor by tracking memory usage and spilling
> when it exceeds node->ps.ss.plan.node_work_mem.

More precisely: the settings are enforced by the executor, by having
each PlanState's ExecInitNode() override refer to the
Plan.workmem_limit field, rather than the corresponding GUC(s). This
means that the final workmem_limit needs to be set before
ExecInitNode() is called.
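
For example, for Sort, the executor-side change amounts to sizing the
tuplesort from the Plan field instead of the work_mem GUC -- roughly
(paraphrasing nodeSort.c; not the patch's literal diff):

tuplesortstate = tuplesort_begin_heap(tupDesc,
                                      plannode->numCols,
                                      plannode->sortColIdx,
                                      plannode->sortOperators,
                                      plannode->collations,
                                      plannode->nullsFirst,
                                      plannode->plan.workmem_limit, /* was: work_mem */
                                      NULL,
                                      TUPLESORT_NONE);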

> But your patch 0002 and 0003 are entirely different from what I
> expected. I just don't understand why we need anything as complex or
> specific as ExecAssignWorkMem(). If we just add it at the time the Path
> is created, and then propagate it to the plan with
> copy_generic_path_info(), that would be a lot less code. What am I
> missing?

Patches 2 and 3 are as you described above. I have been trying to
understand what you mean by "a lot less code," and I think two things
about these patches stand out to you:

 1. Patch 2 performs its own Plan tree traversal, in
ExecAssignWorkMem(), instead of relying on the existing traversal in
create_plan(). I outlined the reasons for this decision above: because
of point (2), immediately below, embedding Patch 2's logic into
create_plan() ended up making the code much harder to follow, so I
broke the traversal out into its own (very simple) ExecAssignWorkMem()
function (see the sketch after this list).

 2. Patches 2 and 3 necessarily contain logic for various special
cases, where the workmem (estimate) and workmem_limit are not as
simple as in your example above. (But your example is actually not as
simple as you make it out to be, as discussed above and below...)
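
For (1), the traversal itself is short; its shape is roughly this (a
hand-written sketch, not the patch's actual implementation):

static void
ExecAssignWorkMem(Plan *plan)
{
    if (plan == NULL)
        return;

    switch (nodeTag(plan))
    {
        case T_Sort:
        case T_IncrementalSort:
        case T_Material:
            plan->workmem_limit = work_mem;
            break;
        case T_Hash:
            plan->workmem_limit = (int) (get_hash_memory_limit() / 1024);
            break;
            /* ... Agg, SetOp, RecursiveUnion, etc., with their own rules ... */
        default:
            break;      /* nodes that don't use working memory stay at 0 */
    }

    ExecAssignWorkMem(plan->lefttree);
    ExecAssignWorkMem(plan->righttree);
    /* a real traversal must also visit Append/MergeAppend children,
     * SubqueryScan subplans, init plans, and so on */
}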

To understand (2), it helps to have a sense for how, for example, file
createplan.c has already been extended to handle special cases. We
already have functions like copy_plan_costsize(),
make_unique_from_sortclauses(), make_sort_from_groupcols(), etc. Just
from the function names, it's clear that we routinely generate new
Plan nodes that don't have corresponding Paths. The existing function
create_groupingsets_plan() is 150 lines of code, because it turns a
single GroupingSetsPath into a chain of Agg nodes. And, of course,
there's the whole logic around turning AlternativeSubPlans into
SubPlans, inside file setrefs.c!

So, generally, Patches 2 and 3 do exactly what you expect, but they
handle (existing) special cases, which ends up requiring more code. If
PostgreSQL didn't have these special cases, I think the patches would
be as short as you expect. For example, if Agg nodes behaved as in
your example, quoted at the top of this email, then we wouldn't need
Patch 2's additional logic to assign workmem_limit to Aggs (and we
wouldn't need the corresponding logic in Patch 3, to assign workmem
(estimate) to Aggs, either).

But Aggs aren't as simple as in your example -- they have Hash limits
and Sort limits; they have a side-chain of Agg nodes; they have input
sets they need to Sort; etc. And so we need a couple dozen lines of
code to handle them.

Thanks for the feedback,
James Hunter


