Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators - Mailing list pgsql-hackers
From: James Hunter
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Msg-id: CAJVSvF6bDfS1_MoqH+RNHFXTo4ZNkcHLj+fgZ+C0RSL2sVeQzQ@mail.gmail.com
In response to: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

On Wed, Feb 26, 2025 at 4:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> My idea was that we'd attach work_mem to each Path node and Plan node
> at create time. For example, in create_agg_path() it could do:
>
>     pathnode->path.node_work_mem = work_mem;
>
> And then add to copy_generic_path_info():
>
>     dest->node_work_mem = src->node_work_mem;
>
> (and similarly for other nodes; at least those that care about
> work_mem)
>
> Then, at execution time, use node->ps.ss.plan.node_work_mem instead of
> work_mem.

This is essentially what Patches 2 and 3 do. Some comments:

First, there's no need to set the workmem_limit at Path time, since it's not needed until the Plan is initialized into a PlanState. So I set it on the Plan, but not on the Path.

Second, the logic to assign a workmem_limit to the Agg node is a bit more complicated than in your example, because the Agg could be either a Hash or a Sort. If it's a Hash, it gets work_mem * hash_mem_multiplier (i.e., get_hash_memory_limit()); and if it's a Sort, it gets either 0 or work_mem. We can adjust the logic so that it gets work_mem instead of 0, by pushing the complexity out of the original workmem_limit assignment and into later code blocks -- e.g., in an extension -- but we still need to decide whether the Agg is a Hash or a Sort. This is why Patch 2 does:

    switch (agg->aggstrategy)
    {
        case AGG_HASHED:
        case AGG_MIXED:

            /*
             * Because nodeAgg.c will combine all AGG_HASHED nodes into a
             * single phase, it's easier to store the hash working-memory
             * limit on the first AGG_{HASHED,MIXED} node, and set it to
             * zero for all subsequent AGG_HASHED nodes.
             */
            agg->plan.workmem_limit = is_first ?
                get_hash_memory_limit() / 1024 : 0;
            break;
        case AGG_SORTED:

            /*
             * Also store the sort-output working-memory limit on the first
             * AGG_SORTED node, and set it to zero for all subsequent
             * AGG_SORTED nodes.
             *
             * We'll need working memory to hold the "sort_out" only if this
             * isn't the last Agg node (in which case there's no one to sort
             * our output).
             */
            agg->plan.workmem_limit = *is_first_sort && !is_last ?
                work_mem : 0;

            *is_first_sort = false;
            break;
        default:
            break;
    }

Notice that the logic also sets the limit to 0 on certain Agg nodes -- this can be avoided, at the cost of added complexity later. The added complexity arises because, for example, for Hash Aggs, all Hash tables share the same overall workmem_limit. So any workmem_limit set on subsequent hash Agg nodes would be ignored, which means that setting that limit adds complexity by obscuring the underlying logic.

> Similarly, we could track the node_mem_wanted, which would be the
> estimated amount of memory the node would use if unlimited memory were
> available. I believe that's already known (or a simple calculation) at
> costing time, and we can propagate it from the Path to the Plan the
> same way.

And this is what Patch 3 does. As you point out, this is already known or, if not, it's a simple calculation. This is all that Patch 3 does.

> (A variant of this approach could carry the values into the PlanState
> nodes as well, and the executor would use that value instead.)

That's not needed, though, and it would violate existing PG conventions: we don't copy anything from Plan to PlanState, because the assumption is that the PlanState always has access to its corresponding Plan. (The reason we copy from Path to Plan, I believe, is that we drop all Paths, to save memory, since we generally have many more Paths than Plans.)
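To illustrate that convention concretely (this is just a sketch, not the patch's actual code; exec_workmem_limit() is a placeholder name): an ExecInit* function can read the limit straight off the Plan node, through the PlanState's plan pointer, so nothing needs to be copied into the PlanState:

    static int
    exec_workmem_limit(PlanState *planstate)
    {
        /* Every PlanState keeps a pointer back to its Plan node. */
        Plan       *plan = planstate->plan;

        /* workmem_limit is the field Patch 2 adds; in kB, like work_mem. */
        return plan->workmem_limit;
    }

An ExecInitSort()/ExecSort()-style caller would then pass that value to tuplesort_begin_heap(), where it passes work_mem today.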
> Extensions like yours could have a GUC like ext.query_work_mem and use
> planner_hook to modify plan after standard_planner is done, walking the
> tree and adjusting each Plan node's node_work_mem to obey
> ext.query_work_mem. Another extension might hook in at path generation
> time with set_rel_pathlist_hook or set_join_pathlist_hook to create
> paths with lower node_work_mem settings that total up to
> ext.query_work_mem.

I don't set workmem_limit at Path time, because it's not needed there; but I do set the workmem (estimate) at Path time, exactly so that future optimizer hooks could make use of a Path's workmem (estimate) to decide between different Paths.

Patch 3 sets workmem (estimate) on the Path and copies it to the Plan. As you know, there's deliberately not a 1-1 correspondence between Path and Plan (the way there is generally a 1-1 correspondence between Plan and PlanState), so Patch 3 has to do some additional work to propagate the workmem (estimate) from Path to Plan. You can see existing examples of similar work inside file createplan.c. Creating a Plan from a Path is not generally as simple as just copying the Path's fields over; there are lots of special cases.

Although Patch 3 sets workmem (estimate) on the Plan, inside createplan.c, Patch 2 doesn't set workmem_limit inside createplan.c. An earlier draft of the patch set *did* set it there, but, because of all the special-casing in createplan.c, that ended up being difficult to understand and maintain.

> Either way, the node_work_mem settings would end up
> being enforced by the executor by tracking memory usage and spilling
> when it exceeds node->ps.ss.plan.node_work_mem.

More precisely: the settings are enforced by the executor, by having each PlanState's ExecInitNode() override refer to the Plan.workmem_limit field, rather than the corresponding GUC(s). This means that the final workmem_limit needs to be set before ExecInitNode() is called.

> But your patch 0002 and 0003 are entirely different from what I
> expected. I just don't understand why we need anything as complex or
> specific as ExecAssignWorkMem(). If we just add it at the time the Path
> is created, and then propagate it to the plan with
> copy_generic_path_info(), that would be a lot less code. What am I
> missing?

Patches 2 and 3 are as you described above. I have been trying to understand what you mean by "a lot less code," and I think two things about these patches stand out to you:

1. Patch 2 performs its own Plan-tree traversal, in ExecAssignWorkMem(), instead of relying on the existing traversal in create_plan(). I outlined the reasons for this decision above. Because of point 2, immediately below, embedding Patch 2's logic into create_plan() ended up making the code much harder to follow, so I broke the traversal out into its own (very simple) ExecAssignWorkMem() function.

2. Patches 2 and 3 necessarily contain logic for various special cases, where the workmem (estimate) and workmem_limit are not as simple as in your example above. (But your example is actually not as simple as you make it out to be, as discussed above and below...)

To understand (2), it helps to have a sense of how, for example, createplan.c has already been extended to handle special cases. We already have functions like copy_plan_costsize(), make_unique_from_sortclauses(), make_sort_from_groupcols(), etc. Just from the function names, it's clear that we routinely generate new Plan nodes that don't have corresponding Paths.
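For instance -- sketch only, with placeholder names, not the actual createplan.c or patch code -- a Sort node manufactured directly, in the style of make_sort_from_groupcols(), has no Path behind it, so its workmem (estimate) has to be computed on the spot rather than copied from a Path:

    static Sort *
    make_sort_for_grouping(Plan *lefttree, int numCols,
                           AttrNumber *sortColIdx, Oid *sortOperators,
                           Oid *collations, bool *nullsFirst)
    {
        Sort       *node = makeNode(Sort);

        node->plan.targetlist = lefttree->targetlist;
        node->plan.lefttree = lefttree;
        node->numCols = numCols;
        node->sortColIdx = sortColIdx;
        node->sortOperators = sortOperators;
        node->collations = collations;
        node->nullsFirst = nullsFirst;

        /*
         * No Path to copy from: estimate the memory needed to sort the
         * input, from the input's estimated rows and width.
         * (estimate_sort_workmem() is a hypothetical helper.)
         */
        node->plan.workmem = estimate_sort_workmem(lefttree->plan_rows,
                                                   lefttree->plan_width);

        return node;
    }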
Existing function create_groupingsets_plan() is 150 lines of code, because it turns a single GroupingSetsPath into a chain of Agg nodes. And, of course, there's the whole logic around turning AlternativeSubPlans into SubPlans, inside file setrefs.c!

So, generally, Patches 2 and 3 do exactly what you expect, but they handle (existing) special cases, which ends up requiring more code. If PostgreSQL didn't have these special cases, I think the patches would be as short as you expect.

For example, if Agg nodes behaved as in your example, quoted at the top of this email, then we wouldn't need Patch 2's additional logic to assign workmem_limit to Aggs (and we wouldn't need the corresponding logic in Patch 3 to assign workmem (estimate) to Aggs, either). But Aggs aren't as simple as in your example -- they have Hash limits and Sort limits; they have a side-chain of Agg nodes; they have input sets they need to Sort; etc. And so we need a couple dozen lines of code to handle them.

Thanks for the feedback,
James Hunter
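P.S. Re: your suggestion, above, that an extension could use planner_hook to enforce something like ext.query_work_mem -- here is a rough, untested sketch of what such a hook might look like on top of these patches. It assumes the workmem / workmem_limit Plan fields added by Patches 2 and 3; ext_query_work_mem, sum_workmem(), and limit_workmem() are placeholder names; and a real version would also have to visit special child nodes (Append's appendplans, subplans, etc.) and register the GUC via DefineCustomIntVariable():

    #include "postgres.h"

    #include "fmgr.h"
    #include "nodes/plannodes.h"
    #include "optimizer/planner.h"

    PG_MODULE_MAGIC;

    void        _PG_init(void);

    static planner_hook_type prev_planner_hook = NULL;

    /* A real extension would register this via DefineCustomIntVariable(). */
    static int  ext_query_work_mem = 1024 * 1024;   /* kB */

    /* Sum the planner's workmem estimates (Patch 3's field), in kB. */
    static double
    sum_workmem(Plan *plan)
    {
        if (plan == NULL)
            return 0.0;
        return plan->workmem +
            sum_workmem(plan->lefttree) +
            sum_workmem(plan->righttree);
    }

    /* Give each memory-consuming node its estimate, scaled by "factor". */
    static void
    limit_workmem(Plan *plan, double factor)
    {
        if (plan == NULL)
            return;
        if (plan->workmem_limit > 0)    /* node owns a working-memory budget */
            plan->workmem_limit = Max(64, (int) (plan->workmem * factor));
        limit_workmem(plan->lefttree, factor);
        limit_workmem(plan->righttree, factor);
    }

    static PlannedStmt *
    ext_planner(Query *parse, const char *query_string, int cursorOptions,
                ParamListInfo boundParams)
    {
        PlannedStmt *stmt;
        double      wanted;

        stmt = prev_planner_hook ?
            prev_planner_hook(parse, query_string, cursorOptions, boundParams) :
            standard_planner(parse, query_string, cursorOptions, boundParams);

        /* Shrink per-node limits so the total fits the query-wide budget. */
        wanted = sum_workmem(stmt->planTree);
        if (wanted > ext_query_work_mem)
            limit_workmem(stmt->planTree,
                          (double) ext_query_work_mem / wanted);

        return stmt;
    }

    void
    _PG_init(void)
    {
        prev_planner_hook = planner_hook;
        planner_hook = ext_planner;
    }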