Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators - Mailing list pgsql-hackers
From | James Hunter |
---|---|
Subject | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
Date | |
Msg-id | CAJVSvF5S89XNaH1Pg40DffXD3HEFKxEe8Rew3HviyymOYp5X+w@mail.gmail.com Whole thread Raw |
In response to | Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators (James Hunter <james.hunter.pg@gmail.com>) |
Responses |
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators |
List | pgsql-hackers |
Attached please find the patch set I mentioned, above, in [1]. It consists of 4 patches that serve as the building blocks for, and a prototype of, the "query_work_mem" GUC I proposed:

* Patch 1 captures the optimizer's estimate of how much working memory a particular Plan node would need, to avoid spilling, and stores it on the Plan, next to cost, etc. It also adds a new "work_mem on" option to the EXPLAIN command, to display this working-memory estimate. The estimate gives the customer a sense of how much working memory a particular query will actually use, and also enables an extension (e.g., Patch 4) to assign working-memory limits, per exec node, intelligently. Patch 1 doesn't change any user-visible behavior, except for displaying workmem estimates via EXPLAIN, when the new "work_mem on" option is specified.

* Patch 2 is a prerequisite for Patches 3 and 4. It maintains a subPlan list on the Plan node, next to the existing initPlan list, to store (pointers to) regular SubPlans. The existing initPlan list is needed because otherwise there's no way to find those SubPlans; this new subPlan list hasn't been needed before now, because every SubPlan on it appears somewhere inside the Plan node's expressions. It is needed now, however, because a SubPlan can use working memory (if it maintains one or two hash tables), so we need a way to find the SubPlan in order to set its working-memory limit -- and it doesn't make sense to walk through all the Plan node's expressions a second time, after we've finalized the plan. Instead, Patch 2 copies regular SubPlans to this new list, inside setrefs.c, so we can find them and assign working memory to them later. Patch 2 doesn't change any user-visible behavior -- it just adds some internal bookkeeping.

* Patch 3 modifies all existing exec nodes to read their working-memory limit off their Plan, rather than off the work_mem GUC. It adds a new function, ExecAssignWorkMem(), which gets called from InitPlan(), immediately before we start calling ExecInitNode(). This way, the executor could assign different working-memory limits based on global memory conditions; but this patch just preserves existing behavior and copies these limits from the GUCs. Patch 3 also extends the new "work_mem on" EXPLAIN option to show the working-memory limit -- the limit already imposed by PostgreSQL's work_mem and hash_mem_multiplier GUCs. Patch 3 copies this limit from these GUCs onto a new field stored on the Plan object, and modifies "EXPLAIN (work_mem on)" to read the limit off the Plan object and display it. Other than this change to EXPLAIN, Patch 3 doesn't change any user-visible behavior.

* Patch 4, finally, adds a hook to allow extensions to override ExecAssignWorkMem(). It also adds an extension, "workmem," that implements this hook and assigns working memory to individual execution nodes, based on a new workmem.query_work_mem GUC. This extension prevents queries from exceeding workmem.query_work_mem, while also handing out extra memory in the case where the query limit, from Patch 3, is < workmem.query_work_mem. In this way, Patch 4 avoids either undersubscribing or oversubscribing working memory for queries, which is the goal of my proposal. (A rough sketch of what such a hook implementation might look like appears right after this list.)
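To make the Patch 4 mechanism concrete, here is a minimal sketch of the kind of hook implementation I have in mind. The hook name, the workmem.query_work_mem GUC, and the Plan fields "workmem" / "workmem_limit" are the ones described in this email; the exact hook signature, the KB units, and the naive lefttree/righttree walk with purely proportional scaling are assumptions of this sketch, not the actual patch code (which also has to handle initPlans, the new subPlan list, Append children, parallel workers, and so on):

/*
 * Sketch only: a hypothetical "workmem" extension implementing the
 * Patch 4 hook.  Only the names ExecAssignWorkMem_hook,
 * workmem.query_work_mem, workmem, and workmem_limit come from the
 * patch set; everything else here is an illustrative assumption.
 */
#include "postgres.h"

#include <math.h>

#include "fmgr.h"
#include "nodes/plannodes.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Assumed to be declared by Patch 4; the signature is a guess. */
typedef void (*ExecAssignWorkMem_hook_type) (PlannedStmt *stmt);
extern ExecAssignWorkMem_hook_type ExecAssignWorkMem_hook;

static int  workmem_query_work_mem = 1048576;   /* KB, i.e. 1 GB */

/* Sum the optimizer's per-node estimates (Patch 1's "workmem" field, in KB). */
static double
workmem_total(Plan *plan)
{
    if (plan == NULL)
        return 0.0;
    return plan->workmem +
        workmem_total(plan->lefttree) +
        workmem_total(plan->righttree);
}

/* Scale each node's limit (Patch 3's "workmem_limit" field, in KB). */
static void
workmem_scale(Plan *plan, double factor)
{
    if (plan == NULL)
        return;
    plan->workmem_limit = (int) ceil(plan->workmem * factor);
    workmem_scale(plan->lefttree, factor);
    workmem_scale(plan->righttree, factor);
}

/*
 * Hook body: distribute workmem.query_work_mem across the plan tree in
 * proportion to each node's estimated need, so the query as a whole
 * neither undersubscribes nor oversubscribes working memory.
 */
static void
workmem_assign(PlannedStmt *stmt)
{
    double      need = workmem_total(stmt->planTree);

    if (need > 0.0)
        workmem_scale(stmt->planTree, workmem_query_work_mem / need);
}

void
_PG_init(void)
{
    DefineCustomIntVariable("workmem.query_work_mem",
                            "Working-memory budget for an entire query, in KB.",
                            NULL,
                            &workmem_query_work_mem,
                            1048576, 64, MAX_KILOBYTES,
                            PGC_USERSET, GUC_UNIT_KB,
                            NULL, NULL, NULL);
    ExecAssignWorkMem_hook = workmem_assign;
}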
Discussion follows--

A few operators currently do not honor their working-memory limits by spilling; these operators use tuple hash tables -- which don't spill -- without implementing their own "spill" logic. I would address these operators in a subsequent release. Note that Hash Agg and Hash Join both spill, as expected, so the major cases already work.

I store the working-memory estimate on both Path and Plan objects. In keeping with the PostgreSQL convention that a Path is an abstraction of one or more Plan nodes, the Path's working-memory estimate is "total," while the Plan's is "per data structure." So, if a SQL operator requires 2 sort buffers, the Path's working-memory estimate will be 2x the Plan's.

The Plan's estimate is "per data structure," because it will be used to determine the data structure's working-memory limit. Note that every operator (save one) currently treats work_mem [* hash_mem_multiplier] as a per-structure limit, rather than a per-operator limit. (The exception is Hash Agg, where all of the node's hash tables share the same memory limit; I have preserved this behavior in the Hash Agg's workmem and workmem_limit fields.)

The Plan's workmem estimate logically belongs on the Plan object (just as the Path's workmem logically belongs on the Path), while the workmem_limit logically belongs on the PlanState. This is why workmem_limit is set inside InitPlan() -- it's an execution-time limit, not a plan-time limit. However, the workmem_limit is stored, physically, on the Plan object, not the PlanState, to avoid a chicken-and-egg problem: (a) the PlanState is not created until ExecInitNode(); but (b) ExecInitNode() also typically creates the node's data structures, sized to workmem_limit. So we need a way to set workmem_limit after the Plan has been finalized, but before any exec nodes are initialized. Accordingly, we set this field on the Plan object, with the understanding that it doesn't "really" belong there.

A nice consequence of storing workmem_limit on the Plan object, rather than the PlanState, is that the limit automatically gets serialized and deserialized to parallel workers. This simplifies Patch 3 a little, since we can avoid executing ExecAssignWorkMem() on parallel workers; but it really benefits Patch 4, because it allows the ExecAssignWorkMem_hook to set a memory limit on the query, regardless of the number of parallel workers that get spawned at runtime.
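To illustrate the default behavior described above, here is a simplified sketch of what ExecAssignWorkMem() does when no hook is installed: copy the existing GUC-derived limits onto each Plan node before ExecInitNode() runs. The function's role and the workmem_limit field are from the patches; the node-type dispatch shown here is deliberately abbreviated (the real patch has to cover every exec node, initPlans, the new subPlan list, Hash Agg's shared limit, and so on), so treat this as a sketch rather than the patch's actual code:

/*
 * Simplified sketch of Patch 3's default assignment: copy the
 * GUC-derived limits onto each Plan node, in KB, before any exec node
 * is initialized.  The case list is abbreviated, and the recursion
 * ignores initPlans, subPlans, Append children, etc.
 */
#include "postgres.h"

#include "miscadmin.h"          /* work_mem, hash_mem_multiplier */
#include "nodes/plannodes.h"

static void
assign_workmem_default(Plan *plan)
{
    if (plan == NULL)
        return;

    switch (nodeTag(plan))
    {
        case T_Agg:
        case T_Hash:
        case T_SetOp:
        case T_Memoize:
            /* hash-based structures get work_mem * hash_mem_multiplier */
            plan->workmem_limit = (int) (work_mem * hash_mem_multiplier);
            break;
        default:
            /* everything else gets plain work_mem (in KB) */
            plan->workmem_limit = work_mem;
            break;
    }

    assign_workmem_default(plan->lefttree);
    assign_workmem_default(plan->righttree);
}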
Notes about individual SQL operators follow--

Patch 1 reuses existing optimizer logic, as much as possible, to calculate "workmem" -- rounded up to the nearest KB, and with a minimum of 64 KB. (The 64 KB minimum is because that's the smallest a customer can set the work_mem GUC, so it seems possible that some SQL operators rely on the assumption that they'll always get >= 64 KB of working memory.)

The PostgreSQL operators that use working memory can be placed into three categories:

1. Operators that use working memory, and which also cost the possibility of spilling. For these operators, Patch 1 just reports the "nbytes" estimate that the optimizer already produces.

1a. Sort and IncrementalSort (via cost_tuplesort()).
1b. HashJoin (via ExecChooseHashTableSize()).
1c. Material (via cost_material()).
1d. Unique (via either cost_sort() or cost_agg()).
1e. Grouping Sets (via create_groupingsets_path(), which calls cost_sort() and cost_agg()). NOTE: Grouping Sets can end up creating a chain of Agg plan nodes, each of which gets its own working-memory budget. Discussed below.
1f. Agg (via cost_agg()). NOTE: Discussed below.
1g. SetOp (via create_setop_path()). NOTE: A SetOp never spills; however, existing logic disables the SetOp "if it doesn't look like the hashtable will fit into hash_mem." It assumes the hash entry size is: MAXALIGN(leftpath->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader).

2. Operators that use working memory, but which do not currently cost the possibility of spilling, because the existing estimate is assumed to be unreliable. For these operators, Patch 1 just reports an "unreliable" estimate.

2a. FunctionScan.
2b. TableFuncScan.

3. Remaining operators that use working memory, but which for whatever reason do not currently cost the possibility of spilling. For these operators, Patch 1 computes and reports an estimate, based on logic appearing elsewhere in the code.

3a. RecursiveUnion (uses two Tuplestores, and possibly a TupleHashTable). Patch 1 uses nrterm to estimate one Tuplestore, rterm to estimate the second, and (if relevant) numGroups to estimate the number of hash buckets.
3b. CteScan (but *not* WorkTableScan; relies on cost_ctescan()). Patch 1 just uses rows * width, since the output is materialized into a Tuplestore.
3c. Memoize. Patch 1 uses ndistinct to estimate the number of hash buckets.
3d. WindowAgg. Patch 1 uses startup_tuples to estimate the number of tuples materialized in the Tuplestore.
3e. BitmapHeapScan. Although the TID bitmaps created by the bitmapqual's BitmapIndexScan nodes are limited to work_mem, these bitmaps lossify rather than spill. Patch 1 applies the inverse of tbm_calculate_entries() to the expected number of heap rows produced by the optimizer.
3f. SubPlan, if it requires a hash table (and possibly a hash-NULL table). Patch 1 uses rows and rows / 16, respectively, copying the existing logic in nodeSubplan.c and subselect.c. NOTE: Since we don't display SubPlans directly in EXPLAIN, Patch 1 includes this working-memory estimate along with the SubPlan's parent Plan node.

Final comments --

I think the attached patch set is useful by itself; but it also serves as a necessary building block for future work to manage query working memory dynamically. For example, the optimizer could be enhanced to trade off between a high-memory + low-cost plan and a low-memory + high-cost plan. The execution-time extension could be enhanced to adjust its query-working-memory limit based on current, global memory usage. And individual exec nodes could be enhanced to request additional working memory, via hook, if they discover they need to spill unexpectedly. (For example, this would be useful for serial Hash Joins.)

Questions / comments / suggestions / issues / complaints?

Thanks,
James Hunter

[1] https://www.postgresql.org/message-id/flat/CAJVSvF5kMi1-fwBDSv-9bvUjm83zUNEnL95B0s%2Bi08sKDL5-mA%40mail.gmail.com#099d95841bcf533ccdd51f3708bcebc7