Thread: track needed attributes in plan nodes for executor use
Hi,

I’ve been experimenting with an optimization that reduces executor
overhead by avoiding unnecessary attribute deformation. Specifically,
if the executor knows which attributes are actually needed by a plan
node’s targetlist and qual, it can skip deforming unused columns
entirely.

In a proof-of-concept patch, I initially computed the needed
attributes during ExecInitSeqScan(), by walking the plan’s qual and
targetlist, to support deforming only what’s needed when evaluating
expressions in ExecSeqScan() or one of its variants (I started with
SeqScan to keep the initial patch minimal). However, adding more work
to ExecInit* adds to executor startup cost, which we should generally
try to reduce. It also makes it harder to apply the optimization
uniformly across plan types.

I’d now like to propose computing the needed attributes at planning
time instead. This can be done at the bottom of create_plan_recurse(),
after the plan node has been constructed. A small helper like
record_needed_attrs(plan) can walk the node’s targetlist and qual
using pull_varattnos() and store the result in a new Bitmapset
*attr_used field in the Plan struct. System attributes returned by
pull_varattnos() can be filtered out during this step, since they’re
either not relevant to deformation or not performance sensitive.

This also lays the groundwork for a related executor-side optimization
that David Rowley suggested to me off-list. The idea is to remember,
in the TupleDesc, either the attribute number or the byte offset of
the first variable-length attribute. Then, if the minimum required
attribute (as provided by attr_used) lies before that, the executor
can safely jump directly to it using the cached offset, rather than
starting deformation from attno 0 as it currently does. That avoids
walking through fixed-length attributes that aren't needed --
specifically, it skips per-attribute alignment, null checking, and
offset tracking for unused columns -- which reduces CPU work and
avoids loading irrelevant tuple bytes into cache.

With both patches in place, heap tuple deforming can skip over unused
attributes entirely. For example, on a 30-column table where the first
15 columns are fixed-width, the query:

select sum(a_1) from foo where a_10 = $1;

which references only two fixed-width columns, ran nearly 2x faster
with the optimization in place (with heap pages prewarmed into
shared_buffers).

In more complex plans, for example those involving a Sort or Join
between the scan and the aggregation, the CPU cost of the intermediate
node may dominate, making the deforming-related savings at the scan
level less visible in overall performance. Still, I don't think that's
a reason to avoid enabling this optimization more broadly across plan
nodes.

I'll post the PoC patches and performance measurements shortly; I'm
posting this description in advance to get feedback on the proposed
direction and on where best to place attr_used.
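For concreteness, here is a rough, untested sketch of the helper I
have in mind. attr_used is the proposed new Plan field; the varno is
hardcoded to 1 for brevity, whereas the real thing would need to use
the scan node's scanrelid (and handle upper plan nodes, whose Vars use
OUTER_VAR/INNER_VAR), and a whole-row reference would in practice need
to mark all attributes as needed:

#include "postgres.h"
#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "nodes/plannodes.h"
#include "optimizer/optimizer.h"

static void
record_needed_attrs(Plan *plan)
{
    Bitmapset  *attrs = NULL;
    Bitmapset  *result = NULL;
    int         x = -1;

    /* Collect attnos referenced by the node's targetlist and qual. */
    pull_varattnos((Node *) plan->targetlist, 1, &attrs);
    pull_varattnos((Node *) plan->qual, 1, &attrs);

    /*
     * pull_varattnos() stores attnos offset by
     * FirstLowInvalidHeapAttributeNumber, so system attributes and
     * whole-row references map to attno <= 0 after shifting back.
     * Keep only user attributes, since system attributes are either
     * not relevant to deformation or not performance sensitive.
     */
    while ((x = bms_next_member(attrs, x)) >= 0)
    {
        AttrNumber  attno = x + FirstLowInvalidHeapAttributeNumber;

        if (attno > 0)
            result = bms_add_member(result, attno);
    }

    bms_free(attrs);
    plan->attr_used = result;   /* proposed new Plan field */
}

--
Thanks, Amit Langote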
On Fri, 11 Jul 2025 at 17:16, Amit Langote <amitlangote09@gmail.com> wrote:
> This also lays the groundwork for a related executor-side optimization
> that David Rowley suggested to me off-list. The idea is to remember,
> in the TupleDesc, either the attribute number or the byte offset of
> the first variable-length attribute. Then, if the minimum required
> attribute (as provided by attr_used) lies before that, the executor
> can safely jump directly to it using the cached offset, rather than
> starting deformation from attno 0 as it currently does.

That's interesting. If I understand correctly, this approach wouldn't
work if the first attribute is variable-length, right?

--
Regards, Japin Li
On 11/7/2025 10:16, Amit Langote wrote:
> Hi,
>
> I’ve been experimenting with an optimization that reduces executor
> overhead by avoiding unnecessary attribute deformation. Specifically,
> if the executor knows which attributes are actually needed by a plan
> node’s targetlist and qual, it can skip deforming unused columns
> entirely.

Sounds promising. However, I'm not sure we're on the same page: do you
mean an optimisation of slot_deform_heap_tuple() that provides it with
a bitmapset of requested attributes? In that case, the tuple header
would require one additional flag to indicate a not-null but unfilled
column, so that potential issues can be detected.

> In a proof-of-concept patch, I initially computed the needed
> attributes during ExecInitSeqScan(), by walking the plan’s qual and
> targetlist, to support deforming only what’s needed when evaluating
> expressions in ExecSeqScan() or one of its variants (I started with
> SeqScan to keep the initial patch minimal). However, adding more work
> to ExecInit* adds to executor startup cost, which we should generally
> try to reduce. It also makes it harder to apply the optimization
> uniformly across plan types.

I'm not sure that a lot of work would be added. In any case, cached
generic plan execution should avoid any unnecessary overhead.

> I’d now like to propose computing the needed attributes at planning
> time instead. This can be done at the bottom of create_plan_recurse(),
> after the plan node has been constructed. A small helper like
> record_needed_attrs(plan) can walk the node’s targetlist and qual
> using pull_varattnos() and store the result in a new Bitmapset
> *attr_used field in the Plan struct. System attributes returned by
> pull_varattnos() can be filtered out during this step, since they’re
> either not relevant to deformation or not performance sensitive.

Why did you choose the Plan node? This seems relevant only to Scan
nodes. Does it mean extending the CustomScan API?

> With both patches in place, heap tuple deforming can skip over unused
> attributes entirely. For example, on a 30-column table where the first
> 15 columns are fixed-width, the query:
>
> select sum(a_1) from foo where a_10 = $1;
>
> which references only two fixed-width columns, ran nearly 2x faster
> with the optimization in place (with heap pages prewarmed into
> shared_buffers).

It may be profitable. However, I often encounter cases where a table
has 20-40 columns, with fixed- and variable-width columns arbitrarily
mixed, and fetching a column by index from a 30-something-column table
is painful. In this area, Postgres may gain more by making
order_qual_clauses() account for column numbers in its costing -- in
[1] I attempted to explain how and why that should work.

[1] https://open.substack.com/pub/danolivo/p/on-expressions-reordering-in-postgres
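To clarify my first question above, here is roughly the shape of the
change I imagine in slot_deform_heap_tuple() -- a hypothetical,
heavily simplified sketch: it ignores the attcacheoff fast path,
assumes deforming always starts at attnum 0, and the function name and
extra parameter are placeholders:

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "nodes/bitmapset.h"

static void
slot_deform_heap_tuple_filtered(TupleTableSlot *slot, HeapTuple tuple,
                                uint32 *offp, int natts,
                                const Bitmapset *attrs_needed)
{
    TupleDesc   tupleDesc = slot->tts_tupleDescriptor;
    Datum      *values = slot->tts_values;
    bool       *isnull = slot->tts_isnull;
    HeapTupleHeader tup = tuple->t_data;
    bool        hasnulls = HeapTupleHasNulls(tuple);
    char       *tp = (char *) tup + tup->t_hoff;
    bits8      *bp = tup->t_bits;
    uint32      off = *offp;
    int         attnum;

    for (attnum = 0; attnum < natts; attnum++)
    {
        Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);

        if (hasnulls && att_isnull(attnum, bp))
        {
            values[attnum] = (Datum) 0;
            isnull[attnum] = true;
            continue;
        }

        isnull[attnum] = false;

        /* Alignment and length tracking still happen for every column. */
        if (thisatt->attlen == -1)
            off = att_align_pointer(off, thisatt->attalign, -1, tp + off);
        else
            off = att_align_nominal(off, thisatt->attalign);

        /*
         * The difference: materialize the datum only if the column is
         * requested.  Otherwise values[attnum] stays unfilled although
         * isnull[attnum] says not-null -- this is exactly where the
         * extra "unfilled" flag I mentioned would be needed, so that a
         * later access sees that flag instead of reading garbage.
         */
        if (bms_is_member(attnum + 1, attrs_needed))
            values[attnum] = fetchatt(thisatt, tp + off);

        off = att_addlength_pointer(off, thisatt->attlen, tp + off);
    }

    *offp = off;
}

--
regards, Andrei Lepikhov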