Re: Batching in executor - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Batching in executor
Date
Msg-id CA+HiwqGM0ZTeVHicSkGnCp-2U-jvU-KBQCkPJ0N7nAj_c2LjZg@mail.gmail.com
In response to Re: Batching in executor  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Tue, Oct 28, 2025 at 1:18 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 10/27/25 08:24, Amit Langote wrote:
> > Thank you for reviewing the patch and taking the time to run those
> > experiments. I appreciate the detailed feedback and questions.  I also
> > apologize for my late reply, I spent perhaps way too much time going
> > over your index prefetching thread trying to understand the notion of
> > batching that it uses and getting sidelined by other things while
> > writing this reply.
>
> Cool! Now you can do a review of the index prefetch patch ;-)

Would love to and I'm adding that to my list.  :)

> >> How far ahead have you thought about these capabilities? I was wondering
> >> about two things in particular. First, at which point do we have to
> >> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> >> I get it that you want to enable passing batches between nodes, but
> >> would those use the same "format" as the underlying scan node, or some
> >> generic one? Second, will it be possible to execute expressions on the
> >> custom batches (i.e. on "compressed data")? Or is it necessary to
> >> "materialize" the batch into regular tuple slots? I realize those may
> >> not be there "now" but maybe it'd be nice to plan for the future.
> >
> > I have been thinking about those future capabilities. Currently, the
> > patch keeps tuples in the TAM-specific batch format up until they need
> > to be consumed by a node that doesn’t understand that format or has
> > not been modified to invoke the TAM callbacks to decode it.  In the
> > current patch, that means we materialize to regular TupleTableSlots at
> > nodes that require it (for example, the scan node reading from TAM
> > needing to evaluate quals, etc.). However, the intention is to allow
> > batches to be passed through as many nodes as possible without
> > materialization, ideally using the same format produced by the scan
> > node all the way up until reaching a node that can only work with
> > tuples in TupleTableSlots.
> >
> > As for executing expressions directly on the custom batch data: that’s
> > something I would like to enable in the future. Right now, expressions
> > (quals, projections, etc.) are evaluated after materializing into
> > normal tuples in TupleTableSlots stored in TupleBatch, because the
> > expression evaluation code isn’t yet totally batch-aware and is still far
> > from being able to do things like operate on compressed data in its native form.
> > Patches 0004-0008 do try to add batch-aware expression evaluation but
> > that's just a prototype.  In the long term, the goal is to allow
> > expression evaluation on batch data (for example, applying a WHERE
> > clause or aggregate transition directly on a columnar batch without
> > converting it to heap tuples first). This will require significant new
> > infrastructure (perhaps specialized batch-aware expression operators
> > and functions), so it's not in the current patch, but I agree it's
> > important to plan for it. The current design doesn’t preclude it, it
> > lays some groundwork by introducing the batch abstraction -- but fully
> > supporting that will be future work.
> >
> > That said, one area I’d like to mention while at it, especially to
> > enable native execution on compressed or columnar batches, is giving
> > the table AM more control over how expression evaluation is performed
> > on its batch data. In the current patch, the AM can provide a
> > materialize function via TupleBatchOps, but that always produces an
> > array of TupleTableSlots stored in the TupleBatch, not an opaque
> > representation that remains under AM control. Maybe that's not bad for
> > a v1 patch.
>
> I think materializing into a batch of TupleTableSlots (and then doing
> the regular expression evaluation) seems perfectly fine for v1. It's the
> simplest fallback possible, and we'll need it anyway if overriding the
> expression evaluation will be optional (which I assume it will be?).

Yes.  The ability to materialize into TupleTableSlots won't be
optional for the table AM's BatchOps.  Converting to other formats
would be.
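
To sketch what I mean, here's roughly the shape I picture for the
callback table (illustrative only -- the mandatory materialize
callback is what the patch has today, the rest is not actual patch
code):

struct TupleBatch;      /* AM-specific batch, opaque to the executor */

typedef struct TupleBatchOps
{
    /*
     * Required: decode the batch into the array of TupleTableSlots
     * stored inside the TupleBatch, so the executor always has a
     * fallback path.
     */
    void    (*materialize) (struct TupleBatch *batch);

    /*
     * Optional: convert the batch into some other representation
     * (say, column vectors) without going through slots.  NULL means
     * the executor falls back to materialize() plus regular
     * evaluation.
     */
    bool    (*convert) (struct TupleBatch *batch, int target_format);
} TupleBatchOps;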

> > When evaluating expressions over a batch, a BatchVector
> > is built by looping over these slots and invoking the standard
> > per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> > While that enables batch-style EEOPs for qual evaluation and aggregate
> > transition (and is already a gain over per-row evaluation), it misses
> > the opportunity to leverage any batch-specific optimizations the AM
> > could offer, such as vectorized decoding or filtering over compressed
> > data, and other AM optimizations for getting only the necessary
> > columns out possibly in a vector format.
> >
>
> I'm not sure about this BatchVector thing. I haven't looked into that
> very much, I'd expect the construction to be more expensive than the
> benefits (compared to just doing the materialize + regular evaluation),
> but maybe I'm completely wrong. Or maybe we could keep the vector
> representation for multiple operations? No idea.

Constructing the BatchVector does require looping over the batch and
deforming each tuple, typically via getsomeattrs(). So yes, there’s an
up-front cost similar to materialization. But the goal is to amortize
that by enabling expression evaluation to run in a tight loop over
column vectors, avoiding repeated jumps into slot/AM code for each
tuple and each column. That can reduce branching and improve locality.

In its current form, the BatchVector is ephemeral -- it's built just
before expression evaluation and discarded after. But your idea of
reusing the same vector across multiple operations is interesting.
That would let us spread out the construction cost even further and
might be necessary to justify the overhead fully in some cases. I’ll
keep that in mind.
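
To make that concrete, the shape is roughly the following simplified
sketch (not the patch's actual code; the BatchVector layout and the
helper names are illustrative, only slot_getsomeattrs() and the slot
fields are existing executor APIs):

#include "postgres.h"
#include "executor/tuptable.h"

/* Illustrative single-column vector; the patch's BatchVector differs. */
typedef struct BatchVector
{
    int         ntuples;
    Datum      *values;
    bool       *isnull;
} BatchVector;

/* Deform one attribute of every tuple in the batch into the vector. */
static void
build_batch_vector(TupleTableSlot **slots, int ntuples, AttrNumber attno,
                   BatchVector *vec)
{
    for (int i = 0; i < ntuples; i++)
    {
        slot_getsomeattrs(slots[i], attno);     /* per-tuple deforming */
        vec->values[i] = slots[i]->tts_values[attno - 1];
        vec->isnull[i] = slots[i]->tts_isnull[attno - 1];
    }
    vec->ntuples = ntuples;
}

/* A qual like "x > 42" then becomes one tight loop over flat arrays. */
static int
eval_int4_gt_const(const BatchVector *vec, int32 cmpval, bool *keep)
{
    int         nmatch = 0;

    for (int i = 0; i < vec->ntuples; i++)
    {
        keep[i] = !vec->isnull[i] &&
            DatumGetInt32(vec->values[i]) > cmpval;
        nmatch += keep[i];
    }
    return nmatch;
}

The first loop is the cost we pay either way; the hoped-for payoff is
that the second loop touches only two flat arrays, and reusing the
vector across operations would amortize the first loop further.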

> But it seems like a great area for experimenting ...

Yep.

> > I’m considering extending TupleTableSlotOps with a batch-aware variant
> > of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> > AMs can populate column vectors (e.g., BatchVector) directly from
> > their native format. That would allow bypassing slot materialization
> > entirely and plug AM-provided decoding logic directly into the
> > executor’s batch expression paths. This isn’t implemented yet, but I
> > see it as a necessary step toward supporting fully native expression
> > evaluation over compressed or columnar formats. I’m not yet sure if
> > TupleTableSlotOps is the right place for such a hook, it might belong
> > elsewhere in the abstraction, but exposing a batch-aware interface for
> > this purpose seems like the right direction.
> >
>
> No opinion. I don't see it as a necessary prerequisite for the other
> parts of the patch series, but maybe the BatchVector really helps, and
> then this would make perfect sense. I'm not sure there's a single
> "correct" sequence in which to do these improvements, it's always a
> matter of opinion.

Yes, I think we can come back to this later.
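
For the archives, the hook described above could be as simple as a
function pointer along these lines (nothing below exists in the
posted patches; the names are placeholders):

struct TupleBatch;          /* AM-specific batch, as in the patch */
struct BatchVector;         /* per-column vectors, as in the patch */

/*
 * Hypothetical batch-aware deforming entry point, whether it ends up
 * in TupleTableSlotOps or somewhere else in the abstraction.
 */
typedef void (*GetSomeAttrsBatch_function) (struct TupleBatch *batch,
                                            int natts,
                                            struct BatchVector *columns);

A columnar AM could implement it by pointing each column vector at its
already-decoded arrays, skipping per-tuple slots entirely, while heapam
would presumably fall back to the per-slot loop.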

> >> The other option would be to "create batches" during execution, say by
> >> having a new node that accumulates tuples, builds a batch and sends it
> >> to the node above. This would help both in cases when either the lower
> >> node does not produce batches at all, or the batches are too small (due
> >> to filtering, aggregation, ...). Of course, it'd only win if this
> >> increases efficiency of the upper part of the plan enough to pay for
> >> building the batches. That can be a hard decision.
> >
> > Yes, introducing a dedicated executor node to accumulate and form
> > batches on the fly is an interesting idea, I have thought about it and
> > even mentioned it in passing in the pgconf.dev unconference. This
> > could indeed cover scenarios where the data source (a node) doesn't
> > produce batches (e.g., a non-batching node feeding into a
> > batching-aware upper node) or where batches coming from below are too
> > small to be efficient. The current patch set doesn’t implement such a
> > node; I focused on enabling batching at the scan/TAM level first. The
> > cost/benefit decision for a batch-aggregator node is tricky, as you
> > said. We’d need a way to decide when the overhead of gathering tuples
> > into a batch is outweighed by the benefits to the upper node. This
> > likely ties into costing or adaptive execution decisions. It's
> > something I’m open to exploring in a future iteration, perhaps once we
> > have more feedback on how the existing batching performs in various
> > scenarios. It might also require some planner or executor smarts
> > (maybe the executor can decide to batch on the fly if it sees a
> > pattern of use, or the planner could insert such nodes when
> > beneficial).
> >
>
> Yeah, those are good questions. I don't have a clear idea how we should
> decide when to do this batching. Costing during planning is the
> "traditional" option, with all the issues (e.g. it requires a reasonably
> good cost model). Another option would be some sort of execution-time
> heuristics - but then which node would be responsible for building the
> batches (if we didn't create them during planning)?
>
> I agree it makes sense to focus on batching at the TAM/scan level for
> now. That's a pretty big project already.

Right -- batching at the TAM/scan level is already a sizable project,
especially given its possible interaction with the prefetching work. I
think it's best to focus design effort there and on batched expression
evaluation first, and only revisit batch-creation nodes once that
groundwork is in place.

> >> In fact, how shall the optimizer decide whether to use batching? It's
> >> one thing to decide whether a node can produce/consume batches, but
> >> another thing is "should it"? With a node that "builds" a batch, this
> >> decision would apply to even more plans, I guess.
> >>
> >> I don't have a great answer to this, it seems like an incredibly tricky
> >> costing issue. I'm a bit worried we might end up with something too
> >> coarse, like "jit=on" which we know is causing problems (admittedly,
> >> mostly due to a lot of the LLVM work being unpredictable/external). But
> >> having some "adaptive" heuristics (like the gradual ramp up) might make
> >> it less risky.
> >
> > I agree that deciding when to use batching is tricky. So far, the
> > patch takes a fairly simplistic approach: if a node (particularly a
> > scan node) supports batching, it just does it, and other parts of the
> > plan will consume batches if they are capable. There isn’t yet a
> > nuanced cost-based decision in the planner for enabling batching. This
> > is indeed something we’ll have to refine. We don’t want to end up with
> > a blunt on/off GUC that could cause regressions in some cases.
> >
> > One idea is to introduce costing for batching: for example, estimate
> > the per-tuple savings from batching vs the overhead of materialization
> > or batch setup. However, developing a reliable cost model for that
> > will take time and experimentation, especially with the possibility of
> > variable batch sizes or adaptive behavior. Not to mention, that will
> > be adding one more dimension to the planner's costing model, making the
> > planning more expensive and unpredictable.  In the near term, I’m fine
> > with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> > decide on batching, but that’s perhaps not a long-term solution.
> >
>
> Yeah, the cost model is going to be hard, because this depends on so
> much low-level plan/hardware details. Like, the TAM may allow execution
> on compressed data / leverage vectorization, .... But maybe the CPU does
> not do that efficiently? There's so many unknown unknowns ...
>
> Considering we still haven't fixed the JIT cost model, maybe it's better
> to not rely on it too much for this batching patch? Also, all those
> details contradict the idea that cost models are a simplified model of
> the reality.

Yeah, totally agreed -- the complexity and unpredictability here are
real, and your point about JIT costing is a good reminder not to
over-index on planner models for now.

> > I share your inclination that adaptive heuristics might be the safer
> > path initially. Perhaps the executor can decide to batch or not batch
> > based on runtime conditions. The gradual ramp-up of batch size is one
> > such adaptive approach. We could also consider things like monitoring
> > how effective batching is (are we actually processing full batches or
> > frequently getting cut off?) and adjust behavior. These are somewhat
> > speculative ideas at the moment, but the bottom line is I’m aware we
> > need a smarter strategy than a simple switch. This will likely evolve
> > as we test the patch in more scenarios.
> >
>
> I think the big question is how much can the batching change the
> relative cost of two plans (I mean, actual cost, not just estimates).
>
> Imagine plans P1 and P2, where
>
>    cost(P1) < cost(P2) = cost(P1) + delta
>
> where "delta" is small (so P1 is faster, but not much).  If we
> "batchify" the plans into P1' and P2', can this happen?
>
>   cost(P1') >> cost(P2')
>
> That is, can the "slower" plan P2 benefit much more from the batching,
> making it significantly faster?
>
> If this is unlikely, we could entirely ignore batching during planning,
> and only do that as post-processing on the selected plan, or perhaps
> even just during execution.
>
> OTOH that's what JIT does, and we know it's not perfect - but that's
> mostly because JIT has rather unpredictable costs when enabling. Maybe
> batching doesn't have that.

That’s an interesting scenario. I suspect batching (even with SIMD)
won’t usually flip plan orderings that dramatically -- i.e., turning
the clearly slower plan into the faster one -- though I could be
wrong. But I agree with the conclusion: this supports treating
batching as an executor concern, at least initially. Might be worth
seeing if there’s any relevant guidance in systems literature too.

> >> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> >> hard to say. It'd be good to be able to experiment with different
> >> values, so I suggest we make this a GUC and not a hard-coded constant.
> >
> > Yeah, I was thinking the same while testing -- the optimal batch size
> > might vary by workload or hardware, and 64 was a somewhat arbitrary
> > starting point. I will make the batch size limit configurable
> > (probably as a GUC executor_batch_tuples, maybe only developer-focused
> > at first). That will let us and others experiment easily with
> > different batch sizes to see how it affects performance. It should
> > also help with your earlier point: for example, on a machine where 64
> > is too low or too high, we can adjust it without recompiling. So yes,
> > I'll add a GUC for the batch size in the next version of the patch.
> >
>
> +1 to have developer-only GUC for testing. But the goal should be to not
> expect users to tune this.

Yes.
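
Concretely, the entry I have in mind would look roughly like this in
guc_tables.c (a sketch; executor_batch_tuples is the name I floated
above, and the default and bounds are placeholders):

/* ConfigureNamesInt[] entry; &executor_batch_tuples would be the new
 * C variable added by the patch. */
{
    {"executor_batch_tuples", PGC_USERSET, DEVELOPER_OPTIONS,
        gettext_noop("Sets the maximum number of tuples per executor batch."),
        NULL,
        GUC_NOT_IN_SAMPLE
    },
    &executor_batch_tuples,
    64, 1, 8192,
    NULL, NULL, NULL
},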

> >> As for what to add to explain, I'd start by adding info about which
> >> nodes are "batched" (consuming/producing batches), and some info about
> >> the batch sizes. An average size, maybe a histogram if you want to be a
> >> bit fancy.
> >
> > Adding more information to EXPLAIN is a good idea. In the current
> > patch, EXPLAIN does not show anything about batching, but it would be
> > very helpful for debugging and user transparency to indicate which
> > nodes are operating in batch mode.  I will update EXPLAIN to mark
> > nodes that produce or consume batches. Likely I’ll start with
> > something simple like an extra line or tag for a node, e.g., "Batch:
> > true (avg batch size 64)" or something along those lines. An average
> > batch size could be computed if we have instrumentation, which would
> > be useful to see if, say, the batch sizes ended up smaller due to
> > LIMIT or other factors. A full histogram might be more detail than
> > most users need, but I agree even just knowing average or maximum
> > batch size per node could be useful for performance analysis. I'll
> > implement at least the basics for now, and we can refine it (maybe add
> > more stats) if needed.
>
> +1 to start with something simple
>
> >
> > (I had added a flag in the EXPLAIN output at one point, but removed it
> > due to finding the regression output churn too noisy, though I
> > understand I'll have to bite the bullet at some point.)
> >
>
> Why would there be regression churn, if the option is disabled by default?

executor_batching is on by default in my patch, so a seq scan will
use batching whenever none of the features that prevent it are
present in the query, which is the case for a huge number of plans
appearing in the regression suite output.
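
As for the EXPLAIN annotation discussed above, I'm picturing
something along these lines using the existing ExplainProperty*
helpers (a sketch; the planstate counters referenced here are
hypothetical, standing in for whatever instrumentation the next
version adds):

/* Called from ExplainNode() for nodes that support batching. */
static void
show_batching_info(PlanState *planstate, ExplainState *es)
{
    /* ps_batched, ps_nbatches, ps_ntuples_batched: hypothetical fields */
    ExplainPropertyBool("Batched", planstate->ps_batched, es);

    if (es->analyze && planstate->ps_nbatches > 0)
        ExplainPropertyInteger("Avg Batch Size", NULL,
                               planstate->ps_ntuples_batched /
                               planstate->ps_nbatches,
                               es);
}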

> >> Now, numbers from some microbenchmarks:
> >>
> >> ...
> >> Perhaps I did something wrong. It does not surprise me this is somewhat
> >> CPU dependent. It's a bit sad the improvements are smaller for the newer
> >> CPU, though.
> >
> > Thanks for sharing your benchmark results -- that’s very useful data.
> > I haven’t yet finished investigating why there's a regression relative
> > to master when executor_batching is turned off. I re-ran my benchmarks
> > to include comparisons with master and did observe some regressions in
> > a few cases too, but I didn't see anything obvious in profiles that
> > explained the slowdown. I initially assumed it might be noise, but now
> > I suspect it could be related to structural changes in the scan code
> > -- for example, I added a few new fields in the middle of
> > HeapScanDescData, and even though the batching logic is bypassed when
> > executor_batching is off, it’s possible that change alone affects
> > memory layout or cache behavior in a way that penalizes the unbatched
> > path. I haven’t confirmed that yet, but it’s on my list to look into
> > more closely.
> >
> > Your observation that newer CPUs like the Ryzen may see smaller
> > improvements makes sense -- perhaps they handle the per-tuple overhead
> > more efficiently to begin with. Still, I’d prefer not to see
> > regressions at all, even in the unbatched case, so I’ll focus on
> > understanding and fixing that part before drawing conclusions from the
> > performance data.
> >
> > Thanks again for the scripts -- those will help a lot in narrowing things down.
>
> If needed, I can rerun the tests and collect additional information
> (e.g. maybe perf-stat or perf-diff would be interesting).

That would be nice to see, if you have the time, though maybe wait
until after I post a new version.

--
Thanks, Amit Langote


