
From Amit Langote
Subject Re: Batching in executor
Date
Msg-id CA+HiwqGqeS_94wxiYa8VpymiR_OtFPKDSpX+Me=MYWO45f5yig@mail.gmail.com
In response to Re: Batching in executor  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi Tomas,

On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi Amit,
>
> Thanks for the patch. I took a look over the weekend, and done a couple
> experiments / benchmarks, so let me share some initial feedback (or
> rather a bunch of questions I came up with).

Thank you for reviewing the patch and taking the time to run those
experiments. I appreciate the detailed feedback and questions. I also
apologize for my late reply; I spent perhaps way too much time going
over your index prefetching thread trying to understand the notion of
batching it uses, and got sidetracked by other things while writing
this reply.

> I'll start with some general thoughts, before going into some nitpicky
> comments about patches / code and perf results.
>
> I think the general goal of the patch - reducing the per-tuple overhead
> and making the executor more efficient for OLAP workloads - is very
> desirable. I believe the limitations of per-row executor are one of the
> reasons why attempts to implement a columnar TAM mostly failed. The
> compression is nice, but it's hard to be competitive without an executor
> that leverages that too. So starting with an executor, in a way that
> helps even heap, seems like a good plan. So +1 to this.

I'm happy to hear that you find the overall direction worthwhile.

> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due different goals.
>
> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.
>
> This patch does things differently. IIUC, each TAM may produce it's own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

Yes, the design of this patch does differ from the index prefetching
approach, and that's largely due to the differing goals, as you say.
AIUI, the index prefetching patch uses a shared batch structure
managed mostly by indexam.c and populated by the index AM. In my
patch, each table AM produces its own batch format, which gets wrapped
in a generic TupleBatch that carries the AM-supplied TupleBatchOps for
operating on the AM's opaque data. This was a conscious choice: in
prefetching, the aim seems to be to have indexam.c manage batches, and
the operations on them, in a mostly AM-agnostic manner, whereas for
executor batching the aim is to retain TAM-specific optimizations as
much as possible and rely on the TAM for most operations on the batch
contents. Both designs have their merits given their respective use
cases, but I guess you understand that very well.
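
For reference, the shape of the wrapping is roughly as below (a
simplified sketch; the field names here are illustrative rather than
the exact definitions in the patch):

    #include "executor/tuptable.h"      /* for TupleTableSlot */

    typedef struct TupleBatch TupleBatch;

    typedef struct TupleBatchOps
    {
        /* Materialize the AM's opaque batch into TupleTableSlots. */
        void        (*materialize) (TupleBatch *batch);
        /* Release AM-specific resources (buffer pins, decode state, ...). */
        void        (*release) (TupleBatch *batch);
    } TupleBatchOps;

    struct TupleBatch
    {
        int             ntuples;        /* number of tuples in the batch */
        TupleTableSlot **slots;         /* filled by ops->materialize() */
        const TupleBatchOps *ops;       /* callbacks supplied by the table AM */
        void           *opaque;         /* AM-specific data, e.g. heap's batch */
    };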

> But for the batching, we want to retain the custom format as long as
> possible. Presumably, the various advantages of the TAMs are tied to the
> custom/columnar storage format. Memory efficiency thanks to compression,
> execution on compressed data, etc. Keeping the custom format as long as
> possible is the whole point of "late materialization" (and materializing
> as late as possible is one of the important details in column stores).

Exactly -- keeping the TAM-specific batch format as long as possible
is a key goal here. As you noted, the benefits of a custom storage
format (compression, operating on compressed data, etc.) are best
realized when we delay materialization until absolutely necessary. I
want to design this so that each TAM can produce and use its own batch
representation internally, only wrapping it when interfacing with the
executor in a generic way. I admit that's not entirely true of the
patch as it stands, as I describe further below.

> How far ahead have you though about these capabilities? I was wondering
> about two things in particular. First, at which point do we have to
> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> I get it that you want to enable passing batches between nodes, but
> would those use the same "format" as the underlying scan node, or some
> generic one? Second, will it be possible to execute expressions on the
> custom batches (i.e. on "compressed data")? Or is it necessary to
> "materialize" the batch into regular tuple slots? I realize those may
> not be there "now" but maybe it'd be nice to plan for the future.

I have been thinking about those future capabilities. The patch keeps
tuples in the TAM-specific batch format until they need to be consumed
by a node that doesn't understand that format or hasn't been modified
to invoke the TAM callbacks to decode it. In practice, that means we
materialize into regular TupleTableSlots at nodes that require it (for
example, the scan node reading from the TAM when it needs to evaluate
quals). However, the intention is to allow batches to be passed
through as many nodes as possible without materialization, ideally
using the same format produced by the scan node all the way up until
reaching a node that can only work with tuples in TupleTableSlots.

As for executing expressions directly on the custom batch data: that's
something I would like to enable in the future. Right now, expressions
(quals, projections, etc.) are evaluated after materializing into
normal tuples in the TupleTableSlots stored in the TupleBatch, because
the expression evaluation code isn't yet fully batch-aware, let alone
able to operate on compressed data in its native form. Patches
0004-0008 do try to add batch-aware expression evaluation, but that's
just a prototype. In the long term, the goal is to allow expression
evaluation on batch data (for example, applying a WHERE clause or an
aggregate transition directly on a columnar batch without converting
it to heap tuples first). This will require significant new
infrastructure (perhaps specialized batch-aware expression operators
and functions), so it's not in the current patch, but I agree it's
important to plan for it. The current design doesn't preclude it -- it
lays some groundwork by introducing the batch abstraction -- but fully
supporting that will be future work.

That said, one area I'd like to mention here, especially with a view
to enabling native execution on compressed or columnar batches, is
giving the table AM more control over how expression evaluation is
performed on its batch data. In the current patch, the AM can provide
a materialize function via TupleBatchOps, but that always produces an
array of TupleTableSlots stored in the TupleBatch, not an opaque
representation that remains under AM control. Maybe that's not bad for
a v1 patch. When evaluating expressions over a batch, a BatchVector is
built by looping over these slots and invoking the standard per-tuple
getsomeattrs() to deform each tuple into the needed columns. While
that enables batch-style EEOPs for qual evaluation and aggregate
transition (and is already a gain over per-row evaluation), it misses
the opportunity to leverage any batch-specific optimizations the AM
could offer, such as vectorized decoding, filtering over compressed
data, or extracting only the necessary columns, possibly directly in a
vector format.

I’m considering extending TupleTableSlotOps with a batch-aware variant
of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
AMs can populate column vectors (e.g., BatchVector) directly from
their native format. That would allow bypassing slot materialization
entirely and plug AM-provided decoding logic directly into the
executor’s batch expression paths. This isn’t implemented yet, but I
see it as a necessary step toward supporting fully native expression
evaluation over compressed or columnar formats. I'm not yet sure
whether TupleTableSlotOps is the right place for such a hook; it might
belong elsewhere in the abstraction. But exposing a batch-aware
interface for this purpose seems like the right direction.
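
To illustrate, on the caller side it could look something like this
(purely hypothetical; getsomeattrs_batch does not exist anywhere yet,
and BatchVector is the structure the batch-aware expression code
already builds):

    /* Hypothetical sketch, not part of the current patch set. */
    if (slot->tts_ops->getsomeattrs_batch != NULL)
    {
        /* AM decodes its native batch straight into the column vectors. */
        slot->tts_ops->getsomeattrs_batch(batch, vector, natts);
    }
    else
    {
        /* What the patch does today: deform each materialized slot ... */
        for (int i = 0; i < batch->ntuples; i++)
            slot_getsomeattrs(batch->slots[i], natts);
        /* ... and then gather tts_values/tts_isnull into the BatchVector. */
    }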

> It might be worth exploring some columnar formats, and see if this
> design would be a good fit. Let's say we want to process data read from
> a parquet file. Would we be able to leverage the format, or would we
> need to "materialize" into slots too early? Or maybe it'd be good to
> look at the VCI extension [1], discussed in a nearby thread. AFAICS
> that's still based on an index AM, but there were suggestions to use TAM
> instead (and maybe that'd be a better choice).

Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
design should be able to accommodate true columnar formats like
Parquet. If we had a table AM (or FDW) that reads Parquet files into a
columnar batch structure, the executor batching framework should
ideally allow us to pass that batch along without immediately
materializing to tuples.  As mentioned before, we might have to adjust
or extend the TupleBatch abstraction to handle a wider variety of
batch formats, but conceptually it fits -- the goal is to avoid
forcing early materialization. I will definitely keep the Parquet
use-case in mind and perhaps do some experiments with a columnar
source to ensure we aren’t baking in any unnecessary materialization.
Also, thanks for the reference to the VCI extension thread; I'll take
a look at that.

> The other option would be to "create batches" during execution, say by
> having a new node that accumulates tuples, builds a batch and sends it
> to the node above. This would help both in cases when either the lower
> node does not produce batches at all, or the batches are too small (due
> to filtering, aggregation, ...). Or course, it'd only win if this
> increases efficiency of the upper part of the plan enough to pay for
> building the batches. That can be a hard decision.

Yes, introducing a dedicated executor node to accumulate and form
batches on the fly is an interesting idea; I have thought about it and
even mentioned it in passing at the pgconf.dev unconference. This
could indeed cover scenarios where the data source (a node) doesn't
produce batches (e.g., a non-batching node feeding into a
batching-aware upper node) or where batches coming from below are too
small to be efficient. The current patch set doesn’t implement such a
node; I focused on enabling batching at the scan/TAM level first. The
cost/benefit decision for a batch-aggregator node is tricky, as you
said. We’d need a way to decide when the overhead of gathering tuples
into a batch is outweighed by the benefits to the upper node. This
likely ties into costing or adaptive execution decisions. It's
something I’m open to exploring in a future iteration, perhaps once we
have more feedback on how the existing batching performs in various
scenarios. It might also require some planner or executor smarts
(maybe the executor can decide to batch on the fly if it sees a
pattern of use, or the planner could insert such nodes when
beneficial).

> You also mentioned we could make batches larger by letting them span
> multiple pages, etc. I'm not sure that's worth it - wouldn't that
> substantially complicate the TAM code, which would need to pin+track
> multiple buffers for each batch, etc.? Possible, but is it worth it?
>
> I'm not sure allowing multi-page batches would actually solve the issue.
> It'd help with batches at the "scan level", but presumably the batch
> size in the upper nodes matters just as much. Large scan batches may
> help, but hard to predict.
>
> In the index prefetching patch we chose to keep batches 1:1 with leaf
> pages, at least for now. Instead we allowed having multiple batches at
> once. I'm not sure that'd be necessary for TAMs, though.

I tend to agree with you here. Allowing a single batch to span
multiple pages would add quite a bit of complexity to the table AM
implementations (managing multiple buffer pins per batch, tracking
page boundaries, etc.), and it's unclear if the benefit would justify
that complexity. For now, I'm inclined not to pursue multi-page
batches at the scan level in this patch. We can keep the batch
page-local (e.g., for heap, one batch corresponds to max one page, as
it does now). If we need larger batch sizes overall, we might address
that by other means -- for example, by the above-mentioned idea of a
higher-level batching node or by simply producing multiple batches in
quick succession.

You’re right that even if we made scan batches larger, it doesn’t
necessarily solve everything, since the effective batch size at
higher-level nodes could still be constrained by other factors. So
rather than complicating the low-level TAM code with multi-page
batches, I'd prefer to first see if the current approach (with
one-page batches) yields good benefits and then consider alternatives.
We could also consider letting a scan node produce multiple batches
before yielding to the upper node (similar to how the index
prefetching patch can have multiple leaf page batches in flight) if
needed, but as you note, it might not be necessary for TAMs yet. So at
this stage, I'll keep it simple.

> This also reminds me of LIMIT queries. The way I imagine a "batchified"
> executor to work is that batches are essentially "units of work". For
> example, a nested loop would grab a batch of tuples from the outer
> relation, lookup inner tuples for the whole batch, and only then pass
> the result batch. (I'm ignoring the cases when the batch explodes due to
> duplicates.)
>
> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
> just the first tuple, and the rest of the batch is wasted work? Plenty
> of (very expensive) OLAP have that, and many would likely benefit from
> batching, so just disabling batching if there's LIMIT seems way too
> heavy handed.

Yeah, LIMIT does complicate downstream batching decisions. If we
always use a full-size batch (say 64 tuples) for every operation, a
query with LIMIT 1 could end up doing a lot of unnecessary work
fetching and processing 63 tuples that never get used. Disabling
batching entirely for queries with LIMIT would indeed be overkill and
lose benefits for cases where the limit is not extremely selective.

> Perhaps it'd be good to gradually ramp up the batch size? Start with
> small batches, and then make them larger. The index prefetching does
> that too, indirectly - it reads the whole leaf page as a batch, but then
> gradually ramps up the prefetch distance (well, read_stream does that).
> Maybe the batching should have similar thing ...

An adaptive batch size that ramps up makes a lot of sense as a
solution. We could start with a very small batch (say 4 tuples) and if
we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
yet or more output is still being consumed), then increase the batch
size for subsequent operations. This way, a query that stops early
doesn’t incur the full batching overhead, whereas a query that does
process lots of tuples will gradually get to a larger batch size to
gain efficiency. This is analogous to how the index prefetching ramps
up prefetch distance, as you mentioned.
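
A minimal sketch of the kind of ramp-up I have in mind, assuming the
scan keeps its current batch size in its state (names and constants
are placeholders, not from the patch):

    #define BATCH_SIZE_INITIAL  4
    #define BATCH_SIZE_MAX      64      /* or the GUC discussed below */

    /* Called whenever a batch has been fully consumed by the parent node. */
    static inline void
    ramp_up_batch_size(ScanBatchState *state)
    {
        /* Double the batch size each time until we hit the ceiling. */
        state->batch_size = Min(state->batch_size * 2, BATCH_SIZE_MAX);
    }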

Implementing that will require some careful thought. It could be done
either in the planner (choose initial batch sizes based on context
like LIMIT) or more dynamically in the executor (adjust on the fly). I
lean towards a runtime heuristic because it’s hard for the planner to
predict exactly how a LIMIT will play out, especially in complex
plans. In any case, I agree that a gradual ramp-up or other adaptive
approach would make batching more robust in the presence of query
execution variability. I will definitely consider adding such logic,
perhaps as an improvement once the basic framework is in.

> In fact, how shall the optimizer decide whether to use batching? It's
> one thing to decide whether a node can produce/consume batches, but
> another thing is "should it"? With a node that "builds" a batch, this
> decision would apply to even more plans, I guess.
>
> I don't have a great answer to this, it seems like an incredibly tricky
> costing issue. I'm a bit worried we might end up with something too
> coarse, like "jit=on" which we know is causing problems (admittedly,
> mostly due to a lot of the LLVM work being unpredictable/external). But
> having some "adaptive" heuristics (like the gradual ramp up) might make
> it less risky.

I agree that deciding when to use batching is tricky. So far, the
patch takes a fairly simplistic approach: if a node (particularly a
scan node) supports batching, it just does it, and other parts of the
plan will consume batches if they are capable. There isn’t yet a
nuanced cost-based decision in the planner for enabling batching. This
is indeed something we’ll have to refine. We don’t want to end up with
a blunt on/off GUC that could cause regressions in some cases.

One idea is to introduce costing for batching: for example, estimate
the per-tuple savings from batching vs the overhead of materialization
or batch setup. However, developing a reliable cost model for that
will take time and experimentation, especially with the possibility of
variable batch sizes or adaptive behavior. Not to mention that it
would add one more dimension to the planner's costing model, making
planning more expensive and unpredictable. In the near term, I'm fine
with relying on feedback and perhaps manual tuning (GUCs, etc.) to
decide on batching, but that’s perhaps not a long-term solution.

I share your inclination that adaptive heuristics might be the safer
path initially. Perhaps the executor can decide to batch or not batch
based on runtime conditions. The gradual ramp-up of batch size is one
such adaptive approach. We could also consider things like monitoring
how effective batching is (are we actually processing full batches or
frequently getting cut off?) and adjust behavior. These are somewhat
speculative ideas at the moment, but the bottom line is I’m aware we
need a smarter strategy than a simple switch. This will likely evolve
as we test the patch in more scenarios.

> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> hard to say. It'd be good to be able to experiment with different
> values, so I suggest we make this a GUC and not a hard-coded constant.

Yeah, I was thinking the same while testing -- the optimal batch size
might vary by workload or hardware, and 64 was a somewhat arbitrary
starting point. I will make the batch size limit configurable
(probably as a GUC executor_batch_tuples, maybe only developer-focused
at first). That will let us and others experiment easily with
different batch sizes to see how it affects performance. It should
also help with your earlier point: for example, on a machine where 64
is too low or too high, we can adjust it without recompiling. So yes,
I'll add a GUC for the batch size in the next version of the patch.
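
For concreteness, I'm thinking of a plain integer GUC under
DEVELOPER_OPTIONS, roughly along these lines (name, default, and range
are all still up for debate):

    /* Tentative guc_tables.c entry; details are placeholders. */
    {
        {"executor_batch_tuples", PGC_USERSET, DEVELOPER_OPTIONS,
            gettext_noop("Sets the maximum number of tuples in an executor batch."),
            NULL,
            GUC_NOT_IN_SAMPLE
        },
        &executor_batch_tuples,
        64, 1, 8192,
        NULL, NULL, NULL
    },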

> As for what to add to explain, I'd start by adding info about which
> nodes are "batched" (consuming/producing batches), and some info about
> the batch sizes. An average size, maybe a histogram if you want to be a
> bit fancy.

Adding more information to EXPLAIN is a good idea. In the current
patch, EXPLAIN does not show anything about batching, but it would be
very helpful for debugging and user transparency to indicate which
nodes are operating in batch mode.  I will update EXPLAIN to mark
nodes that produce or consume batches. Likely I’ll start with
something simple like an extra line or tag for a node, e.g., "Batch:
true (avg batch size 64)" or something along those lines. An average
batch size could be computed if we have instrumentation, which would
be useful to see if, say, the batch sizes ended up smaller due to
LIMIT or other factors. A full histogram might be more detail than
most users need, but I agree even just knowing average or maximum
batch size per node could be useful for performance analysis. I'll
implement at least the basics for now, and we can refine it (maybe add
more stats) if needed.

(I had added a flag to the EXPLAIN output at one point, but removed it
because the resulting churn in the regression test output was too
noisy, though I understand I'll have to bite the bullet at some point.)
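
For illustration, I'm imagining EXPLAIN ANALYZE output along these
lines (format and wording not settled at all):

    Aggregate (actual time=... rows=1 loops=1)
      Batch: consumes batches from child
      ->  Seq Scan on bar_1 (actual time=... rows=... loops=1)
            Batch: true (avg batch size: 43.0)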

> Now, numbers from some microbenchmarks:
>
> On 9/26/25 15:28, Amit Langote wrote:
> > To evaluate the overheads and benefits, I ran microbenchmarks with
> > single and multi-aggregate queries on a single table, with and without
> > WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> > and IO costs are minimal. shared_buffers was large enough to fit the
> > whole table (up to 10M rows, ~43 on each page), and all pages were
> > prewarmed into cache before tests. Table schema/script is at [2].
> >
> > Observations from benchmarking (Detailed benchmark tables are at [3];
> > below is just a high-level summary of the main patterns):
> >
> > * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> > sum(a) FROM bar_N): batching scan output alone improved latency by
> > ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> > especially once fmgr overhead was paid per batch instead of per row.
> >
> > * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> > qual interpreter gave a big step up, with latencies dropping by
> > ~30-40% compared to batching=off.
> >
> > * Five aggregates, no WHERE: batching input from the child scan cut
> > ~15% off runtime. Adding batched transition evaluation increased
> > improvements to ~30%.
> >
> > * Five aggregates, with WHERE: modest gains from scan/input batching,
> > but per-batch transition evaluation and batched quals brought ~20-30%
> > improvement.
> >
> > * Across all cases, executor overheads became visible only after IO
> > was minimized. Once executor cost dominated, batching consistently
> > reduced CPU time, with the largest benefits coming from avoiding
> > per-row fmgr calls and evaluating quals across batches.
> >
> > I would appreciate if others could try these patches with their own
> > microbenchmarks or workloads and see if they can reproduce numbers
> > similar to mine. Feedback on both the general direction and the
> > details of the patches would be very helpful. In particular, patches
> > 0001-0003, which add the basic batch APIs and integrate them into
> > SeqScan, are intended to be the first candidates for review and
> > eventual commit. Comments on the later, more experimental patches
> > (aggregate input batching and expression evaluation (qual, aggregate
> > transition) batching) are also welcome.
> >
>
> I tried to replicate the results, but the numbers I see are not this
> good. In fact, I see a fair number of regressions (and some are not
> negligible).
>
> I'm attaching the scripts I used to build the tables / run the test. I
> used the same table structure, and tried to follow the same query
> pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
> conditions (with 100% selectivity).
>
> I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
> And I did that on my (relatively) new ryzen machine, and old xeon. The
> behavior is quite different for the two machines, but none of them shows
> such improvements. I used clang 19.0, and --with-llvm.
>
> See the attached PDFs with a summary of the results, comparing the
> results for master and the two batching branches.
>
> The ryzen is much "smoother" - it shows almost no difference with
> batching "off" (as expected). The "scan" branch (with 0001-0003) shows
> an improvement of 5-10% - it's consistent, but much less than the 10-20%
> you report. For the "agg" branch the benefits are much larger, but
> there's also a significant regression for the largest table with 100M
> rows (which is ~18GB on disk).
>
> For xeon, the results are a bit more variable, but it affects runs both
> with batching "on" and "off". The machine is just more noisy. There
> seems to be a small benefit of "scan" batching (in most cases much less
> than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
> and no regression similar to the ryzen.
>
> Perhaps I did something wrong. It does not surprise me this is somewhat
> CPU dependent. It's a bit sad the improvements are smaller for the newer
> CPU, though.

Thanks for sharing your benchmark results -- that’s very useful data.
I haven’t yet finished investigating why there's a regression relative
to master when executor_batching is turned off. I re-ran my benchmarks
to include comparisons with master and did observe some regressions in
a few cases too, but I didn't see anything obvious in profiles that
explained the slowdown. I initially assumed it might be noise, but now
I suspect it could be related to structural changes in the scan code
-- for example, I added a few new fields in the middle of
HeapScanDescData, and even though the batching logic is bypassed when
executor_batching is off, it’s possible that change alone affects
memory layout or cache behavior in a way that penalizes the unbatched
path. I haven’t confirmed that yet, but it’s on my list to look into
more closely.

Your observation that newer CPUs like the Ryzen may see smaller
improvements makes sense -- perhaps they handle the per-tuple overhead
more efficiently to begin with. Still, I’d prefer not to see
regressions at all, even in the unbatched case, so I’ll focus on
understanding and fixing that part before drawing conclusions from the
performance data.

Thanks again for the scripts -- those will help a lot in narrowing things down.

> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.

The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
evaluation that I suspect is also causing the crash you saw.

I'll try to post a v4 in a couple of weeks with some of the things I
mentioned above.

--
Thanks, Amit Langote


