Re: Batching in executor - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CA+HiwqHLjBegqeUw-babABj_icAiBTq5gh48-D3BEodVZkm8Ug@mail.gmail.com |
| In response to | Re: Batching in executor (Amit Langote <amitlangote09@gmail.com>) |
| List | pgsql-hackers |
On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Here is a significantly revised version of the patch series. A lot has
> changed since the January submission, so I want to summarize the design
> changes before getting into the patches. I think it does address the
> points in the two reviews that landed since v5, but maybe a bunch of
> points became moot after my rewrite of the relevant portions (thanks
> Junwang and ChangAo for the reviews in any case).
>
> At this point it might be better to think of this as targeting v20,
> except that if there is review bandwidth in the remaining two weeks
> before the v19 feature freeze, the rs_vistuples[] change described
> below, a standalone improvement to the existing pagemode scan path,
> could be considered for v19, though that too is an optimistic scenario.
>
> It is also worth noting that Andres identified a number of
> inefficiencies in the existing scan path in:
>
> Re: unnecessary executor overheads around seqscans
> https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
>
> that are worth fixing independently of batching. Some of those fixes
> may be better pursued first, both because they benefit all scan paths
> and because they would make batching's gains more honest.
>
> Separately, after looking at the previous version, Andres pointed out
> offlist two fundamental issues with the patch's design:
>
> * The heapam implementation (in a version of the patch I didn't post
> to the thread) duplicated heap_prepare_pagescan() logic in a separate
> batch-specific code path, which is not acceptable, as changes should
> benefit the existing slot interface too; the duplication is also bad
> for future maintainability. The v5 version of that code is not great
> in that respect either: it instead duplicated heapgettup_pagemode()
> to slap batching on it.
> * Allocating executor_batch_rows slots on the executor side to receive
> rows from the AM adds significant overhead for slot initialization and
> management, and for non-row-organized AMs that do not produce
> individual rows at all, those slots would never be meaningfully
> populated.
>
> In any case, he just wasn't a fan of the slot-array approach the
> moment I mentioned it. The previous version had two slot arrays,
> inslots and outslots, of TTSOpsHeapTuple type (not
> TTSOpsBufferHeapTuple, because buffer pins were managed by the batch
> code, which has its own modularity/correctness issues), populated via
> a materialize_all callback. A batch qual evaluator would copy
> qualifying tuples into outslots, with an activeslots pointer switching
> between the two depending on whether batch qual evaluation was used.
>
> The new design addresses both issues and differs from the previous
> version in several other ways:
>
> * Single slot instead of slot arrays: there is a single
> TupleTableSlot, reusing the scan node's ss_ScanTupleSlot, whose type
> was already determined by the AM via table_slot_callbacks(). The slot
> is re-pointed to each HeapTuple in the current buffer page via a new
> repoint_slot AM callback, with no materialization or copying. Tuples
> are returned one by one from the executor's perspective, but the AM
> serves them in page-sized batches from pre-built HeapTupleData
> descriptors in rs_vistuples[], avoiding repeated descent into heapam
> per tuple. This is heapam's implementation of the batch interface;
> there is no intention to force other AMs into the same row-oriented
> model.
>
> * Batch qual evaluator not included: with the single-slot model,
> quals are evaluated per tuple via the existing ExecQual path after
> each repoint_slot call.
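The single-slot flow described above can be sketched with mock types. Everything here (the struct layouts, the repoint_slot and count_qualifying functions) is an illustrative stand-in for how a single slot is re-pointed across a page-sized batch, not the patch's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for PostgreSQL types; names are illustrative only. */
typedef struct HeapTupleData { int id; } HeapTupleData;
typedef struct TupleTableSlot { HeapTupleData *tuple; } TupleTableSlot;

/* A page-sized batch of pre-built tuple descriptors, as in rs_vistuples[]. */
typedef struct RowBatch
{
    HeapTupleData *tuples;      /* points into the AM's per-page array */
    int            ntuples;
    int            next;        /* cursor within the batch */
} RowBatch;

/*
 * Sketch of a repoint_slot-style callback: re-point the scan node's single
 * slot at the next tuple in the batch, with no copying or materialization.
 * Returns 0 once the batch is exhausted.
 */
static int
repoint_slot(RowBatch *batch, TupleTableSlot *slot)
{
    if (batch->next >= batch->ntuples)
        return 0;
    slot->tuple = &batch->tuples[batch->next++];
    return 1;
}

/* An example per-tuple qual, standing in for an ExecQual invocation. */
static int
qual_even(const TupleTableSlot *slot)
{
    return slot->tuple->id % 2 == 0;
}

/* Drive one batch through a per-tuple qual, one repoint per tuple. */
static int
count_qualifying(RowBatch *batch, TupleTableSlot *slot,
                 int (*qual)(const TupleTableSlot *))
{
    int n = 0;

    while (repoint_slot(batch, slot))
        if (qual(slot))
            n++;
    return n;
}
```

The point of the shape is that only the pointer inside the one slot changes per tuple; the batch itself is never copied into per-row slots.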
> A natural next step would be a new opcode (EEOP) that calls
> repoint_slot() internally within expression evaluation, allowing
> ExecQual to advance through multiple tuples from the same batch
> without returning to the scan node each time, with qual results
> accumulated in a bitmask in ExprState. The details of that will be
> worked out in a follow-on series.
>
> * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> allows heap_getnextbatch() to simply advance a slice pointer into that
> array without any additional copying or re-entering heap code, making
> a separate batch-specific scan function unnecessary.
>
> * TupleBatch renamed to RowBatch: "row batch" is more natural
> terminology for this concept and also consistent with how similar
> abstractions are named in columnar and OLAP systems.
>
> * AM callbacks now take RowBatch directly: previously
> heap_getnextbatch() returned a void pointer that the executor would
> store into RowBatch.am_payload, because only the executor knew the
> internals of RowBatch. Now the AM receives RowBatch directly as a
> parameter and can populate it without the executor acting as an
> intermediary. This is also why RowBatch is introduced in its own
> patch ahead of the AM API addition, so the struct definition is
> available to both sides.
>
> Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> instead of OffsetNumbers, as a standalone improvement to the existing
> pagemode scan path. Measured on a pg_prewarm'd (also VACUUM FREEZE'd
> in the all-visible case) table with 1M/5M/10M rows:
>
> query                         all-visible       not-all-visible
> count(*)                      -0.2% to +0.9%    -0.4% to +0.5%
> count(*) WHERE id % 10 = 0    -1.1% to +3.4%    +0.2% to +1.5%
> SELECT * LIMIT 1 OFFSET N     -2.2% to -0.6%    -0.9% to +6.6%
> SELECT * WHERE id%10=0 LIMIT  -0.8% to +3.9%    +0.9% to +9.6%
>
> No significant regression on either page type.
> The structural improvement is most visible on not-all-visible pages,
> where HeapTupleSatisfiesMVCCBatch() already reads every tuple header
> during visibility checks, so persisting the result into rs_vistuples[]
> eliminates the downstream re-read (in heapgettup_pagemode()) with no
> measurable overhead. That said, these numbers are somewhat noisy on my
> machine. Results on other machines would be welcome.
>
> Patches 0002-0005 add the RowBatch infrastructure, the batch table AM
> API and heapam implementation, seqscan variants that use the new
> scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> respectively. With batching enabled (executor_batch_rows=300,
> ~MaxHeapTuplesPerPage):
>
> query                         all-visible    not-all-visible
> count(*)                      +11 to +15%    +9 to +13%
> count(*) WHERE id % 10 = 0    +6 to +11%     +10 to +14%
> SELECT * LIMIT 1 OFFSET N     +16 to +19%    +16 to +22%
> SELECT * WHERE id%10=0 LIMIT  +8 to +10%     +8 to +13%
>
> With executor_batch_rows=0, results are within noise of master across
> all query types and sizes, confirming no regression from the
> infrastructure changes themselves. The not-all-visible results tend to
> show slightly higher gains than the all-visible case. This is likely
> because the existing heapam code is more optimized for the all-visible
> path, so the not-all-visible path, which goes through
> HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> more headroom that batching can exploit.
>
> Setting aside the current series for a moment, there are some broader
> design questions worth raising while we have attention on this area.
> Some of these echo points Tomas raised in his first reply on this
> thread, and I am reiterating them deliberately, since I have not
> managed to fully address them on my own (or simply didn't need to for
> the TAM-to-scan-node batching) and think they would benefit from wider
> input rather than just my own iteration.
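The slice advance that heap_getnextbatch() performs over rs_vistuples[] can be sketched as follows. The types and the function are simplified stand-ins for illustration, not the patch's real code; the only mechanic being shown is that each batch is a window into an array built once per page, so handing out the next batch is pointer arithmetic rather than copying:

```c
#include <assert.h>

/* Illustrative stand-ins, not PostgreSQL's real definitions. */
typedef struct HeapTupleData { unsigned offnum; } HeapTupleData;

#define MAX_TUPLES_PER_PAGE 291 /* roughly MaxHeapTuplesPerPage for 8kB pages */

typedef struct PageScanState
{
    HeapTupleData vistuples[MAX_TUPLES_PER_PAGE];   /* built once per page */
    int           nvis;         /* visible tuples on the page */
    int           cursor;       /* how many have been handed out */
} PageScanState;

typedef struct RowBatch
{
    HeapTupleData *tuples;      /* slice into vistuples[]; no copying */
    int            ntuples;
} RowBatch;

/*
 * Sketch of the heap_getnextbatch() idea: return the next slice of up to
 * batch_rows visible tuples by advancing a cursor into vistuples[].
 * Returns 0 once the page is exhausted.
 */
static int
heap_getnextbatch_sketch(PageScanState *scan, RowBatch *batch, int batch_rows)
{
    int remaining = scan->nvis - scan->cursor;

    if (remaining <= 0)
        return 0;
    batch->tuples = &scan->vistuples[scan->cursor];
    batch->ntuples = remaining < batch_rows ? remaining : batch_rows;
    scan->cursor += batch->ntuples;
    return 1;
}
```

This also shows why storing full HeapTupleData (rather than OffsetNumbers) in the per-page array matters: the slice is directly consumable without re-entering heap code per tuple.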
> We should also start thinking about other ways the executor can
> consume batch rows, not always assuming they are presented as
> HeapTupleData. For instance, an AM could expose decoded column arrays
> directly to operators that can consume them, bypassing slot-based
> deforming entirely, or a columnar AM could implement scan_getnextbatch
> by decoding column strips directly into the batch without going
> through per-tuple HeapTupleData at all. Feedback on whether the
> current RowBatch design and the choices made in the scan_getnextbatch
> and RowBatchOps APIs make that sort of thing harder than it needs to
> be would be appreciated. For example, heapam's implementation of
> scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot, re-pointed
> to HeapTupleData entries one at a time via repoint_slot in
> RowBatchHeapOps. That works for heapam, but a columnar AM could
> implement scan_getnextbatch to decode column strips directly into
> arrays in the batch, with no per-row repoint step needed at all. Any
> adjustments that would make RowBatch more AM-agnostic are worth
> discussing now, before the design hardens.
>
> There are also broader open questions about how far the batch model
> can extend beyond the scan node. Qual pushdown into the AM has been
> discussed in nearby threads and would be one way to allow expression
> evaluation to happen before data reaches the executor proper, though
> that is a separate effort. For the purposes of this series, expression
> evaluation still happens in the executor after scan_getnextbatch
> returns. If the scan node does not project, the buffer heap slot is
> passed directly to the parent node, which calls slot callbacks to
> deform as needed. But once a node above projects, aggregates, or
> joins, the notion of a page-sized batch from a single AM loses its
> meaning and virtual slots take over.
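To make the columnar alternative above concrete, here is a hedged sketch of a batch container that could be backed either by row descriptors (heapam-style) or by decoded column arrays (columnar-style). Every name in it is hypothetical, not from the actual patch; the point is only that a column-aware consumer never needs a per-row repoint step:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical AM-agnostic batch: the same container could carry row
 * descriptors or decoded column arrays, distinguished by a kind tag.
 */
typedef enum { BATCH_ROWS, BATCH_COLUMNS } BatchKind;

typedef struct ColumnChunk
{
    const int64_t *values;      /* typed, packed values for one column */
    int            nvalues;
} ColumnChunk;

typedef struct RowBatchSketch
{
    BatchKind   kind;
    ColumnChunk col;            /* columnar payload; a row payload would
                                 * occupy this space for BATCH_ROWS */
} RowBatchSketch;

/*
 * A consumer that understands column arrays can aggregate a column in one
 * tight loop, with no slot repointing; a row-oriented consumer would
 * instead loop a repoint_slot-style callback over individual tuples.
 */
static int64_t
sum_column(const RowBatchSketch *batch)
{
    int64_t total = 0;

    if (batch->kind != BATCH_COLUMNS)
        return 0;               /* caller would fall back to the row path */
    for (int i = 0; i < batch->col.nvalues; i++)
        total += batch->col.values[i];
    return total;
}
```

Whether RowBatch should admit this kind of dual backing, and how a consumer discovers which backing it got, is exactly the AM-agnosticism question raised above.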
> Whether RowBatch is usable or meaningful beyond the scan/TAM boundary
> in any form, and whether the core executor will ever have
> non-HeapTupleData batch consumption paths or leave that entirely to
> extensions, are open questions worth discussing.
>
> For RowBatch to eventually play the role that TupleTableSlot plays for
> row-at-a-time execution, something inside it would need to serve as
> the common currency for batch data, analogous to TupleTableSlot's
> datum/isnull arrays. Column arrays are the obvious direction, but even
> that leaves open the question of representation. PostgreSQL's Datum is
> a pointer-sized abstraction that boxes everything, whereas vectorized
> systems use typed packed arrays of native types with validity
> bitmasks, which is a significant part of why tight vectorized loops
> are fast there. Whether column arrays of Datum would be good enough,
> or whether going further toward typed packed arrays would be necessary
> to get meaningful vectorization, is a deeper design question that this
> series deliberately does not try to answer.
>
> Even though the focus is on getting batching working at the scan/TAM
> boundary first, thoughts on any of these points would be welcome.

Rebased.

--
Thanks, Amit Langote
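The representation question above can be illustrated with two small summation loops: one over pointer-sized boxed values in the style of Datum with a per-element null array, and one over a typed packed array with a validity bitmask. All names here are illustrative stand-ins; the typed variant is the shape that vectorizing compilers handle well, which is the performance argument sketched in the text:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a pointer-sized boxed value, Datum-style. */
typedef uintptr_t DatumSketch;

/* Boxed: every element goes through a pointer-sized box plus a null byte. */
static int64_t
sum_boxed(const DatumSketch *vals, const char *isnull, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        if (!isnull[i])
            total += (int64_t) (intptr_t) vals[i];
    return total;
}

/* Typed + bitmask: a tight loop over native int64 values, with validity
 * tracked one bit per element, as vectorized engines typically do. */
static int64_t
sum_typed(const int64_t *vals, const uint64_t *validity, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        if (validity[i / 64] & (UINT64_C(1) << (i % 64)))
            total += vals[i];
    return total;
}
```

Both loops compute the same answer; the difference is purely representational, which is why the choice between Datum arrays and typed packed arrays is a design question rather than a correctness one.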
Attachment
- v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch
- v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch
- v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch
- v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch
- v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch