Re: Batching in executor - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CA+HiwqHLjBegqeUw-babABj_icAiBTq5gh48-D3BEodVZkm8Ug@mail.gmail.com |
| In response to | Re: Batching in executor (Amit Langote <amitlangote09@gmail.com>) |
| List | pgsql-hackers |
On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Here is a significantly revised version of the patch series. A lot has
> changed since the January submission, so I want to summarize the design
> changes before getting into the patches. I think it does address the
> points in the two reviews that landed since v5, but maybe a bunch of
> points became moot after my rewrite of the relevant portions (thanks
> Junwang and ChangAo for the reviews in any case).
>
> At this point it might be better to think of this as targeting v20,
> except that if there is review bandwidth in the remaining two weeks
> before the v19 feature freeze, the rs_vistuples[] change described
> below, a standalone improvement to the existing pagemode scan path,
> could be considered for v19, though that too is an optimistic scenario.
>
> It is also worth noting that Andres identified a number of
> inefficiencies in the existing scan path in:
>
> Re: unnecessary executor overheads around seqscans
> https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
>
> that are worth fixing independently of batching. Some of those fixes
> may be better pursued first, both because they benefit all scan paths
> and because they would make batching's gains more honest.
>
> Separately, after looking at the previous version, Andres pointed out
> offlist two fundamental issues with the patch's design:
>
> * The heapam implementation (in a version of the patch I didn't post
> to the thread) duplicated heap_prepare_pagescan() logic in a separate
> batch-specific code path, which is not acceptable, as changes should
> benefit the existing slot interface too; the duplication is also bad
> for future maintainability. The v5 version of that code is not great
> in that respect either: it instead duplicated heapgettup_pagemode()
> to slap batching on it.
> * Allocating executor_batch_rows slots on the executor side to receive
> rows from the AM adds significant overhead for slot initialization and
> management, and for non-row-organized AMs that do not produce
> individual rows at all, those slots would never be meaningfully
> populated.
>
> In any case, he just wasn't a fan of the slot-array approach the
> moment I mentioned it. The previous version had two slot arrays,
> inslots and outslots, of TTSOpsHeapTuple type (not
> TTSOpsBufferHeapTuple, because buffer pins were managed by the batch
> code, which has its own modularity/correctness issues), populated via
> a materialize_all callback. A batch qual evaluator would copy
> qualifying tuples into outslots, with an activeslots pointer switching
> between the two depending on whether batch qual evaluation was used.
>
> The new design addresses both issues and differs from the previous
> version in several other ways:
>
> * Single slot instead of slot arrays: there is a single
> TupleTableSlot, reusing the scan node's ss_ScanTupleSlot, whose type
> was already determined by the AM via table_slot_callbacks(). The slot
> is re-pointed to each HeapTuple in the current buffer page via a new
> repoint_slot AM callback, with no materialization or copying. Tuples
> are returned one by one from the executor's perspective, but the AM
> serves them in page-sized batches from pre-built HeapTupleData
> descriptors in rs_vistuples[], avoiding repeated descent into heapam
> per tuple. This is heapam's implementation of the batch interface;
> there is no intention to force other AMs into the same row-oriented
> model.
>
> * Batch qual evaluator not included: with the single-slot model,
> quals are evaluated per tuple via the existing ExecQual path after
> each repoint_slot call.
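The single-slot flow described above can be sketched with mock types. Everything here (the struct layouts, the repoint_slot and count_qualifying functions) is an illustrative stand-in for how a single slot is re-pointed across a page-sized batch, not the patch's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for PostgreSQL types; names are illustrative only. */
typedef struct HeapTupleData { int id; } HeapTupleData;
typedef struct TupleTableSlot { HeapTupleData *tuple; } TupleTableSlot;

/* A page-sized batch of pre-built tuple descriptors, as in rs_vistuples[]. */
typedef struct RowBatch
{
    HeapTupleData *tuples;      /* points into the AM's per-page array */
    int            ntuples;
    int            next;        /* cursor within the batch */
} RowBatch;

/*
 * Sketch of a repoint_slot-style callback: re-point the scan node's single
 * slot at the next tuple in the batch, with no copying or materialization.
 * Returns 0 once the batch is exhausted.
 */
static int
repoint_slot(RowBatch *batch, TupleTableSlot *slot)
{
    if (batch->next >= batch->ntuples)
        return 0;
    slot->tuple = &batch->tuples[batch->next++];
    return 1;
}

/* An example per-tuple qual, standing in for an ExecQual invocation. */
static int
qual_even(const TupleTableSlot *slot)
{
    return slot->tuple->id % 2 == 0;
}

/* Drive one batch through a per-tuple qual, one repoint per tuple. */
static int
count_qualifying(RowBatch *batch, TupleTableSlot *slot,
                 int (*qual)(const TupleTableSlot *))
{
    int n = 0;

    while (repoint_slot(batch, slot))
        if (qual(slot))
            n++;
    return n;
}
```

The point of the shape is that only the pointer inside the one slot changes per tuple; the batch itself is never copied into per-row slots.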
> A natural next step would be a new opcode (EEOP) that calls
> repoint_slot() internally within expression evaluation, allowing
> ExecQual to advance through multiple tuples from the same batch
> without returning to the scan node each time, with qual results
> accumulated in a bitmask in ExprState. The details of that will be
> worked out in a follow-on series.
>
> * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> allows heap_getnextbatch() to simply advance a slice pointer into that
> array without any additional copying or re-entering heap code, making
> a separate batch-specific scan function unnecessary.
>
> * TupleBatch renamed to RowBatch: "row batch" is more natural
> terminology for this concept and also consistent with how similar
> abstractions are named in columnar and OLAP systems.
>
> * AM callbacks now take RowBatch directly: previously
> heap_getnextbatch() returned a void pointer that the executor would
> store into RowBatch.am_payload, because only the executor knew the
> internals of RowBatch. Now the AM receives RowBatch directly as a
> parameter and can populate it without the executor acting as an
> intermediary. This is also why RowBatch is introduced in its own
> patch ahead of the AM API addition, so the struct definition is
> available to both sides.
>
> Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> instead of OffsetNumbers, as a standalone improvement to the existing
> pagemode scan path. Measured on a pg_prewarm'd (also VACUUM FREEZE'd
> in the all-visible case) table with 1M/5M/10M rows:
>
> query                         all-visible       not-all-visible
> count(*)                      -0.2% to +0.9%    -0.4% to +0.5%
> count(*) WHERE id % 10 = 0    -1.1% to +3.4%    +0.2% to +1.5%
> SELECT * LIMIT 1 OFFSET N     -2.2% to -0.6%    -0.9% to +6.6%
> SELECT * WHERE id%10=0 LIMIT  -0.8% to +3.9%    +0.9% to +9.6%
>
> No significant regression on either page type.
> The structural improvement is most visible on not-all-visible pages,
> where HeapTupleSatisfiesMVCCBatch() already reads every tuple header
> during visibility checks, so persisting the result into rs_vistuples[]
> eliminates the downstream re-read (in heapgettup_pagemode()) with no
> measurable overhead. That said, these numbers are somewhat noisy on my
> machine. Results on other machines would be welcome.
>
> Patches 0002-0005 add the RowBatch infrastructure, the batch table AM
> API and heapam implementation, seqscan variants that use the new
> scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> respectively. With batching enabled (executor_batch_rows=300,
> ~MaxHeapTuplesPerPage):
>
> query                         all-visible    not-all-visible
> count(*)                      +11 to +15%    +9 to +13%
> count(*) WHERE id % 10 = 0    +6 to +11%     +10 to +14%
> SELECT * LIMIT 1 OFFSET N     +16 to +19%    +16 to +22%
> SELECT * WHERE id%10=0 LIMIT  +8 to +10%     +8 to +13%
>
> With executor_batch_rows=0, results are within noise of master across
> all query types and sizes, confirming no regression from the
> infrastructure changes themselves. The not-all-visible results tend to
> show slightly higher gains than the all-visible case. This is likely
> because the existing heapam code is more optimized for the all-visible
> path, so the not-all-visible path, which goes through
> HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> more headroom that batching can exploit.
>
> Setting aside the current series for a moment, there are some broader
> design questions worth raising while we have attention on this area.
> Some of these echo points Tomas raised in his first reply on this
> thread, and I am reiterating them deliberately, since I have not
> managed to fully address them on my own (or simply didn't need to for
> the TAM-to-scan-node batching) and think they would benefit from wider
> input rather than just my own iteration.
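The slice advance that heap_getnextbatch() performs over rs_vistuples[] can be sketched as follows. The types and the function are simplified stand-ins for illustration, not the patch's real code; the only mechanic being shown is that each batch is a window into an array built once per page, so handing out the next batch is pointer arithmetic rather than copying:

```c
#include <assert.h>

/* Illustrative stand-ins, not PostgreSQL's real definitions. */
typedef struct HeapTupleData { unsigned offnum; } HeapTupleData;

#define MAX_TUPLES_PER_PAGE 291 /* roughly MaxHeapTuplesPerPage for 8kB pages */

typedef struct PageScanState
{
    HeapTupleData vistuples[MAX_TUPLES_PER_PAGE];   /* built once per page */
    int           nvis;         /* visible tuples on the page */
    int           cursor;       /* how many have been handed out */
} PageScanState;

typedef struct RowBatch
{
    HeapTupleData *tuples;      /* slice into vistuples[]; no copying */
    int            ntuples;
} RowBatch;

/*
 * Sketch of the heap_getnextbatch() idea: return the next slice of up to
 * batch_rows visible tuples by advancing a cursor into vistuples[].
 * Returns 0 once the page is exhausted.
 */
static int
heap_getnextbatch_sketch(PageScanState *scan, RowBatch *batch, int batch_rows)
{
    int remaining = scan->nvis - scan->cursor;

    if (remaining <= 0)
        return 0;
    batch->tuples = &scan->vistuples[scan->cursor];
    batch->ntuples = remaining < batch_rows ? remaining : batch_rows;
    scan->cursor += batch->ntuples;
    return 1;
}
```

This also shows why storing full HeapTupleData (rather than OffsetNumbers) in the per-page array matters: the slice is directly consumable without re-entering heap code per tuple.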
> We should also start thinking about other ways the executor can
> consume batch rows, not always assuming they are presented as
> HeapTupleData. For instance, an AM could expose decoded column arrays
> directly to operators that can consume them, bypassing slot-based
> deforming entirely, or a columnar AM could implement scan_getnextbatch
> by decoding column strips directly into the batch without going
> through per-tuple HeapTupleData at all. Feedback on whether the
> current RowBatch design and the choices made in the scan_getnextbatch
> and RowBatchOps APIs make that sort of thing harder than it needs to
> be would be appreciated. For example, heapam's implementation of
> scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot, re-pointed
> to HeapTupleData entries one at a time via repoint_slot in
> RowBatchHeapOps. That works for heapam, but a columnar AM could
> implement scan_getnextbatch to decode column strips directly into
> arrays in the batch, with no per-row repoint step needed at all. Any
> adjustments that would make RowBatch more AM-agnostic are worth
> discussing now, before the design hardens.
>
> There are also broader open questions about how far the batch model
> can extend beyond the scan node. Qual pushdown into the AM has been
> discussed in nearby threads and would be one way to allow expression
> evaluation to happen before data reaches the executor proper, though
> that is a separate effort. For the purposes of this series, expression
> evaluation still happens in the executor after scan_getnextbatch
> returns. If the scan node does not project, the buffer heap slot is
> passed directly to the parent node, which calls slot callbacks to
> deform as needed. But once a node above projects, aggregates, or
> joins, the notion of a page-sized batch from a single AM loses its
> meaning and virtual slots take over.
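To make the columnar alternative above concrete, here is a hedged sketch of a batch container that could be backed either by row descriptors (heapam-style) or by decoded column arrays (columnar-style). Every name in it is hypothetical, not from the actual patch; the point is only that a column-aware consumer never needs a per-row repoint step:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical AM-agnostic batch: the same container could carry row
 * descriptors or decoded column arrays, distinguished by a kind tag.
 */
typedef enum { BATCH_ROWS, BATCH_COLUMNS } BatchKind;

typedef struct ColumnChunk
{
    const int64_t *values;      /* typed, packed values for one column */
    int            nvalues;
} ColumnChunk;

typedef struct RowBatchSketch
{
    BatchKind   kind;
    ColumnChunk col;            /* columnar payload; a row payload would
                                 * occupy this space for BATCH_ROWS */
} RowBatchSketch;

/*
 * A consumer that understands column arrays can aggregate a column in one
 * tight loop, with no slot repointing; a row-oriented consumer would
 * instead loop a repoint_slot-style callback over individual tuples.
 */
static int64_t
sum_column(const RowBatchSketch *batch)
{
    int64_t total = 0;

    if (batch->kind != BATCH_COLUMNS)
        return 0;               /* caller would fall back to the row path */
    for (int i = 0; i < batch->col.nvalues; i++)
        total += batch->col.values[i];
    return total;
}
```

Whether RowBatch should admit this kind of dual backing, and how a consumer discovers which backing it got, is exactly the AM-agnosticism question raised above.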
> Whether RowBatch is usable or meaningful beyond the scan/TAM boundary
> in any form, and whether the core executor will ever have
> non-HeapTupleData batch consumption paths or leave that entirely to
> extensions, are open questions worth discussing.
>
> For RowBatch to eventually play the role that TupleTableSlot plays for
> row-at-a-time execution, something inside it would need to serve as
> the common currency for batch data, analogous to TupleTableSlot's
> datum/isnull arrays. Column arrays are the obvious direction, but even
> that leaves open the question of representation. PostgreSQL's Datum is
> a pointer-sized abstraction that boxes everything, whereas vectorized
> systems use typed packed arrays of native types with validity
> bitmasks, which is a significant part of why tight vectorized loops
> are fast there. Whether column arrays of Datum would be good enough,
> or whether going further toward typed packed arrays would be necessary
> to get meaningful vectorization, is a deeper design question that this
> series deliberately does not try to answer.
>
> Even though the focus is on getting batching working at the scan/TAM
> boundary first, thoughts on any of these points would be welcome.

Rebased.

--
Thanks, Amit Langote
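The representation question above can be illustrated with two small summation loops: one over pointer-sized boxed values in the style of Datum with a per-element null array, and one over a typed packed array with a validity bitmask. All names here are illustrative stand-ins; the typed variant is the shape that vectorizing compilers handle well, which is the performance argument sketched in the text:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a pointer-sized boxed value, Datum-style. */
typedef uintptr_t DatumSketch;

/* Boxed: every element goes through a pointer-sized box plus a null byte. */
static int64_t
sum_boxed(const DatumSketch *vals, const char *isnull, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        if (!isnull[i])
            total += (int64_t) (intptr_t) vals[i];
    return total;
}

/* Typed + bitmask: a tight loop over native int64 values, with validity
 * tracked one bit per element, as vectorized engines typically do. */
static int64_t
sum_typed(const int64_t *vals, const uint64_t *validity, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        if (validity[i / 64] & (UINT64_C(1) << (i % 64)))
            total += vals[i];
    return total;
}
```

Both loops compute the same answer; the difference is purely representational, which is why the choice between Datum arrays and typed packed arrays is a design question rather than a correctness one.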
Attachment
- v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch
- v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch
- v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch
- v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch
- v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch