Re: Batching in executor - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CA+HiwqGkway9dcpTVmRs0Q=ZsQQ8Nx0MRjy9jfDmR8QFLnd-Mg@mail.gmail.com |
| In response to | Re: Batching in executor (Junwang Zhao <zhjwpku@gmail.com>) |
| Responses | Re: Batching in executor |
| List | pgsql-hackers |
Hi,

Here is a significantly revised version of the patch series. A lot has changed since the January submission, so I want to summarize the design changes before getting into the patches.

I think this addresses the points in the two reviews that landed since v5, though a number of those points may have become moot after my rewrite of the relevant portions (thanks Junwang and ChangAo for the reviews in any case).

At this point it might be better to think of this as targeting v20. That said, if there is review bandwidth in the remaining two weeks before the v19 feature freeze, the rs_vistuples[] change described below, a standalone improvement to the existing pagemode scan path, could be considered for v19, though even that is an optimistic scenario.

It is also worth noting that Andres identified a number of inefficiencies in the existing scan path in:

Re: unnecessary executor overheads around seqscans
https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw

that are worth fixing independently of batching. Some of those fixes may be better pursued first, both because they benefit all scan paths and because they would make batching's gains more honest.

Separately, after looking at the previous version, Andres pointed out offlist two fundamental issues with the patch's design:

* The heapam implementation (in a version of the patch I didn't post to the thread) duplicated heap_prepare_pagescan() logic in a separate batch-specific code path, which is not acceptable: changes should benefit the existing slot interface too, and the duplication would hurt future maintainability. The v5 version of that code is not great in that respect either; it instead duplicated heapgettup_pagemode() to slap batching on it.
* Allocating executor_batch_rows slots on the executor side to receive rows from the AM adds significant overhead for slot initialization and management, and for non-row-organized AMs that do not produce individual rows at all, those slots would never be meaningfully populated. In any case, he was never a fan of the slot-array approach from the moment I mentioned it.

For context, the previous version had two slot arrays, inslots and outslots, of TTSOpsHeapTuple type (not TTSOpsBufferHeapTuple, because buffer pins were managed by the batch code, which has its own modularity/correctness issues), populated via a materialize_all callback. A batch qual evaluator would copy qualifying tuples into outslots, with an activeslots pointer switching between the two depending on whether batch qual evaluation was used.

The new design addresses both issues and differs from the previous version in several other ways:

* Single slot instead of slot arrays: there is a single TupleTableSlot, reusing the scan node's ss_ScanTupleSlot, whose type was already determined by the AM via table_slot_callbacks(). The slot is re-pointed to each HeapTuple in the current buffer page via a new repoint_slot AM callback, with no materialization or copying. Tuples are returned one by one from the executor's perspective, but the AM serves them in page-sized batches from pre-built HeapTupleData descriptors in rs_vistuples[], avoiding repeated descent into heapam per tuple. This is heapam's implementation of the batch interface; there is no intention to force other AMs into the same row-oriented model.

* Batch qual evaluator not included: with the single-slot model, quals are evaluated per tuple via the existing ExecQual path after each repoint_slot call.
A natural next step would be a new opcode (EEOP) that calls repoint_slot() internally within expression evaluation, allowing ExecQual to advance through multiple tuples from the same batch without returning to the scan node each time, with qual results accumulated in a bitmask in ExprState. The details of that will be worked out in a follow-on series.

* heapgettup_pagemode_batch() gone: patch 0001 (described below) makes HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which allows heap_getnextbatch() to simply advance a slice pointer into that array without any additional copying or re-entering heap code, making a separate batch-specific scan function unnecessary.

* TupleBatch renamed to RowBatch: "row batch" is more natural terminology for this concept and is also consistent with how similar abstractions are named in columnar and OLAP systems.

* AM callbacks now take RowBatch directly: previously heap_getnextbatch() returned a void pointer that the executor would store into RowBatch.am_payload, because only the executor knew the internals of RowBatch. Now the AM receives RowBatch directly as a parameter and can populate it without the executor acting as an intermediary. This is also why RowBatch is introduced in its own patch ahead of the AM API addition, so the struct definition is available to both sides.

Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries instead of OffsetNumbers, as a standalone improvement to the existing pagemode scan path. Measured on a pg_prewarm'd (and, in the all-visible case, VACUUM FREEZE'd) table with 1M/5M/10M rows:

| query | all-visible | not-all-visible |
|---|---|---|
| count(*) | -0.2% to +0.9% | -0.4% to +0.5% |
| count(*) WHERE id % 10 = 0 | -1.1% to +3.4% | +0.2% to +1.5% |
| SELECT * LIMIT 1 OFFSET N | -2.2% to -0.6% | -0.9% to +6.6% |
| SELECT * WHERE id%10=0 LIMIT | -0.8% to +3.9% | +0.9% to +9.6% |

No significant regression on either page type.
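To make the rs_vistuples[] and heap_getnextbatch() mechanics above concrete, here is a minimal, self-contained C model of the idea. All names (MiniScan, MiniBatch, mini_getnextbatch, BATCH_ROWS) are hypothetical stand-ins, not the patch's actual API; the point is only that once the array holds full tuple descriptors, handing out a batch is just advancing a slice pointer, with no copying and no re-entry into AM code:

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_ROWS 2            /* stand-in for executor_batch_rows */

typedef struct MiniTuple        /* stand-in for HeapTupleData */
{
    int id;
} MiniTuple;

typedef struct MiniScan         /* stand-in for HeapScanDesc */
{
    MiniTuple vistuples[8];     /* full tuples, as in patched rs_vistuples[] */
    int       nvistuples;       /* visible tuples on the current page */
    int       curindex;         /* next tuple to hand out */
} MiniScan;

typedef struct MiniBatch        /* stand-in for RowBatch */
{
    MiniTuple *tuples;          /* slice into vistuples[], no copying */
    int        ntuples;
} MiniBatch;

/*
 * Model of heap_getnextbatch(): point the batch at the next slice of
 * vistuples[] and advance; returns 0 when the page is exhausted.
 */
static int
mini_getnextbatch(MiniScan *scan, MiniBatch *batch)
{
    int remaining = scan->nvistuples - scan->curindex;

    if (remaining <= 0)
        return 0;
    batch->tuples = &scan->vistuples[scan->curindex];
    batch->ntuples = remaining < BATCH_ROWS ? remaining : BATCH_ROWS;
    scan->curindex += batch->ntuples;
    return 1;
}
```

With three visible tuples and a batch size of two, successive calls yield slices of two and one tuples and then report exhaustion, all aliasing the same backing array.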
The structural improvement is most visible on not-all-visible pages, where HeapTupleSatisfiesMVCCBatch() already reads every tuple header during visibility checks, so persisting the result into rs_vistuples[] eliminates the downstream re-read (in heapgettup_pagemode()) with no measurable overhead. That said, these numbers are somewhat noisy on my machine. Results on other machines would be welcome.

Patches 0002-0005 add, respectively: the RowBatch infrastructure, the batch AM API and heapam implementation, seqscan variants that use the new scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support. With batching enabled (executor_batch_rows=300, ~MaxHeapTuplesPerPage):

| query | all-visible | not-all-visible |
|---|---|---|
| count(*) | +11 to +15% | +9 to +13% |
| count(*) WHERE id % 10 = 0 | +6 to +11% | +10 to +14% |
| SELECT * LIMIT 1 OFFSET N | +16 to +19% | +16 to +22% |
| SELECT * WHERE id%10=0 LIMIT | +8 to +10% | +8 to +13% |

With executor_batch_rows=0, results are within noise of master across all query types and sizes, confirming no regression from the infrastructure changes themselves.

The not-all-visible results tend to show slightly higher gains than the all-visible case. This is likely because the existing heapam code is more optimized for the all-visible path; the not-all-visible path, which goes through HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has more headroom that batching can exploit.

Setting aside the current series for a moment, there are some broader design questions worth raising while we have attention on this area. Some of these echo points Tomas raised in his first reply on this thread. I am reiterating them deliberately: I have not managed to fully address them on my own (or simply did not need to for TAM-to-scan-node batching), and I think they would benefit from wider input rather than just my own iteration.
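Before getting to those questions, the single-slot consumption model described earlier can also be sketched in miniature. This is a hypothetical, self-contained C model (RowSlot, RowTuple, count_matching are invented names, and the modulo check merely stands in for ExecQual): the scan slot is a borrowed pointer that is re-aimed at each tuple of the batch, so per-tuple qual evaluation proceeds without any materialization or copying:

```c
#include <assert.h>
#include <stddef.h>

typedef struct RowTuple         /* stand-in for a HeapTupleData entry */
{
    int id;
} RowTuple;

typedef struct RowSlot          /* stand-in for the reused ss_ScanTupleSlot */
{
    const RowTuple *tuple;      /* borrowed pointer; nothing is copied */
} RowSlot;

/* Model of the repoint_slot AM callback: just re-aim the slot. */
static void
repoint_slot(RowSlot *slot, const RowTuple *tup)
{
    slot->tuple = tup;
}

/* Per-tuple qual evaluation over a batch using one reused slot. */
static int
count_matching(RowSlot *slot, const RowTuple *batch, int n)
{
    int nmatch = 0;

    for (int i = 0; i < n; i++)
    {
        repoint_slot(slot, &batch[i]);  /* no copy, just re-point */
        if (slot->tuple->id % 10 == 0)  /* stand-in for ExecQual */
            nmatch++;
    }
    return nmatch;
}
```

After the loop the slot still aliases the last batch entry, which is why, in the real design, the AM's batch (and its buffer pin) must outlive the slot's use of it.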
We should also start thinking about other ways the executor can consume batch rows, not always assuming they are presented as HeapTupleData. For instance, an AM could expose decoded column arrays directly to operators that can consume them, bypassing slot-based deform entirely, or a columnar AM could implement scan_getnextbatch by decoding column strips directly into the batch without going through per-tuple HeapTupleData at all. Feedback on whether the current RowBatch design, and the choices made in the scan_getnextbatch and RowBatchOps APIs, make that sort of thing harder than it needs to be would be appreciated.

For example, heapam's implementation of scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed to HeapTupleData entries one at a time via repoint_slot in RowBatchHeapOps. That works for heapam, but a columnar AM could implement scan_getnextbatch to decode column strips directly into arrays in the batch, with no per-row repoint step needed at all. Any adjustments that would make RowBatch more AM-agnostic are worth discussing now, before the design hardens.

There are also broader open questions about how far the batch model can extend beyond the scan node. Qual pushdown into the AM has been discussed in nearby threads and would be one way to allow expression evaluation to happen before data reaches the executor proper, though that is a separate effort. For the purposes of this series, expression evaluation still happens in the executor after scan_getnextbatch returns. If the scan node does not project, the buffer heap slot is passed directly to the parent node, which calls slot callbacks to deform as needed. But once a node above projects, aggregates, or joins, the notion of a page-sized batch from a single AM loses its meaning and virtual slots take over.
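To make the columnar possibility concrete, here is a minimal, hypothetical sketch (ColBatch and colbatch_sum_id are invented names, not anything in the patch) of a batch whose payload is decoded column arrays with a validity bitmask rather than per-row tuples; an operator that understands the layout can run a tight loop directly over the array with no slot deform and no per-row repoint step:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical column-array batch payload: one typed packed array per
 * column (only "id" here) plus a validity bitmask for NULLs. Limited to
 * 64 rows per batch by the single-word mask, purely for brevity.
 */
typedef struct ColBatch
{
    const int32_t *id;          /* packed native values, not Datum-boxed */
    uint64_t       validity;    /* bit i set => row i is non-NULL */
    int            nrows;
} ColBatch;

/* A tight loop an aggregate could run directly over the column array. */
static long
colbatch_sum_id(const ColBatch *b)
{
    long sum = 0;

    for (int i = 0; i < b->nrows; i++)
    {
        if (b->validity & (UINT64_C(1) << i))
            sum += b->id[i];
    }
    return sum;
}
```

Nothing in the sketch is row-shaped, which is the kind of consumer the RowBatch API would ideally not rule out.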
Whether RowBatch is usable or meaningful beyond the scan/TAM boundary in any form, and whether the core executor will ever have non-HeapTupleData batch consumption paths or will leave that entirely to extensions, are open questions worth discussing.

For RowBatch to eventually play the role that TupleTableSlot plays for row-at-a-time execution, something inside it would need to serve as the common currency for batch data, analogous to TupleTableSlot's datum/isnull arrays. Column arrays are the obvious direction, but even that leaves open the question of representation. PostgreSQL's Datum is a pointer-sized abstraction that boxes everything, whereas vectorized systems use typed packed arrays of native types with validity bitmasks, which is a significant part of why tight vectorized loops are fast there. Whether column arrays of Datum would be good enough, or whether going further toward typed packed arrays would be necessary to get meaningful vectorization, is a deeper design question that this series deliberately does not try to answer.

Even though the focus is on getting batching working at the scan/TAM boundary first, thoughts on any of these points would be welcome.

--
Thanks, Amit Langote
Attachment
- v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch
- v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch
- v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch
- v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch
- v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch