Re: Batching in executor - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CAH2-WznijhPtw2vtwCtfFSwamwkT2O1KXMx6tE+eoHi3CKwRFg@mail.gmail.com |
| In response to | Re: Batching in executor (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: Batching in executor |
| List | pgsql-hackers |
On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <tomas@vondra.me> wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.

I've been working on a new prototype enhancement to the index prefetching patch. The new spinoff patch has index scans batch up calls to heap_hot_search_buffer for heap TIDs that the scan has yet to return. This optimization is effective whenever an index scan returns a contiguous group of TIDs that all point to the same heap page. We're able to lock and unlock heap page buffers at the same point that they're pinned and unpinned, which can dramatically decrease the number of heap buffer locks acquired by index scans that return contiguous TIDs (which is very common).

I find that pgbench SELECT variants with a predicate such as "WHERE aid BETWEEN 1000 AND 1500" can get up to ~20% higher throughput, at least in cases with low client counts (think 1 or 2 clients). These are cases where everything fits in shared buffers, so we're not getting any benefit from I/O prefetching (in spite of the fact that this is built on top of the index prefetching patchset).

It makes sense to put this in scope for the index prefetching work because that work will already give code outside of an index AM visibility into which group of TIDs needs to be read next. Right now (on master) there is some trivial sense in which index AMs use their own batches, but that's completely hidden from external callers.

> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls.
> The only AM-specific bit in the batch is "position", but that's used
> only when advancing to the next page, etc.

The major difficulty with my heap batching prototype is getting the layering right (no surprises there). In some sense we're deliberately sharing information across what we currently think of as different layers of abstraction, in order to be able to "schedule" the work more intelligently. There are a number of competing considerations.

I have invented a new concept of heap batch, which is orthogonal to the existing concept of index batches. Right now these are just an array of HeapTuple structs that relate to exactly one group of contiguous heap TIDs (i.e. if the index scan returns TIDs even a little out of order, which is fairly common, the current prototype patch cannot reorder the work). Once a batch is prepared, calls to heapam_index_fetch_tuple just return the next TID from the batch (until the next time we have to return a TID pointing to some distinct heap block). In the case of pgbench queries like the one I mentioned, we only need to call LockBuffer/heap_hot_search_buffer once for every 61 heap tuples returned (not once per heap tuple returned).

Importantly, the new interface added by my new prototype spinoff patch is higher level than the existing table_index_fetch_tuple/heapam_index_fetch_tuple interface. The executor asks the table AM "give me the next heap TID in the current scan direction", rather than asking "give me this heap TID". The general idea is that the table AM has a direct understanding of ordered index scans. The advantage of this higher-level interface is that it gives the table AM maximum freedom to reorder work. As I said already, we won't do things like merge together logically noncontiguous accesses to the same heap page into one physical access right now. But I think that that should at least be enabled by this interface.
The downside of this approach is that the table AM (not the executor proper) is responsible for interfacing with the index AM layer. I think that this can be generalized without very much code duplication across table AMs. But it's hard.

> This patch does things differently. IIUC, each TAM may produce it's own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

I think that the base index prefetching patch's current notion of index-AM-wise batches can be kept quite separate from any table AM batch concept that might be invented, either as part of what I'm working on, or in Amit's patch. It probably wouldn't be terribly difficult to get the new interface I've described to return heap tuples in whatever batch format Amit comes up with. That only has a benefit if it makes life easier for expression evaluation in higher levels of the plan tree, but it might just make sense to always do it that way. I doubt that adopting Amit's batch format will make life much harder for the heap_hot_search_buffer-batching mechanism (at least if it is generally understood that its new index scan interface builds batches in Amit's format on a best-effort basis).

--
Peter Geoghegan