Re: Batching in executor - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CA+HiwqHtoQ5Zzf0R2PqkAfKrPr+_taeudE5t2qDsesAscmuujg@mail.gmail.com |
| In response to | Re: Batching in executor (Peter Geoghegan <pg@bowt.ie>) |
| List | pgsql-hackers |
Hi Peter,

Thanks for chiming in here.

On Tue, Oct 28, 2025 at 2:37 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <tomas@vondra.me> wrote:
> > While looking at the patch, I couldn't help but think about the index
> > prefetching stuff that I work on. It also introduces the concept of a
> > "batch", for passing data between an index AM and the executor. It's
> > interesting how different the designs are in some respects. I'm not
> > saying one of those designs is wrong, it's more due to different goals.
>
> I've been working on a new prototype enhancement to the index
> prefetching patch. The new spinoff patch has index scans batch up
> calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
> return. This optimization is effective whenever an index scan returns
> a contiguous group of TIDs that all point to the same heap page. We're
> able to lock and unlock heap page buffers at the same point that
> they're pinned and unpinned, which can dramatically decrease the
> number of heap buffer locks acquired by index scans that return
> contiguous TIDs (which is very common).
>
> I find that pgbench SELECT variants with a predicate such as
> "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
> throughput, at least in cases with low client counts (think 1 or 2
> clients). These are cases where everything can fit in shared buffers,
> so we're not getting any benefit from I/O prefetching (in spite of the
> fact that this is built on top of the index prefetching patchset).

I gathered from the index prefetching thread that it is mainly about
enabling I/O prefetching, so it's nice to see that kind of speedup even
for the in-memory case.

Is this spinoff patch separate from the one that adds amgetbatch() to
IndexAmRoutine, which you posted on Oct 12? If so, where can I find it?
> It makes sense to put this in scope for the index prefetching work
> because that work will already give code outside of an index AM
> visibility into which group of TIDs needs to be read next. Right now
> (on master) there is some trivial sense in which index AMs use their
> own batches, but that's completely hidden from external callers.

As you might know, heapam's TableAmRoutine.scan_* functions use a
"pagemode" in some cases, which fills a batch of tuples in
HeapScanDescData.rs_vistuples. However, that batch currently only stores
the tuples' offset numbers. I started this work based on Andres's
suggestion to propagate that batch up into the executor's scan nodes. The
idea is to create a HeapTuple array sized according to the executor's
batch size, and then populate it when the scan node calls the new
TableAmRoutine.scan_batch* variant. There might be some overlap between
our respective ideas.

> > For example, the index prefetching patch establishes a "shared" batch
> > struct, and the index AM is expected to fill it with data. After that,
> > the batch is managed entirely by indexam.c, with no AM calls. The only
> > AM-specific bit in the batch is "position", but that's used only when
> > advancing to the next page, etc.
>
> The major difficulty with my heap batching prototype is getting the
> layering right (no surprises there). In some sense we're deliberately
> sharing information across what we currently think of as different
> layers of abstraction, in order to be able to "schedule" the work more
> intelligently. There are a number of competing considerations.
>
> I have invented a new concept of a heap batch, that is orthogonal to the
> existing concept of index batches. Right now these are just an array of
> HeapTuple structs that relate to exactly one group of contiguous heap
> TIDs (i.e.
> if the index scan returns TIDs even a little out of order, which is
> fairly common, we cannot currently reorder the work in the current
> prototype patch).
>
> Once a batch is prepared, calls to heapam_index_fetch_tuple just
> return the next TID from the batch (until the next time we have to
> return a TID pointing to some distinct heap block). In the case of
> pgbench queries like the one I mentioned, we only need to call
> LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
> returned (not once per heap tuple returned).
>
> Importantly, the new interface added by my new prototype spinoff patch
> is higher level than the existing
> table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
> executor asks the table AM "give me the next heap TID in the current
> scan direction", rather than asking "give me this heap TID". The
> general idea is that the table AM has a direct understanding of
> ordered index scans.
>
> The advantage of this higher-level interface is that it gives the
> table AM maximum freedom to reorder work. As I said already, we won't
> do things like merge together logically noncontiguous accesses to the
> same heap page into one physical access right now. But I think that
> that should at least be enabled by this interface.

Interesting. It sounds like you aim to replace the fetch_tuple interface
with a more generic one, is that right?

> The downside of this approach is that the table AM (not the executor
> proper) is responsible for interfacing with the index AM layer. I
> think that this can be generalized without very much code duplication
> across table AMs. But it's hard.

Seems so.

> > This patch does things differently. IIUC, each TAM may produce its
> > own "batch", which is then wrapped in a generic one. For example,
> > heap produces HeapBatch, and it gets wrapped in TupleBatch. But I
> > think this is fine.
> > In the prefetching we chose to move all this code (walking the batch
> > items) from the AMs into the layer above, and make it AM agnostic.
>
> I think that the base index prefetching patch's current notion of
> index-AM-wise batches can be kept quite separate from any table AM
> batch concept that might be invented, either as part of what I'm
> working on, or in Amit's patch.
>
> It probably wouldn't be terribly difficult to get the new interface
> I've described to return heap tuples in whatever batch format Amit
> comes up with. That only has a benefit if it makes life easier for
> expression evaluation in higher levels of the plan tree, but it might
> just make sense to always do it that way. I doubt that adopting Amit's
> batch format will make life much harder for the
> heap_hot_search_buffer-batching mechanism (at least if it is generally
> understood that its new index scan interface builds batches in Amit's
> format on a best-effort basis).

In my implementation, the new TableAmRoutine.scan_getnextbatch() returns
a batch as an opaque table AM structure, which can then be passed up to
the upper levels of the plan. Patch 0001 in my series adds the following
to the TableAmRoutine API:

```
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	void	   *(*scan_begin_batch) (TableScanDesc sscan, int maxitems);
+	int			(*scan_getnextbatch) (TableScanDesc sscan, void *am_batch,
+									  ScanDirection dir);
+	void		(*scan_end_batch) (TableScanDesc sscan, void *am_batch);
```

I haven't seen what your version looks like, but if it is compatible with
the above, I'd be happy to adopt a batch format that accommodates
multiple use cases.

--
Thanks,
Amit Langote