Re: index prefetching - Mailing list pgsql-hackers

From Amit Langote
Subject Re: index prefetching
Msg-id CA+HiwqFrT-fpZuj7hAY1uujN-jp31wS-LB=AtryT=mCOjGkj+w@mail.gmail.com
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Fri, Dec 5, 2025 at 6:11 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Dec 4, 2025 at 12:54 AM Amit Langote <amitlangote09@gmail.com> wrote:
> > I want to acknowledge that figuring out the right layering to make I/O
> > prefetching and perhaps other optimizations internal to IndexNext()
> > work is obviously the priority right now, regardless of the output
> > format used to populate the slots ultimately returned by
> > table_index_getnext_slot().
>
> Right; table_index_getnext_slot simply returns a tuple into the
> caller's slot. That's almost the same as the existing getnext_slot
> interface used by those same call sites on the master branch, except
> that in the patch we're directly calling a table AM callback/heapam
> specific implementation (not code in indexam.c).
>
> The new heapam implementation heapam_index_getnext_slot applies more
> high-level context about ordered index scans, which enables it to
> reorder work quite freely, even when it is work that takes place in
> index AMs.
>
> > However, regarding your question about
> > "painting ourselves into a corner":
> >
> > In my executor batching work (which has focused on Seq Scans), the
> > HeapBatch is essentially just a pinned buffer plus an array of
> > pre-allocated tuple headers. I hadn't strictly considered creating a
> > HeapBatch to return from Index Scans, largely because
> > heap_hot_search_buffer() is designed for scalar (or non-batched)
> > access that requires repeated buffer locking.
> >
> > But it seems like the eventual goal of batching calls to
> > heap_hot_search_buffer() effectively clears that hurdle.
>
> Actually, that's not the eventual goal anymore; now we're treating it
> as our *immediate* goal, at least in terms of things that will have
> user-visible impact (as opposed to API changes needed to facilitate
> batching type optimizations in the future, including I/O prefetching).

That makes sense. I had thought the "vectorized HOT search" (batching
calls to heap_hot_search_buffer()) was a distant goal, so pulling it
into Postgres 19 is great news. It means the internal mechanics for
"group-by-page" access, which seem like the hardest part of batching
an index scan, will be in place sooner rather than later, which should
also let us share some of the machinery for batching the output.
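Just so we mean the same thing by "group-by-page": I picture something
roughly like the sketch below, where every index-returned TID that
falls on one heap block is resolved under a single pin/lock cycle,
instead of re-locking the buffer once per TID the way scalar callers
of heap_hot_search_buffer() do today. The function name and shape are
mine, purely illustrative, not taken from either patch:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

/*
 * Illustrative only: fetch all TIDs on one heap block with a single
 * pin/lock cycle.  The pin is returned to the caller (e.g. a batch),
 * which must hold it for as long as any returned tuple may be used.
 */
static Buffer
fetch_tids_on_block(Relation heapRel, Snapshot snapshot,
                    ItemPointerData *tids, int ntids,
                    HeapTupleData *tuples)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(&tids[0]);
    Buffer      buf = ReadBuffer(heapRel, blkno);

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    for (int i = 0; i < ntids; i++)
    {
        bool        all_dead;

        /* caller guarantees all TIDs fall on the same block */
        Assert(ItemPointerGetBlockNumber(&tids[i]) == blkno);

        /* resolves the HOT chain; t_data points into the pinned page */
        if (!heap_hot_search_buffer(&tids[i], heapRel, buf, snapshot,
                                    &tuples[i], &all_dead, true))
            tuples[i].t_data = NULL;    /* no visible version */
    }
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    return buf;
}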

> But the
> heap_hot_search_buffer thing definitely is in scope for Postgres 19
> (if we're going to make all these API changes then it seems best to
> give users an immediate benefit).

+1

> > As long as
> > the internal logic separates the "grouping/locking" from the
> > "materializing into a slot," it seems this design does not prevent us
> > from eventually wiring up a table_index_getnext_batch() to populate
> > the HeapBatch structure I am proposing for the regular non-index scan
> > path (table_scan_getnextbatch() in my patch).
>
> That's good.
>
> Suppose we do a much more advanced version of the kind of work
> reordering that the heap_hot_search_buffer thing will do for Postgres
> 19. I described this to Tomas in my last email to this thread, when I
> said:
>
> """
> We could even do something much more sophisticated than what I
> actually have planned for 19: we could reorder table fetches, such
> that we only had to lock and pin each heap page exactly once *even
> when the index scan returns TIDs slightly out of
> order*. For example, if an index page/batch returns TIDs "(1,1),
> (2,1), (1,2), (1,3), (2,2)", we could get all tuples for heap blocks 1
> and 2 by locking and pinning each of those 2 pages exactly once. The
> only downside (other than the complexity) is that we'd sometimes hold
> multiple heap page pins at a time, not just one.
> """
>
> (To be clear this more advanced version is definitely out of scope for
> Postgres 19.)
>
> We'd be holding on to multiple buffer pins at a time (across calls to
> heapam_index_getnext_slot) were we to do this more advanced
> optimization. I *think* that still means that the design/internal
> logic will (as you put it) "separate the 'grouping/locking' from the
> 'materializing into a slot'". That's just the only way that could
> possibly work correctly, at least with heapam.

Agreed. Even if a future version holds multiple pins to handle
out-of-order TIDs, the architectural separation still holds: the table
AM would simply populate a batch that spans those multiple pinned
buffers (or some more complex batch structure), while the interface
between it and the executor remains the same.
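Concretely, I imagine the multi-pin case would only change what the
batch owns, not what the executor sees. A purely illustrative sketch
(none of these names or limits are from my patch):

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"

#define MAX_BATCH_BUFFERS 8     /* illustrative limits, not real knobs */
#define MAX_BATCH_TUPLES  64

/* A batch whose tuples may point into several pinned heap pages. */
typedef struct HeapBatchMulti
{
    int         nbuffers;
    Buffer      buffers[MAX_BATCH_BUFFERS];
    int         ntuples;
    HeapTupleData tuples[MAX_BATCH_TUPLES];
} HeapBatchMulti;

/* Called by the AM once the executor is done with the batch. */
static void
heap_batch_multi_reset(HeapBatchMulti *batch)
{
    for (int i = 0; i < batch->nbuffers; i++)
        ReleaseBuffer(batch->buffers[i]);
    batch->nbuffers = 0;
    batch->ntuples = 0;
}

The executor-facing contract stays the same either way: tuples remain
valid until the AM resets the batch.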

> It makes sense for us both to (at a minimum) have at least some
> general awareness of each other's goals. I really only want to avoid
> completely gratuitous incompatibilities/conflicts. For example, if you
> invent a new slot-like mechanism in the executor that can return
> multiple tuples in one go, then it seems like we should probably try
> to use that in our own work on batching. If we're already assembling
> the information in a way that almost works with that new interface,
> why wouldn't we make sure that it actually worked with and used that
> new interface directly?
>
> It doesn't sound like there'd be many disagreements on how that would
> have to work, since the requirements are largely dictated by existing
> constraints that we're both already naturally subject to. For example:
>
> * We need to hold on to a buffer pin on a heap page if one of its heap
> tuples is contained in a slot/something slot-like. For as long as
> there's any chance that somebody will examine that heap tuple (until
> the slot releases the tuple).
>
> * Buffer locks must only be acquired by lower-level access method
> code, for very short periods, and never in a way that requires
> coordination across module boundaries.

Your list of constraints matches my experience from building batches
for Seq Scans.

My current HeapBatch implementation for Seq Scans (received by the
executor in a TupleBatch container) is designed exactly around that
first point. It fills the batch from the currently pinned page,
stopping either when the batch capacity (a GUC in the next version of
my patch) is reached or when we reach the end of the page. I
deliberately avoid spanning multiple pages in a single batch (for now)
to keep the pin management simple.
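The fill loop has roughly this shape (simplified and paraphrased, with
an illustrative stand-in for the actual HeapBatch layout; visibility
checking is elided):

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Illustrative stand-in for the HeapBatch in my patch. */
typedef struct HeapBatchSketch
{
    Buffer      buffer;         /* pin held while tuples are in use */
    int         capacity;       /* would come from the GUC */
    int         ntuples;
    HeapTupleData tuples[FLEXIBLE_ARRAY_MEMBER];
} HeapBatchSketch;

/*
 * Fill the batch from the page the scan currently has pinned, stopping
 * at the batch capacity or the end of the page, whichever comes first.
 */
static void
heap_batch_fill_from_page(HeapScanDesc scan, HeapBatchSketch *batch)
{
    Page        page = BufferGetPage(scan->rs_cbuf);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    batch->buffer = scan->rs_cbuf;      /* batch now guards the pin */

    while (batch->ntuples < batch->capacity && scan->rs_coffset <= maxoff)
    {
        ItemId      itemid = PageGetItemId(page, scan->rs_coffset);

        if (ItemIdIsNormal(itemid))
        {
            HeapTuple   tup = &batch->tuples[batch->ntuples++];

            /* t_data points into the pinned page; no copy is made */
            tup->t_data = (HeapTupleHeader) PageGetItem(page, itemid);
            tup->t_len = ItemIdGetLength(itemid);
            ItemPointerSet(&tup->t_self, scan->rs_cblock, scan->rs_coffset);
        }
        scan->rs_coffset++;
    }
}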

From the executor's perspective, HeapBatch is just an opaque
container: I rely on the table AM to manage the underlying resources
(pins) that keep the data valid. This aligns well with your work,
because it leaves the pin management strategy, whether single-page or
multi-page, entirely up to the AM implementation without exposing
those details to the scan node.
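In other words, the scan node's consumption loop can stay this dumb
(illustrative; table_scan_getnextbatch() is the callback from my
patch, modulo its actual signature, while the TupleBatch accessors are
placeholder names):

/*
 * The scan node drains the batch and refills it through the table AM,
 * never touching buffer pins itself.  TupleBatchGetSize() and
 * TupleBatchGetSlot() are placeholder accessor names.
 */
static TupleTableSlot *
batch_scan_next(TableScanDesc scandesc, TupleBatch *batch, int *pos)
{
    while (*pos >= TupleBatchGetSize(batch))
    {
        /* refill: the AM is free to drop old pins and take new ones */
        if (!table_scan_getnextbatch(scandesc, ForwardScanDirection, batch))
            return NULL;        /* end of scan */
        *pos = 0;
    }
    return TupleBatchGetSlot(batch, (*pos)++);
}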

> It sounds like the potential for conflicts between each other's work
> will be absolutely minimal. It seems as if we don't even have to agree
> on anything new or novel.
>
> > Sorry to hijack the thread, but just wanted to confirm I haven't
> > misunderstood the architectural implications for future batching.
>
> I don't think that you've hijacked anything. Your input is more than welcome.

Thanks. It would be nice to keep this channel open as your API
evolves. Based on what I understand so far, the heapam internals of
this work seem compatible with how I am populating "output" batches in
my executor work, so I just want to ensure we don't accidentally
diverge on that front.

--
Thanks, Amit Langote


