Re: index prefetching - Mailing list pgsql-hackers

From Andres Freund
Subject Re: index prefetching
Msg-id chsvntdxvsiyigxq4nng36gne4natvxwvsqnkvbjlpaw6bu7co@a6togdo4wbrj
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

On 2026-03-31 13:39:30 -0400, Peter Geoghegan wrote:
> On Fri, Mar 27, 2026 at 8:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached is v18, which addresses the tricky issues I skipped in v17.
>
> v19 is attached. It contains a few notable improvements over v18:
>
> * The first commit (the one that simply moves code into the new
> heapam_indexscan.c file) now moves heap_hot_search_buffer over there
> as well.
>
> Andres suggested this at one point, and I believe it wins us a small
> but significant performance benefit.

To me it also just makes sense.


> From 742dcbea7d184443d2c4ef3c095b60f47f870f72 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Wed, 25 Mar 2026 00:22:17 -0400
> Subject: [PATCH v19 01/18] Move heapam_handler.c index scan code to new file.
>
> Move the heapam index fetch callbacks (index_fetch_begin,
> index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
> dedicated file.  Also move heap_hot_search_buffer over.  This is a
> purely mechanical move with no functional impact.

I don't love that this renames arguments (s/call_again/heap_continue/) while
moving. Makes it harder to verify this is just the mechanical change it claims
to be.


> +/*-------------------------------------------------------------------------
> + *
> + * heapam_indexscan.c
> + *      heap table plain index scan and index-only scan code
> + *
> + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *      src/backend/access/heap/heapam_indexscan.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/heapam.h"
> +#include "storage/predicate.h"

Should probably include access/relscan.h for +IndexFetchTableData, rather than
relying on the indirect include.



> From 5ca838ac32a8d43510325cb400be4ecea47ea8d2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Sun, 22 Mar 2026 02:36:57 -0400
> Subject: [PATCH v19 02/18] Add slot-based table AM index scan interface.
>
> Add table_index_getnext_slot, a new table AM callback that wraps both
> plain and index-only index scans that use amgettuple.  Two new
> TableAmRoutine callbacks are introduced -- one for plain scans and one
> for index-only scans -- which an upcoming commit that adds the
> amgetbatch interface will expand to four.  The appropriate callback is
> resolved once in index_beginscan, and called through a function pointer
> (xs_getnext_slot) on the IndexScanDesc when the table_index_getnext_slot
> shim function is called from executor nodes.
>
> This moves VM checks for index-only scans out of the executor and into
> heapam, enabling batching of visibility map lookups (though for now we
> continue to just perform retail lookups).

I'd also add that it's architecturally much cleaner this way.


> A small minority of callers (2 callers in total) continue to use the old
> table_index_fetch_tuple interface.  This is still necessary with callers
> that fundamentally need to pass a TID to the table AM.  Both callers
> perform some kind of constraint enforcement.

I think we may need to do something about this before all of this is over. It's
just too confusing to have the generic sounding index_fetch_tuple() interfaces
that are barely ever used and *shouldn't* be used when, uhm, fetching a tuple
pointed to by an index.

At the very least it seems we should rename the ->index_fetch_tuple() to
->index_fetch_tid() or something and remove the call_again, all_dead
arguments, since they are never used in this context.  I suspect it should
*not* use the table_index_fetch_begin()/table_index_fetch_end() interface
either.


> XXX Do we still need IndexFetchTableData.rel?  We are now mostly using
> IndexScanDescData.heapRelation, but we could use it for absolutely
> everything if we were willing to revise the signature of the old
> table_index_fetch_tuple interface by giving it its own heapRelation arg.

I think we *should* change the argument of table_index_fetch_tuple().


> Index-only scan callers pass table_index_getnext_slot a TupleTableSlot
> (which the table AM needs internally for heap fetches), but continue to
> read their results from IndexScanDescData fields such as xs_itup (rather
> than from the slot itself).

Pretty this ain't.  But I guess it's been vaguely gross like this for quite a
while, so...


> This convention lets both plain and index-only scans use the same
> table_index_getnext_slot interface.

Not convinced that's really a useful goal.


> All callers can continue to rely on the scan descriptor's xs_heaptid field
> being set on each call. (execCurrentOf is the only caller that reads this
> field directly, to determine the current row of an index-only scan, but it
> seems like a good idea to do the same thing for all callers).

> XXX Should execCurrentOf continue to do this?  Can't it get the same TID
> from the table slot instead?

I think the tid from the slot is always the tid from the start of the HOT
chain, not the actual location of the tuple. That's important, as otherwise a
HOT pruning after determining the slot's tid would potentially make the tid
invalid.  Whereas, I think, xs_heaptid is currently set to the offset of the
actual tuple, even if it's just a HOT version.


> @@ -177,10 +181,13 @@ typedef struct IndexScanDescData
>      struct TupleDescData *xs_hitupdesc; /* rowtype descriptor of xs_hitup */
>
>      ItemPointerData xs_heaptid; /* result */
> -    bool        xs_heap_continue;    /* T if must keep walking, potential
> -                                     * further results */
>      IndexFetchTableData *xs_heapfetch;
>
> +    /* Resolved getnext_slot implementation, set by index_beginscan */
> +    bool        (*xs_getnext_slot) (struct IndexScanDescData *scan,
> +                                    ScanDirection direction,
> +                                    struct TupleTableSlot *slot);
> +
>      bool        xs_recheck;        /* T means scan keys must be rechecked */
>
>      /*

Hm. Why is it ok that we now don't have an equivalent of xs_heap_continue on
the scan level anymore?  I see there's a local heap_continue in
heapam_index_getnext_slot(), but isn't that insufficient e.g. for a
SNAPSHOT_ANY snapshot, where all the row versions would be visible?

Am I misunderstanding how this works?



> +/*
> + * Fetch the next tuple from an index scan into `slot`, scanning in the
> + * specified direction.  Returns true if a tuple satisfying the scan keys and
> + * the snapshot was found, false otherwise.  The tuple is stored in the
> + * specified slot.
> + *
> + * Dispatches through scan->xs_getnext_slot, which is resolved once by
> + * index_beginscan.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * released by a future table_index_getnext_slot or index_endscan call.
> + *
> + * Note: caller must check scan->xs_recheck, and perform rechecking of the
> + * scan keys if required.  We do not do that here because we don't have
> + * enough information to do it efficiently in the general case.
> + *
> + * For index-only scans, the callback also fills xs_itup/xs_itupdesc or
> + * xs_hitup/xs_hitupdesc (or both) so that index data can be returned without
> + * a heap fetch.
> + */
> +static inline bool
> +table_index_getnext_slot(IndexScanDesc iscan, ScanDirection direction,
> +                         TupleTableSlot *slot)
> +{
> +    return iscan->xs_getnext_slot(iscan, direction, slot);
> +}

Hm. Does this actually belong in tableam.h? I'm a bit conflicted.


> @@ -2553,6 +2554,8 @@ static const TableAmRoutine heapam_methods = {
>      .index_fetch_begin = heapam_index_fetch_begin,
>      .index_fetch_reset = heapam_index_fetch_reset,
>      .index_fetch_end = heapam_index_fetch_end,
> +    .index_plain_amgettuple_getnext_slot = heapam_index_plain_amgettuple_getnext_slot,
> +    .index_only_amgettuple_getnext_slot = heapam_index_only_amgettuple_getnext_slot,
>      .index_fetch_tuple = heapam_index_fetch_tuple,
>
>      .tuple_insert = heapam_tuple_insert,

Wonder if it's perhaps worth deleting _slot and shortening getnext to
next. These are quite long names...


> +/* table_index_getnext_slot callback: amgettuple, plain index scan */
> +pg_attribute_hot bool
> +heapam_index_plain_amgettuple_getnext_slot(IndexScanDesc scan,
> +                                           ScanDirection direction,
> +                                           TupleTableSlot *slot)
> +{
> +    Assert(!scan->xs_want_itup);
> +    Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +    return heapam_index_getnext_slot(scan, direction, slot, false);
> +}
> +
> +/* table_index_getnext_slot callback: amgettuple, index-only scan */
> +pg_attribute_hot bool
> +heapam_index_only_amgettuple_getnext_slot(IndexScanDesc scan,
> +                                          ScanDirection direction,
> +                                          TupleTableSlot *slot)
> +{
> +    Assert(scan->xs_want_itup);
> +    Assert(scan->indexRelation->rd_indam->amgettuple != NULL);
> +
> +    return heapam_index_getnext_slot(scan, direction, slot, true);
> +}
> +
>  bool
>  heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>                           ItemPointer tid,
> @@ -233,6 +274,18 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>                           TupleTableSlot *slot,
>                           bool *heap_continue, bool *all_dead)
>  {
> +
> +    return heapam_index_fetch_tuple_impl(scan, tid, snapshot, slot,
> +                                         heap_continue, all_dead);
> +}
> +
> +static pg_attribute_always_inline bool
> +heapam_index_fetch_tuple_impl(struct IndexFetchTableData *scan,
> +                              ItemPointer tid,
> +                              Snapshot snapshot,
> +                              TupleTableSlot *slot,
> +                              bool *heap_continue, bool *all_dead)
> +{
>      IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>      BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
>      bool        got_heap_tuple;
> @@ -289,3 +342,172 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
>
>      return got_heap_tuple;
>  }
> +
> +/*
> + * Common implementation for both heapam_index_*_getnext_slot variants.
> + *
> + * The result is true if a tuple satisfying the scan keys and the snapshot was
> + * found, false otherwise.  The tuple is stored in the specified slot.
> + *
> + * On success, resources (like buffer pins) are likely to be held, and will be
> + * dropped by a future call here (or by a later call to heapam_index_fetch_end
> + * through index_endscan).
> + *
> + * The index_only parameter is a compile-time constant at each call site,
> + * allowing the compiler to specialize the code for each variant.
> + */
> +static pg_attribute_always_inline bool
> +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
> +                          TupleTableSlot *slot, bool index_only)

For heapam_index_fetch_tuple_impl() you added _impl, but here you didn't.
Mild preference for also adding _impl here.


> +{
> +    IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch;
> +    ItemPointer tid;
> +    bool        heap_continue = false;

As mentioned above, it's not clear to me that it is correct for all scan types
to have a short-lived heap_continue.


> +    bool        all_visible = false;
> +    BlockNumber last_visited_block = InvalidBlockNumber;
> +    uint8        n_visited_pages = 0,
> +                xs_visited_pages_limit = 0;
> +
> +    if (index_only)
> +        xs_visited_pages_limit = scan->xs_visited_pages_limit;

Did you check that it's useful to have xs_visited_pages_limit as a longer
lived variable? I suspect it's just going to add register pressure and will
need to be saved/restored across other calls, making it just as cheap to
always fetch it from scan.


> +    for (;;)
> +    {
> +        if (!heap_continue)
> +        {
> +            /* Get the next TID from the index */
> +            tid = index_getnext_tid(scan, direction);
> +
> +            /* If we're out of index entries, we're done */
> +            if (tid == NULL)
> +                break;
> +
> +            /* For index-only scans, check the visibility map */
> +            if (index_only)
> +                all_visible = VM_ALL_VISIBLE(scan->heapRelation,
> +                                             ItemPointerGetBlockNumber(tid),
> +                                             &hscan->xs_vmbuffer);
> +        }
> +
> +        Assert(ItemPointerIsValid(&scan->xs_heaptid));
> +
> +        if (index_only)
> +        {
> +            /*
> +             * We can skip the heap fetch if the TID references a heap page on
> +             * which all tuples are known visible to everybody.  In any case,
> +             * we'll use the index tuple not the heap tuple as the data
> +             * source.
> +             */
> +            if (!all_visible)
> +            {
> +                /*
> +                 * Rats, we have to visit the heap to check visibility.
> +                 */
> +                if (scan->instrument)
> +                    scan->instrument->ntablefetches++;
> +
> +                if (!heapam_index_fetch_heap(scan, slot, &heap_continue))
> +                {

Still find it somewhat weird that we have a local 'tid' variable that we do
not pass to heapam_index_fetch_heap(), because heapam_index_fetch_heap()
relies on index_getnext_tid() having stored the to-be-fetched tid in
xs_heaptid.

But I guess that's something to change at a later date.


> +                    /*
> +                     * No visible tuple.  If caller set a visited-pages limit
> +                     * (only selfuncs.c does this), count distinct heap pages
> +                     * and give up once we've visited too many.
> +                     */
> +                    if (unlikely(xs_visited_pages_limit > 0))
> +                    {
> +                        BlockNumber blk = ItemPointerGetBlockNumber(tid);
> +
> +                        if (blk != last_visited_block)
> +                        {
> +                            last_visited_block = blk;
> +                            if (++n_visited_pages > xs_visited_pages_limit)
> +                                return false;    /* give up */
> +                        }
> +                    }
> +                    continue;    /* no visible tuple, try next index entry */
> +                }
> +
> +                ExecClearTuple(slot);

Previously the ExecClearTuple() here had at least this comment:
/* We don't actually need the heap tuple for anything */

Now it looks even more confusing...


> @@ -313,7 +313,32 @@ visibilitymap_set(BlockNumber heapBlk,
>   * since we don't lock the visibility map page either, it's even possible that
>   * someone else could have changed the bit just before we look at it, but yet
>   * we might see the old value.  It is the caller's responsibility to deal with
> - * all concurrency issues!
> + * all concurrency issues!  In practice it can't be stale enough to matter for
> + * the primary use case: index-only scans that check whether a heap fetch can
> + * be skipped.
> + *
> + * The argument for why it can't be stale enough to matter for the primary use
> + * case is as follows:
> + *
> + * Inserts: we need to detect that a VM bit was cleared by an insert right
> + * away, because the new tuple is present in the index but not yet visible.
> + * Reading the TID from the index page (under a shared lock on the index
> + * buffer) is serialized with the insertion of the TID into the index (under
> + * an exclusive lock on the same index buffer).  Because the VM bit is cleared
> + * before the index is updated, and locking/unlocking of the index page acts
> + * as a full memory barrier, we are sure to see the cleared bit whenever we
> + * see a recently-inserted TID.
> + *
> + * Deletes: the clearing of the VM bit by a delete is NOT serialized with the
> + * index page access, because deletes do not update the index page (only
> + * VACUUM removes the index TID).  So we may see a significantly stale value.
> + * However, we don't need to detect the delete right away, because the tuple
> + * remains visible until the deleting transaction commits or the statement
> + * ends (if it's our own transaction).  In either case, the lock on the VM
> + * buffer will have been released (acting as a write barrier) after clearing
> + * the bit.  And for us to have a snapshot that includes the deleting
> + * transaction (making the tuple invisible), we must have acquired
> + * ProcArrayLock after that time, acting as a read barrier.
>   */
>  uint8
>  visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)

Nice. Much better place for the comment.


> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
> index 1408989c5..acc9f3e6a 100644
> --- a/src/backend/access/index/genam.c
> +++ b/src/backend/access/index/genam.c
> @@ -126,6 +126,8 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
>      scan->xs_hitup = NULL;
>      scan->xs_hitupdesc = NULL;
>
> +    scan->xs_visited_pages_limit = 0;
> +
>      return scan;
>  }

Orthogonal: I suspect eventually we should pass the table to
RelationGetIndexScan(), have the tableam return how much space it needs, and
allocate it in one chunk.




> From 85be7bf520fc2746f4ee73a0744c7aa04da55e52 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Tue, 10 Mar 2026 14:40:35 -0400
> Subject: [PATCH v19 03/18] heapam: Track heap block in IndexFetchHeapData.

LGTM.


> Subject: [PATCH v19 04/18] heapam: Keep buffer pins across index rescans.
>
> Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
> (xs_vmbuffer) during heapam_index_fetch_reset.  Retaining these pins
> saves cycles during tight nested loop joins and merge joins that
> frequently restore a saved mark, since the next tuple fetched after a
> rescan often falls on the same heap page.  It can also avoid repeated
> pinning and unpinning of the same buffer when rescans happen to revisit
> the same page.
>
> Preparation for an upcoming patch that will add the amgetbatch
> interface to enable optimizations such as I/O prefetching.
>
> Author: Peter Geoghegan <pg@bowt.ie>
> Reviewed-By: Andres Freund <andres@anarazel.de>
> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
> ---
>  src/backend/access/heap/heapam_indexscan.c | 27 ++++++++++++----------
>  src/backend/access/index/indexam.c         |  6 ++---
>  2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
> index 4b5e2b30d..82f135e1e 100644
> --- a/src/backend/access/heap/heapam_indexscan.c
> +++ b/src/backend/access/heap/heapam_indexscan.c
> @@ -59,18 +59,15 @@ heapam_index_fetch_reset(IndexFetchTableData *scan)
>  {
>      IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
>
> -    if (BufferIsValid(hscan->xs_cbuf))
> -    {
> -        ReleaseBuffer(hscan->xs_cbuf);
> -        hscan->xs_cbuf = InvalidBuffer;
> -        hscan->xs_blk = InvalidBlockNumber;
> -    }
> +    /* Resets are a no-op (XXX amgetbatch commit resets xs_vm_items here) */
> +    (void) hscan;

I assume that's not a line you intend to commit?

LGTM otherwise.


Need to do something else for a while.  More another time.


Greetings,

Andres Freund


