Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Msg-id | CAH2-WzkWjKTpV1RnPcwMewyGtYdbEcdpp6vYM3=9dnS3w5PHEw@mail.gmail.com |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Mon, Dec 1, 2025 at 11:32 AM Tomas Vondra <tomas@vondra.me> wrote:

> Thanks for the new version! I like the layering in this patch, moving
> some of the stuff from indexam.c/executor to table AM. It makes some of
> the code much cleaner, I think.

Yeah, I think so too. Clearly the main way that this improves the design is by avoiding implicit coordination between code paths that are a great distance away from each other. That's particularly true with index prefetching, where previously we maintained a visibility info cache for use in indexOnlyscan.c that was also used by the read stream callback. There's no need for such coordination if it all has to happen from the same few table AM routines.

> > I'm sure that I'll have made numerous mistakes in this new v1. There
> > will certainly be some bugs, and some of the exact details of how I'm
> > doing the layering are likely suboptimal or even wrong. I am
> > nevertheless cautiously optimistic that this will be the last major
> > redesign that will be required for this project.
>
> Sounds good. FWIW I don't see any major issues in this version.

I was thinking of stuff like how the heapam data structure still doesn't actually contain the read stream, so that indexam.c can call indexbatch.c and do things like reset the read stream if necessary. Stuff like that. Maybe we should be calling index_batch_init from heapam_handler.c, and not from indexam.c (we still do the latter). OTOH, maybe that'd be a case of adding more mechanism for no real benefit. These kinds of design choices are relatively unimportant, but they did seem like the kind of thing that I'm relatively likely to have messed up in this v1.

> > (It isn't intrinsically all that complicated to add the optimization
> > with this new table AM orientated structure, but doing so would have
> > made performance validation work/avoiding regressions with simple
> > queries that much harder. So I just put it off for a bit longer.)
>
> Understood.
> I presume that optimization fits mostly "seamlessly" into
> this patch design.

Right. Obviously, that's another advantage of the new table AM interface.

We could even do something much more sophisticated than what I actually have planned for 19: we could reorder table fetches, such that we only have to lock and pin each heap page exactly once *even when the index scan returns TIDs slightly out of order*. For example, if an index page/batch returns TIDs "(1,1), (2,1), (1,2), (1,3), (2,2)", we could get all tuples for heap blocks 1 and 2 by locking and pinning each of those 2 pages exactly once. The only downside (other than the complexity) is that we'd sometimes hold multiple heap page pins at a time, not just one. I think of this as making index scans behave somewhat more like bitmap scans.

It might even make sense to do it very aggressively. We don't have to hold on to pins if we can materialize/make our own private copies of the tuples for later. This is very speculative stuff; I won't be working on anything this complicated any time soon. But I think it's good to have a structure that enables this kind of thing.

> I admit I was a bit skeptical about this approach, mostly because I
> didn't have a clear idea how it would work. But it turns out to be quite
> clean.

Well, definitely cleaner than what I had before. It's the least-bad way of implementing a design that gives the table AM the required understanding of index AM costs, which is just what's required to do I/O prefetching as efficiently as possible.

> Agreed. I don't think it makes sense to require eliminating all these
> table_index_fetch_tuple calls (even if it was possible).

One could argue that the remaining use of table_index_fetch_tuple is still a modularity violation, since we're still using TIDs instead of some abstract concept that generalizes the idea of TID across all possible table AMs.
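[As an illustrative aside from the editor, not from the patch: the reordering idea above -- grouping slightly out-of-order TIDs by heap block number so that each heap page only needs to be locked and pinned once -- can be sketched in Python. The `plan_fetches` helper is hypothetical; real TIDs are (BlockNumber, OffsetNumber) pairs.]

```python
from collections import OrderedDict

def plan_fetches(tids):
    """Group (block, offset) TIDs by heap block, keeping first-seen
    block order, so each heap page is visited exactly once."""
    by_block = OrderedDict()
    for block, offset in tids:
        by_block.setdefault(block, []).append(offset)
    return list(by_block.items())

# The TIDs from the example above, returned slightly out of order:
tids = [(1, 1), (2, 1), (1, 2), (1, 3), (2, 2)]
print(plan_fetches(tids))  # [(1, [1, 2, 3]), (2, [1, 2])]
```

[In a real table AM this grouping would have to be weighed against the cost of holding multiple heap page pins at once, per the trade-off described above.]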
But that's not a new problem, and not one that we're in any way obligated to fix within the scope of this project.

> I realize this also removes mark/restore support from the old
> "amgettuple" interface, so only AMs implementing the new batching API
> will be able to do mark/restore.

Right.

> AFAICS in core this only affects btree,
> and that is switched to the batching. But won't this break external AMs
> that might do mark/restore, and don't want to / can't do batching?

Yes.

> I'm not aware of any such AMs, though. Maybe it's fine.

If somebody shows up and complains about it, we can do something about it then. But I'd rather not add code to deal with a 100% theoretical problem such as this one. I really doubt that there will be any complaints.

> Do we want to make "per-leaf-page" batches explicit in the commit
> message / comments? Yes, we do create per-leaf batches, but isn't it
> more because it's convenient, and the AM could create larger/smaller
> batches if appropriate?

It is a structure that forces index AMs to follow the existing, long-documented rules about holding onto buffer locks as an interlock against unsafe concurrent TID recycling by VACUUM. There's nothing fundamentally new about it.

> Or is this a requirement? I'm thinking about
> doing batching for gist/spgist ordered scans, where the index entries
> are not returned leaf-at-a-time.

It's true that GiST and SP-GiST don't return tuples a leaf at a time, and never hold on to buffer pins. It's also true that both support index-only scans that can give wrong answers to queries, precisely because they just ignore the rules we have for index AMs. This presents us with an absurd dilemma: should we make amgetbatch work with those requirements, even though we know that they're based on a faulty understanding of the basic protocols that index scans are supposed to follow?
It would be very difficult to make ordered GiST scans hold the buffer pins required to avoid races with VACUUM, fixing the index-only scan bug -- no question. But that difficulty has exactly nothing to do with this project.

The easiest way to fix the bugs in GiST is likely to be by disabling KNN ordered index-only scans, while making the remaining index-only scans correctly follow the index AM protocol for the first time (by holding onto leaf page pins until the table AM is done reading the heap tuples for the page's TIDs). I think that we'll probably need to disable index-only KNN scans, since I suspect that there just isn't a way to keep the number of pins held manageably low in the general case.

Once all that happens (in addition to making GiST VACUUM acquire conflicting cleanup locks, like nbtree does already), it should be possible to adapt GiST to the amgetbatch paradigm with some more work. (Recall that plain index scans that use an MVCC snapshot don't actually need to hold onto buffer pins on leaf pages.)

Adapting GiST to amgetbatch then becomes a matter of inventing a new layer that either treats GiST leaf pages just like nbtree leaf pages, or (in the case of KNN scans) builds virtual batches or some such. A virtual batch never has a buffer pin associated with it, since the relationship between index leaf pages and the contents of the batch is fuzzy. In general, though, the use of virtual batches is undesirable, because they are inherently incompatible with index-only scans.

> Another question about IOS with xs_hitup - which is not supported by
> the patch (only IOS with xs_itup are). Is there a reason why this can't
> be supported? I can't think of any, maybe I'm missing something?
We don't support them for the kinds of reasons you'd guess: they're really only useful in GiST and SP-GiST, which aren't going to be supported in the first release regardless of what we do (btw, pgvector doesn't use xs_hitup either, nor does it support any kind of index-only scan). They're also just awkward, because we can't assume that BLCKSZ space will always be enough to store all the required heap tuples (GiST uses retail palloc()s for the heap tuples). And because we have no way to test xs_hitup support.

These reasons don't make it fundamentally impossible. On the other hand, not supporting xs_hitup now doesn't close any doors to adding such support in a later release.

> There's also the question whether amgettuple/amgetbatch should be
> exclusive, or an AM could support both. In the docs the patch seems to
> imply it's exclusive, but then it also says "XXX uncertain" about this.

I lean towards requiring that index AMs choose one or the other in the first committed version. It's a reversible choice, after all.

> But maybe it's not worth it? I'm concerned about painting ourselves into
> a corner, where some index AM can't do batching for one corner case,
> and therefore it can't do batching at all.

Maybe not. I know I said that it might make sense to keep amgettuple, to allow things like KNN GiST scans to continue to work. But now I'm not so sure. The better paradigm might still be to invent the concept of virtual batches. That allows index AMs to deal with tricky cases like KNN GiST scans as their own implementation detail (mostly). It's not particularly natural for such index AMs to use amgetbatch right now... but if they're sometimes doing so anyway, that isn't really true anymore.

> I find the new "BatchIndexScan" name a bit confusing, it sounds more
> like a special type of IndexScan. Maybe IndexScanBatch would be better?
I did have that name at one point, but was then concerned that it implied that the struct belonged in a file like indexam.c, which it does not. Do you have another suggestion?

> Also, I find the batch_assert_pos_valid/batch_assert_batch_valid naming
> a bit surprising. I think the custom is to name "assert" functions
> something like AssertSomethingSomething(), to make them distinct from
> usual functions. At least that's what I saw in other patches, and I
> followed that practice ... But maybe it's not suitable for non-static
> functions.

I don't feel strongly either way.

> Speaking of batch_assert_batches_valid, why not add it to relscan.h,
> next to the other "asserts"?

No good reason. Will fix.

> Let's see how that goes. I'm not against putting that off until Postgres
> 20, but maybe it's too early to make that decision. I'd like to at least
> give it a try for 19. If it doesn't make it, that's fine.

Perhaps you're right -- I may have overreacted when I said that I/O prefetching for 19 just wasn't going to happen. I don't feel bad about putting it back in scope now, as long as the primary goal remains getting the API changes (as well as the heap buffer locking optimization) in place.

> > Likely the best solution to the problems posed by GiST and SP-GiST
> > will be to choose one of either amgettuple and amgetbatch during
> > planning, according to what the scan actually requires (while having
> > support for both interfaces in both index AMs). I'm still not sure
> > what that should look like, though -- how does the planner know which
> > interface to use, in a world where it has to make a choice with those
> > index AMs that offer both? Obviously the answer depends in part on
> > what actually matters to GiST/where GiST *can* reasonably use
> > amgetbatch, to get benefits such as prefetching. And I don't claim to
> > have a full understanding of that right now.
>
> Right, that's pretty much what I suggested earlier.
Like I said just now, this seems pretty complicated -- so complicated that not requiring the planner to figure it out at all (pushing the problem into index AMs like GiST) has a certain appeal. It's not as if amgettuple and amgetbatch are all that different. The main difference is that, as things stand, GiST cannot just use all the amgettuple state it currently holds in GISTScanOpaqueData.GISTSearchHeapItem.

> > * Review of the table AM changes, with a particular emphasis on high
> > level architectural choices.
>
> To me the proposed architecture/layering looks nice and reasonable.

Cool.

> But I haven't really thought about table AM until Andres pointed out the
> issues, so maybe I'm not the right person to judge this. And I've
> moved the code between the various layers so many times I have vertigo.

I can certainly sympathize with that.

> I don't see why we would paint ourselves into a corner with this. (I'm
> ignoring the question about allowing only one of amgettuple/amgetbatch.)

We can change our minds about requiring exactly one of amgettuple or amgetbatch in the future. It's a completely reversible design decision. We could even add a caveat about it to the sgml docs that cover the index AM API.

> > * Help with putting the contract that amgetbatch requires of index AMs
> > on a more rigorous footing. In other words, is amgetbatch itself
> > sufficiently general to accommodate the needs of index AMs in the
> > future? I've made a start on that here (by adding sgml docs about the
> > index AM API, which mention table AM concerns), but work remains,
> > particularly when it comes to supporting GiST + SP-GiST.
>
> So what exactly is the "contract" assumed by the current patch? Do you
> have any thoughts about it being too inflexible in some respect?

I was a tiny bit worried about the xs_hitup support question, but less so now. That was one. I guess that I also wondered whether the use of fields like "moreLeft" and "moreRight" was sufficiently general.
I think it probably is, actually, but it's a question that needs to be asked.

pgvector doesn't support index-only scans at all, and only does MVCC snapshots, so AFAICT it will always be safe to assume that we can drop a pin on a pgvector index page (especially because it doesn't do killitems stuff). From there, I think it's just a matter of building virtual/simulated batches. I'm not sure where the logic to do that belongs, but I suspect that it might belong in pgvector itself. After all, pgvector probably shouldn't be forced to scan most of the index just to get the required number of TIDs to make up a decent-sized batch. It needs to be able to decide that the time has come to at least return the matches it already has.

> As mentioned earlier, maybe we shouldn't tie batches to leaf pages too
> much, so that the AM can build batches not aligned to leaf pages in such
> a simple way. I think this would allow doing batching/prefetching for
> cases like the spgist ordered scans, etc.

That makes sense. But I don't think that we necessarily have to have that fully worked out to commit the first patch. After all, the problems in that area are just really hard, for reasons that have very little to do with anything new introduced by this patch series.

--
Peter Geoghegan