Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Msg-id | CAH2-WzkWjKTpV1RnPcwMewyGtYdbEcdpp6vYM3=9dnS3w5PHEw@mail.gmail.com |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Mon, Dec 1, 2025 at 11:32 AM Tomas Vondra <tomas@vondra.me> wrote:

> Thanks for the new version! I like the layering in this patch, moving
> some of the stuff from indexam.c/executor to table AM. It makes some of
> the code much cleaner, I think.

Yeah, I think so too. Clearly the main way that this improves the design is by avoiding implicit coordination between code paths that are a great distance away from each other. That's particularly true with index prefetching, where previously we maintained a visibility info cache for use in indexOnlyscan.c that was also used by the read stream callback. There's no need for such coordination if it all has to happen from the same few table AM routines.

> > I'm sure that I'll have made numerous mistakes in this new v1. There
> > will certainly be some bugs, and some of the exact details of how I'm
> > doing the layering are likely suboptimal or even wrong. I am
> > nevertheless cautiously optimistic that this will be the last major
> > redesign that will be required for this project.
>
> Sounds good. FWIW I don't see any major issues in this version.

I was thinking of stuff like how the heapam data structure still doesn't actually contain the read stream, so that indexam.c can call indexbatch.c and do things like reset the read stream if necessary. Stuff like that. Maybe we should be calling index_batch_init from heapam_handler.c, and not from indexam.c (we still do the latter). OTOH, maybe that'd be a case of adding more mechanism for no real benefit. These kinds of design choices are relatively unimportant, but they did seem like the kind of thing that I'm relatively likely to have messed up in this v1.

> > (It isn't intrinsically all that complicated to add the optimization
> > with this new table AM orientated structure, but doing so would have
> > made performance validation work/avoiding regressions with simple
> > queries that much harder. So I just put it off for a bit longer.)
>
> Understood.
> I presume that optimization fits mostly "seamlessly" into
> this patch design.

Right. Obviously, that's another advantage of the new table AM interface.

We could even do something much more sophisticated than what I actually have planned for 19: we could reorder table fetches, such that we only have to lock and pin each heap page exactly once *even when the index scan returns TIDs slightly out of order*. For example, if an index page/batch returns TIDs "(1,1), (2,1), (1,2), (1,3), (2,2)", we could get all tuples for heap blocks 1 and 2 by locking and pinning each of those 2 pages exactly once. The only downside (other than the complexity) is that we'd sometimes hold multiple heap page pins at a time, not just one. I think of this as making index scans behave somewhat more like bitmap scans.

It might even make sense to do it very aggressively. We don't have to hold on to pins if we can materialize/make our own private copies of the tuples for later. This is very speculative stuff; I won't be working on anything this complicated any time soon. But I think it's good to have a structure that enables this kind of thing.

> I admit I was a bit skeptical about this approach, mostly because I
> didn't have a clear idea how it would work. But it turns out to be quite
> clean.

Well, definitely cleaner than what I had before. It's the least-bad way of implementing a design that gives the table AM the required understanding of index AM costs, which is just what's required to do I/O prefetching as efficiently as possible.

> Agreed. I don't think it makes sense to require eliminating all these
> table_index_fetch_tuple calls (even if it was possible).

One could argue that the remaining use of table_index_fetch_tuple is still a modularity violation, since we're still using TIDs instead of some abstract concept that generalizes the idea of TID across all possible table AMs.
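[As an illustrative aside from the editor, not from the patch: the reordering idea above -- grouping slightly out-of-order TIDs by heap block number so that each heap page only needs to be locked and pinned once -- can be sketched in Python. The `plan_fetches` helper is hypothetical; real TIDs are (BlockNumber, OffsetNumber) pairs.]

```python
from collections import OrderedDict

def plan_fetches(tids):
    """Group (block, offset) TIDs by heap block, keeping first-seen
    block order, so each heap page is visited exactly once."""
    by_block = OrderedDict()
    for block, offset in tids:
        by_block.setdefault(block, []).append(offset)
    return list(by_block.items())

# The TIDs from the example above, returned slightly out of order:
tids = [(1, 1), (2, 1), (1, 2), (1, 3), (2, 2)]
print(plan_fetches(tids))  # [(1, [1, 2, 3]), (2, [1, 2])]
```

[In a real table AM this grouping would have to be weighed against the cost of holding multiple heap page pins at once, per the trade-off described above.]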
But that's not a new problem, and not one that we're in any way obligated to fix within the scope of this project.

> I realize this also removes mark/restore support from the old
> "amgettuple" interface, so only AMs implementing the new batching API
> will be able to do mark/restore.

Right.

> AFAICS in core this only affects btree,
> and that is switched to the batching. But won't this break external AMs
> that might do mark/restore, and don't want to / can't do batching?

Yes.

> I'm not aware of any such AMs, though. Maybe it's fine.

If somebody shows up and complains about it, we can do something about it then. But I'd rather not add code to deal with a 100% theoretical problem such as this one. I really doubt that there will be any complaints.

> Do we want to make "per-leaf-page" batches explicit in the commit
> message / comments? Yes, we do create per-leaf batches, but isn't it
> more because it's convenient, and the AM could create larger/smaller
> batches if appropriate?

It is a structure that forces index AMs to follow the existing, long-documented rules about holding onto buffer locks as an interlock against unsafe concurrent TID recycling by VACUUM. There's nothing fundamentally new about it.

> Or is this a requirement? I'm thinking about
> doing batching for gist/spgist ordered scans, where the index entries
> are not returned leaf-at-a-time.

It's true that GiST and SP-GiST don't return tuples a leaf at a time, and never hold on to buffer pins. It's also true that both support index-only scans that can give wrong answers to queries, precisely because they just ignore the rules we have for index AMs. This presents us with an absurd dilemma: should we make amgetbatch work with those requirements, even though we know that they're based on a faulty understanding of the basic protocols that index scans are supposed to follow?
It would be very difficult to make ordered GiST scans hold the buffer pins required to avoid races with VACUUM, fixing the index-only scan bug -- no question. But that difficulty has exactly nothing to do with this project.

The easiest way to fix the bugs in GiST is likely to be by disabling KNN ordered index-only scans, while making the remaining index-only scans correctly follow the index AM protocol for the first time (by holding onto leaf page pins until the table AM is done reading the heap tuples for the page's TIDs). I think that we'll probably need to disable index-only KNN scans, since I suspect that there just isn't a way to keep the number of pins held manageably low in the general case.

Once all that happens (in addition to making GiST VACUUM acquire conflicting cleanup locks, like nbtree does already), it should be possible to adapt GiST to the amgetbatch paradigm with some more work. (Recall that plain index scans that use an MVCC snapshot don't actually need to hold onto buffer pins on leaf pages.)

Adapting GiST to amgetbatch then becomes a matter of inventing a new layer that either treats GiST leaf pages just like nbtree leaf pages, or (in the case of KNN scans) builds virtual batches or some such. A virtual batch never has a buffer pin associated with it, since the relationship between index leaf pages and the contents of the batch is fuzzy. In general, though, the use of virtual batches is undesirable, because they are inherently incompatible with index-only scans.

> Another question about IOS with xs_hitup - which is not supported by
> the patch (only IOS with xs_itup are). Is there a reason why this can't
> be supported? I can't think of any, maybe I'm missing something?
We don't support them for the kinds of reasons you'd guess: they're really only useful in GiST and SP-GiST, which aren't going to be supported in the first release regardless of what we do (btw, pgvector doesn't use xs_hitup either, nor does it support any kind of index-only scan). They're also just awkward, because we can't assume that BLCKSZ space will always be enough to store all the required heap tuples (GiST uses retail palloc()s for the heap tuples). And because we have no way to test xs_hitup support.

These reasons don't make it fundamentally impossible. On the other hand, not supporting xs_hitup now doesn't close any doors to adding such support in a later release.

> There's also the question whether amgettuple/amgetbatch should be
> exclusive, or an AM could support both. In the docs the patch seems to
> imply it's exclusive, but then it also says "XXX uncertain" about this.

I lean towards requiring that index AMs choose one or the other in the first committed version. It's a reversible choice, after all.

> But maybe it's not worth it? I'm concerned about painting ourselves into
> a corner, where some index AM can't do batching for one corner case,
> and therefore it can't do batching at all.

Maybe not. I know I said that it might make sense to keep amgettuple, to allow things like KNN GiST scans to continue to work. But now I'm not so sure. The better paradigm might still be to invent the concept of virtual batches. That allows index AMs to deal with tricky cases like KNN GiST scans as their own implementation detail (mostly). It's not particularly natural for such index AMs to use amgetbatch right now... but if they're sometimes doing so anyway, that isn't really true anymore.

> I find the new "BatchIndexScan" name a bit confusing, it sounds more
> like a special type of IndexScan. Maybe IndexScanBatch would be better?
I did have that name at one point, but was then concerned that it implied that the struct belonged in a file like indexam.c, which it does not. Do you have another suggestion?

> Also, I find the batch_assert_pos_valid/batch_assert_batch_valid naming
> a bit surprising. I think the custom is to name "assert" functions
> something like AssertSomethingSomething(), to make them distinct from
> usual functions. At least that's what I saw in other patches, and I
> followed that practice ... But maybe it's not suitable for non-static
> functions.

I don't feel strongly either way.

> Speaking of batch_assert_batches_valid, why not add it to relscan.h,
> next to the other "asserts"?

No good reason. Will fix.

> Let's see how that goes. I'm not against putting that off until Postgres
> 20, but maybe it's too early to make that decision. I'd like to at least
> give it a try for 19. If it doesn't make it, that's fine.

Perhaps you're right -- I may have overreacted when I said that I/O prefetching for 19 just wasn't going to happen. I don't feel bad about putting it back in scope now, as long as the primary goal remains getting the API changes (as well as the heap buffer locking optimization) in place.

> > Likely the best solution to the problems posed by GiST and SP-GiST
> > will be to choose one of either amgettuple and amgetbatch during
> > planning, according to what the scan actually requires (while having
> > support for both interfaces in both index AMs). I'm still not sure
> > what that should look like, though -- how does the planner know which
> > interface to use, in a world where it has to make a choice with those
> > index AMs that offer both? Obviously the answer depends in part on
> > what actually matters to GiST/where GiST *can* reasonably use
> > amgetbatch, to get benefits such as prefetching. And I don't claim to
> > have a full understanding of that right now.
>
> Right, that's pretty much what I suggested earlier.
Like I said just now, this seems pretty complicated -- so complicated that not requiring the planner to figure it out at all (pushing the problem into index AMs like GiST) has a certain appeal. It's not as if amgettuple and amgetbatch are all that different. The main difference is that, as things stand, GiST cannot just use all the amgettuple state it currently holds in GISTScanOpaqueData.GISTSearchHeapItem.

> > * Review of the table AM changes, with a particular emphasis on high
> > level architectural choices.
>
> To me the proposed architecture/layering looks nice and reasonable.

Cool.

> But I haven't really thought about table AM until Andres pointed out the
> issues, so maybe I'm not the right person to judge this. And I've
> moved the code between the various layers so many times I have vertigo.

I can certainly sympathize with that.

> I don't see why we would paint ourselves into a corner with this. (I'm
> ignoring the question about allowing only one of amgettuple/amgetbatch.)

We can change our minds about requiring exactly one of amgettuple or amgetbatch in the future. It's a completely reversible design decision. We could even add a caveat about it to the sgml docs that cover the index AM API.

> > * Help with putting the contract that amgetbatch requires of index AMs
> > on a more rigorous footing. In other words, is amgetbatch itself
> > sufficiently general to accommodate the needs of index AMs in the
> > future? I've made a start on that here (by adding sgml docs about the
> > index AM API, which mention table AM concerns), but work remains,
> > particularly when it comes to supporting GiST + SP-GiST.
>
> So what exactly is the "contract" assumed by the current patch? Do you
> have any thoughts about it being too inflexible in some respect?

I was a tiny bit worried about the xs_hitup support question, but less so now. That was one. I guess that I also wondered whether the use of fields like "moreLeft" and "moreRight" was sufficiently general.
I think it probably is, actually, but it's a question that needs to be asked.

pgvector doesn't support index-only scans at all, and only does MVCC snapshots, so AFAICT it will always be safe to assume that we can drop a pin on a pgvector index page (especially because it doesn't do killitems stuff). From there, I think it's just a matter of building virtual/simulated batches. I'm not sure where the logic to do that belongs, but I suspect that it might belong in pgvector itself. After all, pgvector probably shouldn't be forced to scan most of the index just to get the required number of TIDs to make up a decent-sized batch. It needs to be able to decide that the time has come to at least return the matches it already has.

> As mentioned earlier, maybe we shouldn't tie batches to leaf pages too
> much, so that the AM can build batches not aligned to leaf pages in such
> a simple way. I think this would allow doing batching/prefetching for
> cases like the spgist ordered scans, etc.

That makes sense. But I don't think that we necessarily have to have that fully worked out to commit the first patch. After all, the problems in that area are just really hard, for reasons that have very little to do with anything new introduced by this patch series.

--
Peter Geoghegan