Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com |
| In response to | Re: index prefetching (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |
On Mon, Nov 10, 2025 at 6:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The new tentative plan is to cut scope by focussing on switching over
> to the new index AM + table AM interface from the patch in the short
> term, for Postgres 19.

Attached patch makes the table AM revisions we talked about. This is a significant change in direction, so I'm adopting a new patch versioning scheme: this new version is v1. (I just find it easier to deal with sequential patch version numbers.)

I'm sure that I'll have made numerous mistakes in this new v1. There will certainly be some bugs, and some of the exact details of how I'm doing the layering are likely suboptimal or even wrong. I am nevertheless cautiously optimistic that this will be the last major redesign that will be required for this project.

> There is an almost immediate benefit to just
> doing that much, unrelated to I/O prefetching for index scans: it
> enables batching of heap page buffer locking/unlocking (during the
> point where index scans perform heap_hot_search_buffer calls) on the
> table AM/heapam side during ordered index scans.

Note that this new v1 doesn't yet include the important heapam buffer locking optimization discussed here. It didn't seem worth holding up everything just for that. Plan is to get to it next. (It isn't intrinsically all that complicated to add the optimization with this new table AM orientated structure, but doing so would have made performance validation work/avoiding regressions with simple queries that much harder. So I just put it off for a bit longer.)

What's new in v1 (compared to v20251109-*, the prior version):

* The first patch in the series is now mostly about changing the table AM and index AM in a complementary way (not just about adding the amgetbatch interface to the index AM).
To summarize this point (mostly just a recap of recent discussion on the table AM API on this thread) with its own sub points:

- We're now using a slot-based table AM interface that understands scan direction. We now do all VM access for index-only scans on the heapam side, fixing that existing table AM modularity violation once and for all.

- Batches returned by amgetbatch are directly managed by heapam, giving it the direct control that it requires to get the best possible performance. Whether that's for adding I/O prefetching, or for other optimizations.

- The old table_index_fetch_tuple index scan interface is still needed -- though only barely. The rule going forward for core executor code is that it should always use this new slot-based interface, unless there is a specific need for such a caller to pass *their own* TID, in a way that cannot possibly be delegated to our new high level table AM interface. For example, we still need table_index_fetch_tuple for nbtree's _bt_check_unique; it must pass TIDs to heapam, and get back tuples, without starting any new index scan to do so (the only "index scan" involved in the case of the _bt_check_unique caller takes place in the btinsert that needs to perform unique index enforcement in passing). I think it makes perfect sense that a small handful of special case callers still need to use table_index_fetch_tuple, since there really is no way around the need for these callers to pass their own TID.

* Major restructuring of batch management code, to allow it to work with the table AM interface (as well as related improvements and polishing). The parts of batch management that aren't under the direct control of table AMs/heapam (the batch helper functions that all table AMs will use) are no longer in indexam.c; there's a new file for those routines named indexbatch.c. indexbatch.c is also the place where a few other helper functions go.
These other functions are called by indexam.c/the core executor, for things like initializing an amgetbatch scan, and informing nbtree that it is taking a mark (for mark/restore). Maybe there are certain remaining problems with the way that indexam.c and heapam_handler.c are coordinating across index scans. Hopefully the structure wasn't accidentally overfitted to heapam/isn't brittle in some other way.

* Renamed and made lots of tweaks to batching related functions and structs. I've also integrated code that previously appeared in its own "batch cache" patch into the new main commit in the patch series (the first patch in the new series). The main goal of the tweaks to the data structures was to avoid indirection that previously caused small regressions in my microbenchmarks. We're very sensitive to costs from additional pointer chasing in these code paths. And from even small memory allocations. I think that I've avoided all regressions with just the first patch, at least for my own microbenchmark suite. I did not aim to avoid these regressions with the prefetching patch, since I consider it out of scope now (for Postgres 19).

* v1 breaks out prefetching into its own patch, which is now the second patch in the patch series. The new I/O prefetching patch turned out to be surprisingly small. I still feel good about our choice to put that off until Postgres 20, though -- it's definitely where most of the difficulties are. Especially with things like resource management. (The problem with the second patch is that it's too small/doesn't address all the problems, not that it's too big and unwieldy.)

Prefetching works at least as well as it did in earlier versions (maybe even slightly better). It's not just an afterthought here. At a minimum, we need to continue to maintain prefetching in a reasonably complete and usable form to keep us honest about the design changes in the table AM and index AM APIs.
If the design itself cannot eventually accommodate Postgres 20 work on I/O prefetching (and even later work), then it's no good.

Minor caveat about preserving prefetching in good working order: I disabled support for index-only scans that use I/O prefetching for heap accesses in the second patch, at least for now. To recap, IoS support requires a visibility cache so that both readBatch and streamBatch agree on exactly which heap blocks will need to be read, even when the visibility map has some relevant heap page bits concurrently set or unset. It won't be too hard to add something like that back to heapam_handler.c, but I didn't get around to doing so just yet.

It might be independently useful to have some kind of visibility cache, even without prefetching; batching VM accesses (say by doing them up front, for a whole batch, right after amgetbatch returns) might work out, saving cycles with cached scans. You know, somewhat like how we'll do same-heap-page heap tuple fetches eagerly as a way of minimizing buffer lock/unlock traffic.

* There's a new patch that adds amgetbatch support for hash indexes. This demonstrates that the amgetbatch interface is already reasonably general. And that adding support to an index AM doesn't have to be all that invasive. I'm more focussed than ever on the generality of the API now.

* Added documentation that attempts to formalize the constraints that index AMs that opt to use amgetbatch are under.

I don't think that it makes sense to think of amgettuple as the legacy interface for plain index scans. There will probably always be cases like KNN GiST scans, that legitimately need the index AM to directly control the progress of index scans, a tuple at a time. After all, these scan types give an absurd amount of control over many things to the index AM -- that seems to really make it hard to put the table AM in control of the scan's progress.
For example, GiST scans use their own GISTSearchHeapItem struct to manage each item returned to the scan (which has a bunch of extra fields compared to our new AM-generic BatchMatchingItem struct). This GISTSearchHeapItem struct allows GiST to indicate whether or not an index tuple's quals must be rechecked. It works at the tuple granularity (individual GiST opclasses might expect that level of flexibility), which really works against the batching concepts that we're pursuing here.

It's true that hash index scans are also lossy, but that's quite different: they're inherently lossy. It's not as if hash index scans are sometimes not lossy. They certainly cannot be lossy for some tuples but not other tuples that all get returned during the same index scan. Not so with GiST scans.

Likely the best solution to the problems posed by GiST and SP-GiST will be to choose one of either amgettuple or amgetbatch during planning, according to what the scan actually requires (while having support for both interfaces in both index AMs). I'm still not sure what that should look like, though -- how does the planner know which interface to use, in a world where it has to make a choice with those index AMs that offer both? Obviously the answer depends in part on what actually matters to GiST/where GiST *can* reasonably use amgetbatch, to get benefits such as prefetching. And I don't claim to have a full understanding of that right now.

Here are the things that I'd like to ask from reviewers, and from Tomas:

* Review of the table AM changes, with a particular emphasis on high level architectural choices.

* Most importantly: will the approach in this new v1 avoid painting ourselves into a corner? It can be incomplete, as long as it doesn't block progress on things we're likely to want to do in the next couple of releases.

* Help with putting the contract that amgetbatch requires of index AMs on a more rigorous footing.
In other words, is amgetbatch itself sufficiently general to accommodate the needs of index AMs in the future? I've made a start on that here (by adding sgml docs about the index AM API, which mentions table AM concerns), but work remains, particularly when it comes to supporting GiST + SP-GiST.

I think it makes sense to keep feedback mostly high level for now -- to make it primarily about how the individual API changes fit together, if they're coordinating too much (or not enough), and if the interface we have is able to accommodate future needs.

--
Peter Geoghegan