Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com |
| In response to | Re: index prefetching (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: index prefetching; Re: index prefetching |
| List | pgsql-hackers |
On Tue, Jan 13, 2026 at 3:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The batch stopped applying again. Attached is v7.

Attached is v8. A lot has changed in just the past week. We're feeling optimistic about getting I/O prefetching into Postgres 19, and so we've decided to fully commit ourselves to that. All of the remaining kludges that we've relied on up until now have either been removed, or replaced with a principled approach.

Notable changes in this v8 compared to v7:

* We no longer "reset" the read stream to deal with exhaustion of the available slots for batches. We now rely on a patch of Thomas Munro's [1] to pause the read stream from within the new heapam callback in the event of running out of batch slots. When this happens, we pause only for as long as it takes for the scan to return enough items to make the scan advance scanPos to the next batch in the ring buffer (obviously, this next batch must already be loaded in this scenario). At that point we'll resume the read stream, which will once again be able to call amgetbatch to store another batch in the just-freed slot. Running out of batch slots like this happens rarely, so it's important to have a reasonably simple and elegant solution, which this seems to be. Importantly, there's no "ping ponging" behavior here.

* Removed the grotty heuristics in our read stream callback that were there to avoid regressions. These first appeared many months ago, and no longer appear to be necessary. There are still some regressions with pathological cases, but they seem to be well within acceptable bounds. I saw about a 10% remaining increase in query execution time for an adversarially crafted query that Tomas came up with back in August [2]. That query provided the original justification for my inventing those heuristics.

* We now have a real strategy around resource management and buffer pins.
A concern long held by Andres (and others) was that holding on to many index leaf page buffer pins would somehow conflict with the read stream's careful management of heap page buffer pins [3]. We were particularly concerned about hard-to-hit cases where the read stream is somehow limited by recently acquired buffer pins for index pages. This was probably the thing that made me most doubtful about our ability to get the prefetching patch in shape for Postgres 19. But that's completely changed in just the past week.

Dropping leaf page buffer pins during index-only scans
------------------------------------------------------

I realized (following a point made by Matthias van de Meent during a private call) that it simply isn't necessary to hold on to index page buffer pins, in general. We haven't actually done that with nbtree plain index scans with MVCC snapshots since 2015's commit 2ed5b87f, which added what we now call nbtree's dropPin optimization. What if we could find a way to teach *every* amgetbatch-supporting index AM to do the same trick for *all* scans (barring non-MVCC snapshot scans)? Including index-only scans and scans of unlogged relations? Then we'd have zero chance of unintended interactions with the read stream; there'd simply be no extra buffer pins that might confuse the read stream in the first place!

Matthias told me that his patch to fix the bugs in GiST index-only scans works by not dropping the pin on a GiST leaf page right away. It delays dropping such pins, but only very slightly: if we cache visibility information from the VM (which we're doing already in the amgetbatch patch, and which Matthias' patch does too), and delay dropping a batch's leaf page pin until after its VM cache is loaded, it reliably avoids races of the kind that we need to be worried about here.
In short, we can eagerly drop buffer pins during index-only scans in the same way (or virtually the same way) that we've long been able to with nbtree plain index scans, thanks to nbtree's dropPin optimization.

The race in question involves VACUUM concurrently setting a heap page all-visible in the VM while also recycling a TID on that page (as it sets the page all-visible). We can safely allow VACUUM to go ahead with this while still dropping our pin early -- provided we build our local cache of visibility information first. Holding on to a leaf page pin while reading from the VM suffices. The important principle is that our local cache of VM info is (and will remain) consistent with what we saw on the index page when we saved its matching TIDs into a batch. (It doesn't matter that we do heap fetches for now-all-visible pages, because such TIDs cannot possibly be visible to the scan's MVCC snapshot. Just like in the plain index scan dropPin case. And rather like bitmap index scans.)

v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch has a new isolation test, index-only-scan-visibility.spec, that demonstrates the new "drop pins eagerly during index-only scans" behavior. The isolation test is a variant of the one I posted on the GiST thread, which proved that GiST is broken here (a problem that Matthias is working on fixing). If you attempt to run this isolation test on master, it'll block forever; VACUUM can never acquire a cleanup lock due to a conflicting buffer pin held by an index-only scan. That doesn't happen with v8 of the patch, though; it completes in less than 20ms on my system (and the scan actually gives correct results!).

This still leaves non-MVCC snapshot scans. There's nothing we can do to avoid holding on to a leaf page buffer pin while accessing the heap there. But that's okay; now we just refuse to do I/O prefetching during such scans.
Dropping leaf page buffer pins during scans of an unlogged relation
-------------------------------------------------------------------

Another thing that hinders nbtree's dropPin optimization (and that we must deal with to get a guarantee that leaf page buffer pins never really need to be kept around) is the use of an unlogged relation. That breaks dropPin's approach to detecting unsafe concurrent TID recycling when marking dead items LP_DEAD on index pages, since that involves stashing a page LSN, and then checking whether it has changed later on.

We solve that problem by introducing GiST-style "fake LSNs" to both nbtree and hash. Now the same LSN trick works for unlogged relations, too. We now require that any other amgetbatch index AMs that might be added in the future also use fake LSNs like this. Alternatively, such an index AM could just not provide a _bt_killitems-like mechanism at all -- that also works. Or, they could limit the use of such a mechanism to logged relations. Third party table AMs don't really need to deal with this themselves, though.

Performance impact of calling BufferGetLSNAtomic during affected scans
----------------------------------------------------------------------

Our expanded use of BufferGetLSNAtomic() during index-only scans has the potential to cause regressions, at least when page checksums are enabled. We're planning on relying on a patch of Andreas Karlsson's to make BufferGetLSNAtomic use an atomic op [4], which fixes this regression. I'm not including that here, though (I would, but for the fact that it breaks the Debian Trixie CI target due to a known misaligned-access bug that Andreas is working on fixing). Anybody that does performance validation of either index-only scans or scans of an unlogged relation should bear that in mind.

> > I still haven't had time to produce an implementation of the "heap
> > buffer locking minimization" optimization that's clean enough to
> > present to the list.
>
> Still haven't done this.
This idea has now been deprioritized. Quite a few things have fallen together recently, so we're "pivoting back" to work on prefetching for Postgres 19.

[1] https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
[2] https://postgr.es/m/9af33041-1c16-4973-855a-718aa1048ee1@vondra.me
[3] https://postgr.es/m/mc5w6mj52dzl7ant7nmjmwxjmvmlwekwjmf77eotrra3pghrfl@d7mq3hxvdapa
[4] https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se

--
Peter Geoghegan
Attachment
- v8-0001-Extract-fake-LSN-infrastructure-from-GiST-index-A.patch
- v8-0002-Add-fake-LSN-support-to-nbtree.patch
- v8-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch
- v8-0004-Introduce-read_stream_-pause-resume-yield.patch
- v8-0005-Add-prefetching-to-index-scans-using-batch-interf.patch
- v8-0006-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch
- v8-0007-Add-fake-LSN-support-to-hash-index-AM.patch
- v8-0008-Make-hash-index-AM-use-amgetbatch-interface.patch