Re: index prefetching - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | 07783ccc-0c85-4c0c-817f-faa5a2c38e64@vondra.me |
| In response to | Re: index prefetching (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |

On 11/11/25 00:59, Peter Geoghegan wrote:

> On Sun, Nov 2, 2025 at 6:49 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Nothing really new here (I've been working on batching on the table AM side, but nothing to show on that just yet).
>
> Tomas and I had a meeting on Friday to discuss a way forward with this project. Progress has stalled, and we feel that now is a good time to pivot by refactoring the patch into smaller, independently useful/committable pieces. This email explains our current thinking (Tomas should correct me if I get anything wrong here).
>
> The short version/executive summary
> ===================================
>
> The need to get everything done in one single release seems to be hampering progress. We made quick progress for a few months, but now that we've exhausted the easy wins, the layering issues that remain are making every remaining open item near intractable.
>
> The layering issues make it very hard to keep on top of all of the regressions; we're just doing too much at once. We're trying to manage all of the regressions from the addition of prefetching/a heapam read stream, while also trying to manage the regressions from moving index AMs from the old amgettuple interface to the new amgetbatch interface. And we still need to revise the table AM to move the read stream from indexam.c over to the table AM side (this isn't in the latest version of the patch at all).
>
> Just making these AM interface changes is already a huge project on its own. This makes it hard to focus on just a few things at any one time; everything is interdependent. We seem to end up playing whack-a-mole whenever we try to zero in on any single problem; we end up going in circles.
>
> The new tentative plan is to cut scope by focussing on switching over to the new index AM + table AM interface from the patch in the short term, for Postgres 19. There is an almost immediate benefit to just doing that much, unrelated to I/O prefetching for index scans: it enables batching of heap page buffer locking/unlocking (during the point where index scans perform heap_hot_search_buffer calls) on the table AM/heapam side during ordered index scans. That can dramatically cut down on repeat buffer locking and unlocking, giving us enough of a win (more details below) to be the sole justification for switching over to the new set of AM interfaces for Postgres 19.
>
> Our long term goals won't change under this phased approach, but our timeline/short term focus certainly will. We hope to get some general feedback about this new strategy for the project now, particularly from Andres. The main practical concern is managing project risk sensibly.
>
> Difficulties with refactoring AM interfaces while introducing a read stream
> ===========================================================================
>
> The uncertainty about how to resolve some of the remaining individual open items for the project (specifically concerns about I/O prefetching/read stream + resource management concerns, and how they *interact* with broader layering questions) is the main blocker to further progress. I'll now give a specific example of what I mean by this, just because it's likely to be clearer than explaining the underlying problem in general terms.
>
> Currently, we can have up to 64 leaf-page-wise batches. Usually this is more than enough, but occasionally we run out of space for batches, and have to reset the read stream. This is certainly a kludge; we discard pinned buffers with useful data in order to work around what we've thought of as an implementation deficiency on the read stream side up until now. Obviously just discarding useful work like that is never okay (nobody will argue with me on this, I'm sure).
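
For context, the end-and-reset behavior being described maps onto the read stream API roughly as follows. This is a simplified sketch, not code from the patch: read_stream_next_buffer() and read_stream_reset() are the real API calls, but IndexPrefetchState and the batch accounting are invented here purely for illustration.

```c
#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/* Invented, simplified stand-in for the patch's per-scan prefetch state. */
typedef struct IndexPrefetchState
{
	int			nbatches;		/* leaf-page-wise batches currently held */
	int			maxbatches;		/* arbitrary cap, currently 64 */
	BlockNumber *pending;		/* heap blocks taken from those batches */
	int			npending;
	int			next;
} IndexPrefetchState;

/* Block number callback: feeds heap block numbers to the read stream. */
static BlockNumber
index_prefetch_next_block(ReadStream *stream,
						  void *callback_private_data,
						  void *per_buffer_data)
{
	IndexPrefetchState *state = (IndexPrefetchState *) callback_private_data;

	/* If the current batches still have blocks queued up, hand one out. */
	if (state->next < state->npending)
		return state->pending[state->next++];

	/*
	 * We'd like to read another leaf page to refill the queue, but if all
	 * batch slots are in use, all the callback can do is report
	 * end-of-stream.  There's no way to say "pause": the stream must later
	 * be reset, and the reset throws away whatever it had already pinned.
	 */
	if (state->nbatches >= state->maxbatches)
		return InvalidBlockNumber;

	/* ... otherwise read the next leaf page and refill state->pending ... */
	return InvalidBlockNumber;	/* placeholder for the refill path */
}

/* Consumer side: when the stream runs dry, free batches and start over. */
static Buffer
index_prefetch_next_buffer(IndexPrefetchState *state, ReadStream *stream)
{
	Buffer		buf = read_stream_next_buffer(stream, NULL);

	if (!BufferIsValid(buf))
	{
		/*
		 * The stream hit the batch cap and ended.  Free some leaf-page-wise
		 * batches, then reset: read_stream_reset() releases the buffers the
		 * stream had already pinned ahead of us -- the discarded work
		 * described above.
		 */
		read_stream_reset(stream);
		buf = read_stream_next_buffer(stream, NULL);
	}

	return buf;
}
```
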
> At various points we talked about addressing this particular problem by teaching the read stream to "pause" such that we can consume those remaining pinned buffers as needed, without consuming even more heap pages/buffers to do so (there's no natural upper bound on those, I think). We'd then "unpause" and resume prefetching again, once we managed to free some more leaf-page-wise batches up. But I'm now starting to have serious doubts about this approach (or at least doubts about the approach that I think other people have in mind when they talk about this kind of "pausing").
>
> Again, it's really hard to pin down *where* we should be fixing things.
>
> It occurs to me that it doesn't make much sense that the table AM/indexam.c has *no* awareness of how many heap buffers are already pinned on its behalf. The fact that that knowledge is *exclusively* confined to the read stream isn't actually good. What we really need to do is to care about all buffer pins held by the whole index scan node, whether for index pages or for heap pages (though note that holding onto buffer pins on index pages should be rare in practice). We need to directly acknowledge the tension that exists between heapam and index AM needs, I think.
>
> The read stream needs to be involved in this process, but it should be a 2-way conversation. The read stream already defensively checks externally held buffer pins, which might kinda work for what we have in mind -- but probably not. It seems bad to depend on what is supposed to be a defensive measure for all this.
>
> Separately, we'll probably eventually want the heapam side to be able to notice that a block number that it requests is already in the pending list, so that it can be marked as a duplicate (and so not unpinned until the duplicate request is also satisfied/has its heap tuples returned to the scan). That's another factor pushing things in this general direction. (Less important, but noted here for completeness.)
>
> I've been talking about problems when 64 leaf-page-wise batches isn't enough, which is rare in practice. It's far more common for 64 to be too *many* batches, which wastes memory (e.g., with largely sequential heap access we seem to need no more than 5 or 10 at a time, even when prefetching is really important). But it's hard to see how we could lazily allocate memory used for batches under anything like the current structure. It's circular: we should only allocate more leaf-page-wise batches to make it possible to do more useful heap prefetching. But right now heap prefetching will stall (or will "pause" in its own kludgey way) precisely because there aren't enough leaf-page-wise batches!
>
> Granted, adding a "pausing" capability might be useful elsewhere. But that in itself doesn't justify the general idea of pausing in the specific way that index prefetching requires. Why should it?
>
> Why should we pause when we've filled 64 leaf-page-wise batches instead of 5 or 10 or 1000? ISTM that we're tacitly assuming that the total number of usable leaf-page-wise batches remaining is a useful proxy for the costs that actually matter. But why should it be? 64 is just a number that we picked fairly arbitrarily, and one that has only a weak relationship with more concrete costs such as leaf page buffer pins held (as noted already, needing to hold onto a leaf page buffer pin until we call btfreebatch against its batch isn't actually needed during most index scans, but there will be exceptions).
>
> My gut instinct is that this stuff will actually matter, in practice, at least some of the time. And that that'll necessitate giving the implementation a clear and complete picture of costs and benefits when scheduling index scans that prefetch. Pausing can't be based on some randomly chosen magic number, like 64, since that's bound to be totally wrong in a nonzero number of cases.
>
> ISTM that we cannot subordinate the table AM to the read stream. But we also can't subordinate the read stream to the table AM. Figuring all that out is hard. This is the kind of problem that we'd like to defer for now.
>
> Minor caveat: I'm not sure that Tomas endorses everything I've said here about "pausing" the read stream. But that probably doesn't matter much. Either way, these kinds of questions still weigh on the project, and something should be done about it now, to keep things on course.

I think I generally agree with what you said here about the challenges, although it's a bit too abstract to respond to individual parts. I just don't know how to rework the design to resolve this ...

For the read stream "pausing" I think it's pretty clear it's more a workaround than a desired behavior. We only pause the stream because we need to limit the look-ahead distance (measured in index leaf pages), and the read_stream has no such concept. It only knows about heap pins, but e.g. IOS may need to read many leaf pages to find a single heap page to prefetch. And the leaf pages are invisible to the stream.

The limit of 64 batches is entirely arbitrary. I needed a number that would limit the amount of memory and time wasted on useless look-ahead, and 64 seemed "reasonable" (not too high, but enough to not be hit very often). Originally there was a fixed-length queue of batches, and 64 was the capacity, but we no longer do it that way. So it's an imperfect safety measure against "runaway" streams.

I don't want to get into too much detail about this particular issue, it's already discussed somewhere in this thread. But if there was a way to "tell" the read stream how much effort to spend looking ahead, we wouldn't do the pausing (not in the end+reset way).
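
To make that wish concrete, the kind of knob being described might look something like the sketch below. Note that read_stream_set_lookahead_limit() does not exist in today's read stream API; the function and the policy shown here are purely hypothetical, included only to illustrate what "telling the stream how much effort to spend looking ahead" could mean.

```c
#include "postgres.h"

#include "storage/read_stream.h"

/*
 * Hypothetical only: no such function exists in PostgreSQL's read stream
 * API.  The idea is that the batch layer could cap the stream's look-ahead
 * and raise the cap as leaf-page-wise batches are freed, instead of ending
 * the stream and resetting it (which throws away pinned buffers).
 */
extern void read_stream_set_lookahead_limit(ReadStream *stream,
											int max_unconsumed_blocks);

static void
adjust_lookahead(ReadStream *stream, int nbatches, int maxbatches)
{
	/*
	 * Crude policy, purely for illustration: allow deep look-ahead while
	 * plenty of batch slots are free, and shrink it toward zero as they
	 * fill up.  A real policy would have to weigh buffer pins, memory for
	 * batches, and the expected I/O benefit against each other.
	 */
	int			free_slots = maxbatches - nbatches;

	read_stream_set_lookahead_limit(stream, Max(1, free_slots * 4));
}
```
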
> Phased approach
> ===============
>
> As touched upon at the start of this email, under this new phased approach to the project, the short term goal is to make heapam avoid repeat buffer locks during index scans where that's clearly avoidable. Making that much work shares many of the same problems with I/O prefetching (particularly the basics of layering/AM revisions), but defers dealing with the thorniest issues with pin resource management. That's what I'll talk about here --- what we can defer, and what we cannot defer.
>
> But first, on a more positive note, I'll talk about the short term benefits. My early prototype of the "only lock heap buffer once per group of TIDs that point to the same heap page returned from an index scan" optimization has been shown to improve throughput for large-ish range scans by quite a bit. Variants of pgbench select with queries like "SELECT * FROM pgbench_accounts WHERE aid BETWEEN 1000 AND 1500" show improvements in throughput of up to 20% (and show similar reductions in query latency). That's a nice win, all on its own.
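
As a rough illustration of what that optimization amounts to on the heapam side: the sketch below assumes (this is not Peter's prototype) that the table AM can obtain a group of TIDs it already knows point into the same heap page. TidGroup and heapam_fetch_tid_group are invented names; heap_hot_search_buffer() and LockBuffer() are the real primitives whose calls get batched under a single content lock.

```c
#include "postgres.h"

#include "access/heapam.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"
#include "utils/snapshot.h"

/* Invented container: TIDs that the index AM says share one heap page. */
typedef struct TidGroup
{
	BlockNumber block;			/* heap page all of these TIDs point into */
	int			ntids;
	ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER];
} TidGroup;

static Buffer
heapam_fetch_tid_group(Relation heaprel, Snapshot snapshot,
					   TidGroup *group, HeapTupleData *tuples, bool *found)
{
	Buffer		buf = ReadBuffer(heaprel, group->block);

	/* One content lock/unlock cycle for the whole group, not one per TID */
	LockBuffer(buf, BUFFER_LOCK_SHARE);

	for (int i = 0; i < group->ntids; i++)
	{
		ItemPointerData tid = group->tids[i];
		bool		all_dead;

		found[i] = heap_hot_search_buffer(&tid, heaprel, buf, snapshot,
										  &tuples[i], &all_dead, true);
	}

	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	/*
	 * Return the still-pinned buffer: the HeapTupleData entries point into
	 * it, so the caller keeps the pin until the tuples have been consumed
	 * and only then calls ReleaseBuffer().
	 */
	return buf;
}
```
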
> Now back to talking about risks. There's still a lot of complexity that cannot be deferred with this phased approach. We must still switch over index AMs from amgettuple to the new amgetbatch interface. And, we need to make the table AM interface used by index scans higher level: index_getnext_slot would directly call a new table-AM-wise callback, just passing it its own ScanDirection argument directly -- we wouldn't be passing TIDs to the table AM anymore.
>
> The new table AM batch interface would work in terms of "give me the next tuple in the current scan direction", not in terms of "give me this random TID, which you know nothing else about". The table AM becomes directly aware of the fact that it is participating in an ordered index scan. This design is amenable to allowing the table AM to see which accesses will be required in the near future -- that requirement is common to both I/O prefetching and this other heap buffer lock optimization.
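
The shape of such a callback might be roughly as follows. The names are hypothetical guesses, not taken from the patch; the point is only that the table AM receives the scan plus a ScanDirection and drives the index AM's batches itself, rather than being handed one opaque TID at a time.

```c
#include "postgres.h"

#include "access/relscan.h"
#include "access/sdir.h"
#include "executor/tuptable.h"

/*
 * Hypothetical table AM callback: return the next tuple of an ordered
 * index scan in the given direction, or false when the scan is exhausted.
 * The table AM pulls TIDs from the index AM's leaf-page-wise batches as it
 * sees fit, which is what lets it group TIDs by heap page internally.
 */
typedef bool (*IndexScanGetNext_function) (IndexScanDesc scan,
										   ScanDirection direction,
										   TupleTableSlot *slot);

/*
 * index_getnext_slot() would then boil down to something like this,
 * delegating all TID handling (and any same-page batching) to the AM.
 */
static bool
index_getnext_slot_sketch(IndexScanDesc scan, ScanDirection direction,
						  TupleTableSlot *slot,
						  IndexScanGetNext_function getnext)
{
	return getnext(scan, direction, slot);
}
```
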
> It's even more complicated than just those changes to the index AM and table AM interfaces: we'll also require that the table AM directly interfaces with another layer that manages leaf-page-wise batches on its behalf. They need to *cooperate* with each other, to a certain degree. The executor proper won't call amgetbatch directly under this scheme (it'd just provide a library of routines that help table AMs to do so on their own).
>
> That much doesn't seem deferrable. And it's hard. So this phased approach certainly doesn't eliminate project risk, by any stretch of the imagination. Offhand, I'd estimate that taking this phased approach cuts the number of blockers to making an initial commit in half.
>
> Here's a nonexhaustive list of notable pain points that *won't* need to be addressed in the short term, under this new approach/structure (I'm somewhat repeating myself here):
>
> * Most regressions are likely much easier to avoid/are automatically avoided. Particularly with selective point query scans.
>
> * No need to integrate the read stream, no need to solve most resource management problems (the prior item about regressions is very much related to this one).
>
> * No need for streamPos stuff when iterating through TIDs from a leaf-page-wise batch (only need readPos now). There's no need to keep those 2 things in sync, because there'll only be 1 thing now.
>
> Here's a nonexhaustive list of problems that we *will* still need to solve in the earliest committed patch, under this phased approach (again, I'm repeating myself somewhat):
>
> * Actually integrating the amgetbatch interface in a way that is future-proof.
>
> * Revising the table AM interface such that the table AM is directly aware of the fact that it is feeding heap/table tuples to an ordered index scan. That's a big conceptual shift for table AMs.
>
> * Making the prior 2 changes "fit together" sensibly, in a way that considers current and future needs. Also a big shift.
>
> The "only lock heap buffer once per group of TIDs that point to the same heap page returned from an index scan" optimization still requires some general awareness of index AM costs on the table AM side.
>
> It only makes sense for us to batch-up extra TIDs (from the same heap page) when determining which TIDs are about to be accessed as a group isn't too expensive/the information is readily available to the table AM, because it requested it from the index AM itself. We're setting a new precedent by saying that it's okay to share certain knowledge across what we previously thought of as strictly separate layers of abstraction. I think that that makes sense (what else could possibly work?), but I want to draw specific attention to that now.
>
> * We'll still need index-only scans to do things in a way that prevents inconsistencies/changing our mind in terms of which TIDs are all-visible.
>
> This has the advantage of allowing us to avoid accessing the visibility map from the executor proper, which is an existing modularity violation that we already agree ought to be fixed. This will also keep us honest (we won't be deferring more than we should). But that's not why I think it's essential to move VM accesses into the table AM.
>
> We should only batch together accesses to a heap page when we know for sure that those TIDs will in fact be accessed. How are we supposed to have general and robust handling for all that, in a world where the visibility map continues to be accessed from the executor proper? At best, not handling VM integration comprehensively (for index-only scans) ties our hands around reordering work, and seems like it'd be very brittle. It would likely have similar problems to our current problems with managing a read stream in indexam.c, while relying on tacit knowledge of how precisely those same heap blocks will later actually be accessed from the heapam side.
>
> The sensible solution is to put control of the scan's progress all in one place. We don't want to have to worry about what happens when the VM is concurrently set or unset.
>
> When Andres and Tomas talk about table AM modularity stuff, they tend to focus on why it's bad that the table AM interface uses heap TIDs specifically. I agree with all that. But even if I didn't, everything that I just said about the need to centralize control in the table AM would still be true. That's why I'm focussing on that here (it's really pretty subtle).
>
> That's all I have for now. My thoughts here should be considered tentative; I want to put my thinking on a more rigorous footing before really committing to this new phased approach.

I don't object to the "phased approach" of doing the batching first, but without seeing the code I can't really say if/how much it helps with resolving the design/layering questions. It feels a bit too abstract to me.

While working on the prefetching I moved the code between layers about three times, and I'm still not quite sure which layer should be responsible for which piece :-(

regards

--
Tomas Vondra