Re: index prefetching - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | CAH2-Wz=PR6rUniOwyeo+BcR62fVZU5UR4ci0UH4kZEuaY_td8A@mail.gmail.com |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Wed, Nov 12, 2025 at 12:39 PM Tomas Vondra <tomas@vondra.me> wrote:

> I think I generally agree with what you said here about the challenges,
> although it's a bit too abstract to respond to individual parts. I just
> don't know how to rework the design to resolve this ...

I'm trying to identify which subsets of the existing design can reasonably be committed in a single release (while acknowledging that even those subsets will need to be reworked). That is more abstract than any of us would like -- no question.

What are we most confident will definitely be useful to prefetching, that also enables the "only lock heap buffer once per group of TIDs that point to the same heap page returned from an index scan" optimization? I'm trying to reach a tentative agreement that just doing the amgetbatch revisions and the table AM revisions (to do the other heap buffer lock optimization) will represent useful progress that can be committed in a single release -- and on what the specifics of the table AM revisions will need to be, to get us to a patch that we can commit to Postgres 19.

> For the read stream "pausing" I think it's pretty clear it's more a
> workaround than a desired behavior. We only pause the stream because we
> need to limit the look-ahead distance (measured in index leaf pages),
> and the read_stream has no such concept. It only knows about heap pins,
> but e.g. IOS may need to read many leaf pages to find a single heap page
> to prefetch. And the leaf pages are invisible to the stream.

Right. But we seemed to talk about this as if the implementation of "pausing" was the problem. I was suggesting that the general idea of pausing might well be the wrong one -- at least when applied in anything like the way we currently apply it.

More importantly, I feel that it'll be really hard to get a clear answer to that particular question (and a couple of others like it) without first getting clarity on what we need from the table AM at a high level, API-wise -- bearing in mind that we've made no real progress on that at all.

We all agree that it's bad that indexam.c tacitly coordinates with heapam in the way it does in the current patch, and that assuming a TID representation in the API is bad. But that isn't very satisfying to me; it's too focussed on that one really obvious and glaring problem, and on what we *don't* want. There's been very little (almost nothing) on this thread about what we actually *do* want. That's the thing that's still way too abstract, and that I'd like to make more concrete.

As you know, I think that we should add a new table AM interface that makes the table AM directly aware of the fact that it is feeding an ordered index scan, completely avoiding the use of TIDs (as well as avoiding *any* more abstract representation of a table AM tuple identifier). In other words, I think that we should just fully admit that the table AM is in control of the scan, and all that comes with it. The table AM will have to directly coordinate with the index AM in a way that's quite different to what we do right now.

I don't think that anybody else has really said much about that idea, at least on the list. Is it a reasonable approach to take? This is really important, especially in the short term/for Postgres 19.
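(As an aside, to make the buffer-lock point above concrete: the sketch below is a minimal, standalone illustration -- not PostgreSQL code, and not taken from the patch -- of the kind of grouping that the "lock the heap buffer once per group of same-page TIDs" optimization relies on. The Tid struct and process_run are made up for the example; in the real thing the table AM would pin and lock an actual heap buffer once per run of same-block TIDs instead of printing.)

```c
#include <stdint.h>
#include <stdio.h>

/* made-up stand-in for a heap tuple identifier: (block, offset) */
typedef struct
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within the block */
} Tid;

static void
process_run(uint32_t block, const Tid *tids, int ntids)
{
    /*
     * Stand-in for: pin and lock the heap buffer for 'block' once, fetch
     * all 'ntids' tuples while holding that one lock, then unlock/unpin.
     */
    printf("block %u: %d tuple(s) under a single buffer lock\n",
           (unsigned) block, ntids);
    (void) tids;
}

int
main(void)
{
    /* TIDs in the order an index scan might return them */
    Tid         tids[] = {{10, 1}, {10, 7}, {10, 9}, {11, 2}, {11, 3}, {42, 5}};
    int         ntids = (int) (sizeof(tids) / sizeof(tids[0]));
    int         start = 0;

    /* emit one run per consecutive group of TIDs sharing a heap block */
    for (int i = 1; i <= ntids; i++)
    {
        if (i == ntids || tids[i].block != tids[start].block)
        {
            process_run(tids[start].block, &tids[start], i - start);
            start = i;
        }
    }
    return 0;
}
```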
> The limit of 64 batches is entirely arbitrary. I needed a number that
> would limit the amount of memory and time wasted on useless look-ahead,
> and 64 seemed "reasonable" (not too high, but enough to not be hit very
> often). Originally there was a fixed-length queue of batches, and 64 was
> the capacity, but we no longer do it that way. So it's an imperfect
> safety measure against "runaway" streams.

Right, but we still max out at 64 -- and then we stay there. It just feels unprincipled to me.

> I don't want to get into too much detail about this particular issue,
> it's already discussed somewhere in this thread. But if there was a way
> to "tell" the read stream how much effort to spend looking ahead, we
> wouldn't do the pausing (not in the end+reset way).

I don't want to get into that again either. It was just an example of the kinds of problems we're running into -- though a particularly good example IMV.

> > That's all I have for now. My thoughts here should be considered
> > tentative; I want to put my thinking on a more rigorous footing before
> > really committing to this new phased approach.
>
> I don't object to the "phased approach" with doing the batching first,
> but without seeing the code I can't really say if/how much it helps with
> resolving the design/layering questions. It feels a bit too abstract to
> me.

It is in no small part based on gut feeling and intuition. I don't have anything better to go on right now. It's a really difficult project.

> While working on the prefetching I moved the code between layers
> about three times, and I'm still not quite sure which layer should be
> responsible for which piece :-(

I don't think that this is quite the same situation. The index prefetching design has been completely overhauled twice now, but on both occasions that was driven by some clear goal/the need to fix some problem with the prior design. The first time it was because the original version didn't work with kill_prior_tuple. The second time it was because of the need to support reading index pages that are ahead of the current page that the scan is returning tuples from. Granted, it took a while to actually prove that the second overhaul (which produced the third major design) was the right direction to take things in, but testing did eventually make that quite clear.

I don't see this as doing the same thing a third time/creating a fourth design from scratch. It's more of a refinement (albeit quite a big one) of the most recent design, and in a direction that doesn't seem too surprising to me. We knew that the table AM side of the most recent redesign still had plenty of problems. We should have been a bit more focussed on that side of things earlier on.

--
Peter Geoghegan