Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | CAH2-Wzkrej9cXjERrA5p8pgD9QfR0LZwCCcgPPu6wiRgFpYVQQ@mail.gmail.com |
In response to | Re: index prefetching (Melanie Plageman <melanieplageman@gmail.com>) |
Responses | Re: index prefetching |
List | pgsql-hackers |
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, a pin on the index leaf page is sufficient to keep line pointers
> from being reused? If we stick to prefetching heap blocks referred to
> by index tuples in a single index leaf page, and we keep that page
> pinned, will we still have a problem?

That's certainly one way of dealing with it. Obviously, there are
questions about how you do that in a way that consistently avoids
creating new problems.

> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I think that I'd rather use information at execution time instead, if
at all possible (perhaps in addition to a hint given by the planner).
But it seems a bit premature to discuss this problem now, except to say
that it might indeed be a problem.
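[Editor's note: to make the execution-time idea concrete, here is a
minimal standalone sketch -- a model of the general technique, not code
from any patch, with every name hypothetical -- of a prefetch distance
that ramps up only as the scan keeps consuming tuples, so a query that
stops early under a LIMIT never pays for deep read-ahead:]

```c
/*
 * Standalone model of an execution-time prefetch heuristic. All names
 * are hypothetical; PrefetchBuffer() calls are simulated with printf.
 */
#include <stdio.h>

#define MAX_DISTANCE 32             /* think effective_io_concurrency */

typedef struct PrefetchState
{
    int         distance;           /* current read-ahead distance */
    int         target;             /* last block already requested */
} PrefetchState;

/* Called each time the scan consumes the tuple at block "pos". */
static void
consumed_tuple(PrefetchState *ps, int pos)
{
    /* Ramp up gradually: each consumed tuple earns more read-ahead. */
    if (ps->distance == 0)
        ps->distance = 1;
    else if (ps->distance < MAX_DISTANCE)
        ps->distance *= 2;

    /* Only issue prefetches for blocks not already requested. */
    while (ps->target < pos + ps->distance)
    {
        ps->target++;
        printf("prefetch block %d\n", ps->target);  /* PrefetchBuffer() stand-in */
    }
}

int
main(void)
{
    PrefetchState ps = {0, 0};

    /* A scan that stops after three tuples never reaches full distance. */
    for (int pos = 1; pos <= 3; pos++)
        consumed_tuple(&ps, pos);
    return 0;
}
```

[Bitmap heap scans already grow their prefetch target in roughly this
way, which is some evidence that execution-time feedback can stand in
for a planner hint here.]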
> > It's certainly possible that you could figure out various workarounds
> > for each of these issues (plus the kill_prior_tuple issue) with a
> > prefetching design that "de-synchronizes" the index access and the
> > heap access. But it might well be better to extend the existing design
> > in a way that just avoids all these problems in the first place. Maybe
> > "de-synchronization" really can pay for itself (because the benefits
> > will outweigh these costs), but if you go that way then I'd really
> > prefer it that way.
>
> Forcing each index access to be synchronous and interleaved with each
> table access seems like an unprincipled design constraint. While it is
> true that we rely on that in our current implementation (when using
> non-MVCC snapshots), it doesn't seem like a principle inherent to
> accessing indexes and tables.

There is nothing sacred about the way plain index scans work right now
-- especially the part about buffer pins as an interlock. If the pin
thing really was sacred, then we could never have allowed nbtree to
selectively opt out in cases where it's possible to provide an
equivalent correctness guarantee without holding onto buffer pins,
which, as I went into, is how it actually works in nbtree's
_bt_killitems() today (see commit 2ed5b87f96 for full details).
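[Editor's note: for anyone who hasn't internalized what _bt_killitems()
does there, the following standalone model -- simplified stand-in
types, not nbtree's actual structures -- shows the "remember the page
LSN at read time, and only set LP_DEAD hints later if the LSN hasn't
moved" scheme that substitutes for holding the pin:]

```c
/*
 * Standalone model of the "drop pin, check LSN" trick behind nbtree's
 * _bt_killitems() (commit 2ed5b87f96). Simplified stand-ins only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;        /* page LSN, as in PostgreSQL */

typedef struct LeafPage
{
    XLogRecPtr  lsn;                /* bumped by every WAL-logged change */
    bool        dead[16];           /* LP_DEAD hint bits, one per item */
} LeafPage;

typedef struct ScanPos
{
    LeafPage   *page;               /* leaf page the scan last read */
    XLogRecPtr  lsn_at_read;        /* page LSN remembered at read time */
    int         killed[16];         /* items whose heap tuples were dead */
    int         nkilled;
} ScanPos;

/*
 * Set LP_DEAD hints for items the scan found to be dead, but only if
 * the page provably hasn't changed since we read it: with no pin held,
 * VACUUM may have recycled TIDs, and any such change bumps the LSN.
 */
static void
kill_items(ScanPos *pos)
{
    if (pos->page->lsn != pos->lsn_at_read)
        return;                     /* page changed; unsafe to set hints */

    for (int i = 0; i < pos->nkilled; i++)
        pos->page->dead[pos->killed[i]] = true;
}

int
main(void)
{
    LeafPage    page = {.lsn = 42};
    ScanPos     pos = {.page = &page, .lsn_at_read = 42,
                       .killed = {3, 7}, .nkilled = 2};

    kill_items(&pos);               /* LSN unchanged: hints get set */
    printf("item 3 dead: %d\n", page.dead[3]);

    page.lsn = 43;                  /* concurrent change (e.g. VACUUM) */
    pos.killed[0] = 5;
    pos.nkilled = 1;
    kill_items(&pos);               /* LSN moved: safely skipped */
    printf("item 5 dead: %d\n", page.dead[5]);
    return 0;
}
```

[The pin normally prevents VACUUM from recycling the scan's TIDs; when
the pin was dropped, an unchanged LSN proves nothing touched the page
in the meantime, so setting the hint bits remains safe.]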
And so in principle I have no problem with the idea of revising the
basic definition of plain index scans -- especially if it's to make the
definition more abstract, without fundamentally changing it (e.g., to
make it no longer reference buffer pins, making life easier for
prefetching, while at the same time still implying the same underlying
guarantees sufficient to allow nbtree to mostly work the same way as
today). All I'm really saying is:

1. The sort of tricks that we can do in nbtree's _bt_killitems() are
quite useful, and ought to be preserved in something like their current
form, even when prefetching is in use.

This seems to push things in the direction of centralizing control of
the process in index scan code. For example, it has to understand that
_bt_killitems() will be called at some regular cadence that is well
defined and sensible from an index point of view.

2. Are you sure that the leaf-page-at-a-time thing is such a huge
hindrance to effective prefetching? I suppose that it might be much
more important than I imagine it is right now, but it'd be nice to have
something a bit more concrete to go on.

3. Even if it is somewhat important, do you really need to get that
part working in v1?

Tomas' original prototype worked with the leaf-page-at-a-time thing,
and that still seemed like a big improvement to me, while being less
invasive, in effect. If we can agree that something like that
represents a useful step in the right direction (not an evolutionary
dead end), then we can make good incremental progress within a single
release.

> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching, at the end of the
> day, the table AM needs some TID or TID-equivalents whose blocks it
> can go fetch.

I wasn't really thinking of index page prefetching at all. Just the
cost of applying index quals to read leaf pages that might never
actually need to be read, due to the presence of a LIMIT. That is kind
of a new problem created by eagerly reading (without actually
prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think having so many members of the IndexScanDescriptor
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error prone.

But that's kinda how amgettuple is supposed to work -- cursors need it
to work that way. Having some kind of general notion of scan order is
also important to avoid returning duplicate TIDs to the scan. In
contrast, GIN heavily relies on the fact that it only supports bitmap
scans -- that allows it to not have to reason about returning duplicate
TIDs (when dealing with a concurrently merged pending list, and other
stuff like that).

And so nbtree (and basically every other index AM that supports plain
index scans) kinda pretends to process a single tuple at a time, in
some fixed order that's convenient for the scan to work with (that's
how the executor thinks of things). In reality these index AMs actually
process batches consisting of a single leaf page worth of tuples.

I don't see how the IndexScanDescData side of things makes life any
harder for this patch -- ISTM that you'll always need to pretend to
return one tuple at a time from the index scan, regardless of what
happens under the hood, with pins and whatnot. The page-at-a-time thing
is more or less an implementation detail that's private to index AMs
(albeit in a way that follows certain standard conventions across index
AMs) -- it's a leaky abstraction only due to the interactions with
VACUUM/TID recycle safety.
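[Editor's note: as a toy but runnable illustration of that convention,
here is a standalone model -- made-up types and values, not the real
nbtree structures -- of an AM that hands back one TID per call while
privately batching a leaf page worth of items:]

```c
/*
 * Standalone model of the page-at-a-time convention: the AM pretends
 * to return one TID per call, but under the hood it reads a whole leaf
 * page's worth of matching items into a private batch. Stand-ins only.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ITEMS 4                 /* pretend leaf pages hold 4 items */

typedef struct ScanState
{
    int         items[MAX_ITEMS];   /* TIDs from the current leaf page */
    int         nitems;
    int         itemIndex;          /* next item to hand back */
    int         next_page;          /* pretend right-sibling link */
} ScanState;

/* Read the next "leaf page" into the batch; false when the scan ends. */
static bool
read_next_page(ScanState *ss)
{
    if (ss->next_page >= 3)         /* pretend the index has 3 leaf pages */
        return false;
    for (int i = 0; i < MAX_ITEMS; i++)
        ss->items[i] = ss->next_page * 100 + i;    /* fake TIDs */
    ss->nitems = MAX_ITEMS;
    ss->itemIndex = 0;
    ss->next_page++;
    return true;
}

/* amgettuple-style interface: one "TID" per call, in scan order. */
static bool
getnext_tid(ScanState *ss, int *tid)
{
    if (ss->itemIndex >= ss->nitems && !read_next_page(ss))
        return false;
    *tid = ss->items[ss->itemIndex++];
    return true;
}

int
main(void)
{
    ScanState   ss = {0};
    int         tid;

    while (getnext_tid(&ss, &tid))
        printf("return TID %d\n", tid);
    return 0;
}
```

[The batch refill step is also the natural place where prefetching of
the heap blocks for the whole page's TIDs could slot in, without
changing the one-tuple-at-a-time contract that cursors and the executor
rely on.]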
> Re 2: so the LSN could have been changed by some other process (i.e.
> not vacuum), so how often in practice is the LSN actually the same as
> when the page was scanned/read?

It seems very hard to make generalizations about that sort of thing. It
doesn't help that we now have batching logic inside _bt_simpledel_pass()
that will make up for the problem of not setting as many LP_DEAD bits
as we could in many important cases. (I recall that that was one factor
that allowed the bug that Andres fixed in commit 90c885cd to go
undetected for months. I recall discussing the issue with Andres around
that time.)

> Do you think we would catch a meaningful number of kill prior tuple
> opportunities if we used an LSN tracking method like this? Something
> that let us drop the pin on the page would obviously be better.

Quite possibly, yes. But it's hard to say for sure without far more
detailed analysis. Plus you have problems with things like unlogged
indexes not having an LSN to use as a canary condition, which makes it
a bit messy (it's already kind of weird that we treat unlogged indexes
differently here IMV).

--
Peter Geoghegan