Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | CAH2-Wz=8kzZCyJ46X9rA-Un1hvjdHj5gDgmOfZ5cRQBvRarJYg@mail.gmail.com |
In response to | Re: index prefetching (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund <andres@anarazel.de> wrote:
> On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrent TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
>
> Given that the interlock is only needed for non-mvcc scans, that non-mvcc
> scans are rare due to catalog accesses using snapshots these days, and that
> most non-mvcc scans do single-tuple lookups, it might be viable to be more
> restrictive about prefetching iff non-mvcc snapshots are in use, and to use
> a method of cleanup that allows multiple pages to be cleaned up otherwise.

I agree, but don't think that it matters all that much. If you have an
MVCC snapshot, that doesn't mean that TID recycle safety problems
automatically go away. It only means that you have one known and supported
alternative approach to dealing with such problems. It's not like you just
get that for free, just by using an MVCC snapshot, though -- it has
downsides.

Downsides such as the current _bt_killitems() behavior with a
concurrently-modified leaf page (modified when we didn't hold a leaf page
pin). It'll just give up on setting any LP_DEAD bits once it notices that
the leaf page's LSN has changed. (Plus there are implementation
restrictions that I won't repeat again now.)

When I refer to the buffer pin interlock, I'm mostly referring to the
general need for something like that in the context of index scans,
principally in order to make kill_prior_tuple continue to work in something
more or less like its current form.
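To make the _bt_killitems() point concrete, the guard near the top of that
function looks roughly like this (a simplified sketch from memory, not a
verbatim excerpt -- surrounding declarations and details are elided):

    /*
     * Sketch: LP_DEAD hints are only set when we can prove that no TID on
     * the leaf page could have been concurrently recycled by VACUUM.
     */
    if (BTScanPosIsPinned(so->currPos))
    {
        /*
         * We held the pin since reading the items, so no TID on the page
         * can have been recycled -- just lock the page and proceed.
         */
        _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
        page = BufferGetPage(so->currPos.buf);
    }
    else
    {
        /* We dropped the pin earlier; re-read the page and check its LSN */
        Buffer      buf = _bt_getbuf(scan->indexRelation,
                                     so->currPos.currPage, BT_READ);

        page = BufferGetPage(buf);
        if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
        {
            /*
             * The page changed while we weren't holding a pin: give up on
             * setting any LP_DEAD bits for it, as described above.
             */
            _bt_relbuf(scan->indexRelation, buf);
            return;
        }
        so->currPos.buf = buf;
    }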
> However, I don't think we would necessarily have to relax the IAM pinning
> rules, just to be able to do prefetching of more than one index leaf page.

To be clear, we already do relax the IAM pinning rules. Or at least nbtree
selectively opts out, as I've gone into already.

> Restricting prefetching to entries within a single leaf page obviously has
> the disadvantage of not being able to benefit from concurrent IO whenever
> crossing a leaf page boundary, but at the same time processing entries from
> just two leaf pages would often allow for sufficiently aggressive
> prefetching. Pinning a small number of leaf pages instead of a single leaf
> page shouldn't be a problem.

You're probably right. I just don't see any need to solve that problem in v1.

> One argument for loosening the tight coupling between kill_prior_tuples and
> index scan progress is that the lack of kill_prior_tuples for bitmap scans
> is quite problematic. I've seen numerous production issues with bitmap
> scans caused by subsequent scans processing a growing set of dead tuples,
> where plain index scans were substantially slower initially but didn't get
> much slower over time.

I've seen production issues like that too. No doubt it's a problem.

> We might be able to design a system where the bitmap contains a certain
> number of back-references to the index, allowing later cleanup if there
> weren't any page splits or such.

That does seem possible, but do you really want a design for index
prefetching that relies on that massive enhancement (a total redesign of
kill_prior_tuple) happening at some point in the not-too-distant future?
Seems risky, from a project management point of view.

This back-references idea seems rather complicated, especially if it needs
to work with very large bitmap index scans. Since you'll still have the
basic problem of TID recycle safety to deal with (even with an MVCC
snapshot), you don't just have to revisit the leaf pages. You also have to
revisit the corresponding heap pages (generally they'll be a lot more
numerous than leaf pages). You'll have traded one problem for another
(which is not to say that it's not a good trade-off).

Right now the executor uses an amgettuple interface, and knows nothing
about index-related costs (e.g., pages accessed in any index, index qual
costs), while the index AM has some limited understanding of heap access
costs. So the index AM kinda knows a small bit about both types of costs
(possibly not enough, but something). That informs the language I'm using
to describe all this. (There's a rough sketch of the amgettuple flow I mean
at the end of this mail, for reference.)

To do something like your "back-references to the index" thing well, I
think that you need more dynamic behavior around when you visit the heap to
get heap tuples pointed to by TIDs from index pages (i.e. dynamic behavior
that determines how many leaf pages to read before going to the heap to get
pointed-to TIDs). That is basically what I meant by "put the index AM in
control" -- it doesn't *strictly* require that the index AM actually do
that. Just that a single piece of code has to have access to the full
context, in order to make the right trade-offs around how both index and
heap accesses are scheduled.

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the LIMIT issue I described).
>
> Depending on what "control" means I'm doubtful:
>
> Imo there are decisions influencing prefetching that an index AM shouldn't
> need to know about directly, e.g. how the plan shape influences how many
> tuples are actually going to be consumed. Of course that determination
> could be made in planner/executor and handed to IAMs, for the IAM to then
> "control" the prefetching.

I agree with all this.

--
Peter Geoghegan
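For reference, the amgettuple flow mentioned above is roughly the following
(a simplified sketch of index_getnext_slot() and index_fetch_heap() from
indexam.c, written from memory -- not verbatim; asserts, pgstat counters,
and other details are omitted):

    /*
     * Executor-side loop: one TID at a time.  The executor has no visibility
     * into how the index AM batches leaf pages -- that all happens behind
     * the amgettuple callback, inside index_getnext_tid().
     */
    bool
    index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
                       TupleTableSlot *slot)
    {
        for (;;)
        {
            if (!scan->xs_heap_continue)
            {
                ItemPointer tid;

                /* Ask the index AM (via amgettuple) for the next TID */
                tid = index_getnext_tid(scan, direction);
                if (tid == NULL)
                    break;
            }

            /* Fetch a visible heap tuple for this TID, if there is one */
            if (index_fetch_heap(scan, slot))
                return true;
        }

        return false;
    }

    /*
     * index_fetch_heap() is where kill_prior_tuple gets set: when the whole
     * HOT chain behind a TID turns out to be dead, the flag tells the index
     * AM to remember that index entry so it can be hinted LP_DEAD later
     * (e.g. by _bt_killitems()).  This is the coupling between
     * kill_prior_tuple and index scan progress discussed above.
     */
    bool
    index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
    {
        bool        all_dead = false;
        bool        found;

        found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
                                        scan->xs_snapshot, slot,
                                        &scan->xs_heap_continue, &all_dead);

        if (!scan->xactStartedInRecovery)
            scan->kill_prior_tuple = all_dead;

        return found;
    }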