Re: index prefetching - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-WznBuxhvsEgX3mYDjxKhQk9GFdF46vMfE2ugU6SUekHp_A@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > Another thing that argues against doing this is that we might not need
> > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > We could end up scanning a whole extra leaf page (including all of its
> > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > (that's not what happens right now, but it isn't really needed at all
> > right now).
>
> I'm not quite sure I understand what is "this" that you argue against.
> Are you saying we should not separate the two scans? If yes, is there a
> better way to do this?

What I'm concerned about is the difficulty and complexity of any design
that requires revising "63.4. Index Locking Considerations", since
that's pretty subtle stuff. In particular, if prefetching
"de-synchronizes" (to use your term) the index leaf page level scan and
the heap page scan, then we'll probably have to totally revise the basic
API.

Maybe that'll actually turn out to be the right thing to do -- it could
just be the only thing that can unleash the full potential of
prefetching. But I'm not aware of any evidence that points in that
direction. Are you? (I might have just missed it.)

> The LIMIT problem is not very clear to me either. Yes, if we get close
> to the end of the leaf page, we may need to visit the next leaf page.
> But that's kinda the whole point of prefetching - reading stuff ahead,
> and reading too far ahead is an inherent risk. Isn't that a problem we
> have even without LIMIT? The prefetch distance ramp up is meant to limit
> the impact.

Right now, the index AM doesn't know anything about LIMIT at all. That
doesn't matter, since the index AM can only read/scan one full leaf page
before returning control back to the executor proper. The executor
proper can just shut down the whole index scan upon finding that we've
already returned N tuples for a LIMIT N.
We don't do prefetching right now, but we also don't risk reading a leaf
page that'll just never be needed. Those two things are in tension, but
I don't think that that's quite the same thing as the usual standard
prefetching tension/problem. Here there is uncertainty about whether
what we're prefetching will *ever* be required -- not uncertainty about
when exactly it'll be required. (Perhaps this distinction doesn't mean
much to you. I'm just telling you how I think about it, in case it helps
move the discussion forward.)

> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrent TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
>
> Good point.

The main reason why the index AM docs require this interlock is because
we need such an interlock to make non-MVCC snapshot scans safe. If you
remove the interlock (the buffer pin interlock that protects against TID
recycling by VACUUM), you can still avoid the same race condition by
using an MVCC snapshot. This is why using an MVCC snapshot is a
requirement for bitmap index scans. I believe that it's also a
requirement for index-only scans, but the index AM docs don't spell that
out.

Another factor that complicates things here is mark/restore processing.
The design for that has the idea of processing one page at a time
baked-in. Kinda like with the kill_prior_tuple issue.

It's certainly possible that you could figure out various workarounds
for each of these issues (plus the kill_prior_tuple issue) with a
prefetching design that "de-synchronizes" the index access and the heap
access.
But it might well be better to extend the existing design in a way that
just avoids all these problems in the first place. Maybe
"de-synchronization" really can pay for itself (because the benefits
will outweigh these costs), but if you go that way then I'd really
prefer it that way.

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the LIMIT issue I described).
>
> In control how? What would be the control flow - what part would be
> managed by the index AM?

ISTM that prefetching for an index scan is about the index scan itself,
first and foremost. The heap accesses are usually the dominant cost, of
course, but sometimes the index leaf page accesses really do make up a
significant fraction of the overall cost of the index scan. Especially
with an expensive index qual. So if you just assume that the TIDs
returned by the index scan are the only thing that matters, you might
have a model that's basically correct on average, but is occasionally
very wrong. That's one reason for "putting the index AM in control".

As I said back in June, we should probably be marrying information from
the index scan with information from the heap. This is something that is
arguably a modularity violation. But it might just be that you really do
need to take information from both places to consistently make the right
trade-off.

Perhaps the best arguments for "putting the index AM in control" only
work when you go to fix the problems that "naive de-synchronization"
creates. Thinking about that side of things some more might make
"putting the index AM in control" seem more natural.

Suppose, for example, you try to make a prefetching design based on
"de-synchronization" work with kill_prior_tuple -- suppose you try to
fix that problem.
You're likely going to need to make some kind of trade-off that gets you
most of the advantages that that approach offers (assuming that there
really are significant advantages), while still retaining most of the
advantages that we already get from kill_prior_tuple (basically we want
to LP_DEAD-mark index tuples with almost or exactly the same consistency
as we manage today). Maybe your approach involves tracking multiple LSNs
for each prefetch-pending leaf page, or perhaps you hold on to a pin on
some number of leaf pages instead (right now nbtree does both [1], which
I go into more below). Either way, you're pushing stuff down into the
index AM.

Note that we already hang onto more than one pin at a time in rare cases
involving mark/restore processing. For example, it can happen for a
merge join that happens to involve an unlogged index, if the markpos and
curpos are a certain way relative to the current leaf page (yeah,
really). So putting stuff like that under the control of the index AM
(while also applying basic information that comes from the heap) in
order to fix the kill_prior_tuple issue is arguably something that has a
kind of a precedent for us to follow. Even if you disagree with me here
("precedent" might be overstating it), perhaps you still get some
general sense of why I have an inkling that putting prefetching in the
index AM is the way to go. It's very hard to provide one really strong
justification for all this, and I'm certainly not expecting you to just
agree with me right away. I'm also not trying to impose any conditions
on committing this patch.

Thinking about this some more, "making kill_prior_tuple work with
de-synchronization" is a bit of a misleading way of putting it. The way
that you'd actually work around this is (at a very high level)
*dynamically* making some kind of *trade-off* between synchronization
and desynchronization.
Up until now, we've been talking in terms of a strict dichotomy between
the old index AM API design (index-page-at-a-time synchronization), and
a "de-synchronizing" prefetching design that embraces the opposite
extreme -- a design where we only think in terms of heap TIDs, and
completely ignore anything that happens in the index structure (and
consequently makes kill_prior_tuple ineffective). That now seems like a
false dichotomy.

> I initially did the prefetching entirely in each index AM, but it was
> suggested doing this in the executor would be better. So I gradually
> moved it to executor. But the idea to combine this with the streaming
> read API seems as a move from executor back to the lower levels ... and
> now you're suggesting to make the index AM responsible for this again.

I did predict that there'd be lots of difficulties around the layering
back in June. :-)

> I'm not saying any of those layering options is wrong, but it's not
> clear to me which is the right one.

I don't claim to know what the right trade-off is myself. The fact that
all of these things are in tension doesn't surprise me. It's just a hard
problem.

> Possible. But AFAIK it did fail for Melanie, and I don't have a very
> good explanation for the difference in behavior.

If you take a look at _bt_killitems(), you'll see that it actually has
two fairly different strategies for avoiding TID recycling race
condition issues, applied in each of two different cases:

1. Cases where we really have held onto a buffer pin, per the index AM
API -- the "index AM orthodox" approach. (The aforementioned issue with
unlogged indexes exists because with an unlogged index we must use
approach 1, per the nbtree README section [1]).

2. Cases where we drop the pin as an optimization (also per [1]), and
now have to detect the possibility of concurrent modifications by
VACUUM (that could have led to concurrent TID recycling).
We conservatively do nothing (don't mark any index tuples LP_DEAD),
unless the LSN is exactly the same as it was back when the page was
scanned/read by _bt_readpage(). So some accidental detail with LSNs
(like using or not using an unlogged index) could cause bugs in this
area to "accidentally fail to fail", since the nbtree index AM has its
own optimizations here, which probably have a tendency to mask
problems/bugs. (I sometimes use unlogged indexes for some of my nbtree
related test cases, just to reduce certain kinds of variability,
including variability in this area.)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=52e646c7f759a5d9cfdc32b86f6aff8460891e12;hb=3e8235ba4f9cc3375b061fb5d3f3575434539b5f#l443

--
Peter Geoghegan