Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: index prefetching
Date |
Msg-id | CAH2-WznkG-v3ioi8mjc4BERi75GL=6GvbP-etPTU4AT=uKMxdw@mail.gmail.com
In response to | Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses | Re: index prefetching
List | pgsql-hackers
On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote:
> Attached is an updated version of this patch series. The first couple
> parts (adding batching + updating built-in index AMs) remain the same,
> the new part is 0007 which switches index scans to read stream API.

The first thing that I notice about this patch series is that it doesn't fully remove amgettuple as a concept. That seems a bit odd to me. After all, you've invented a single page batching mechanism, which is duplicative of the single page batching mechanism that each affected index AM has to use already, just to be able to allow the amgettuple interface to iterate backwards and forwards with a scrollable cursor (and to make mark/restore work). ISTM that you have one too many batching interfaces here.

I can think of nothing that makes the task of completely replacing amgettuple particularly difficult. I don't think that the need to do the _bt_killitems stuff actually makes this task all that much harder. It will need to be generalized, too, by keeping track of multiple BTScanOpaqueData.killedItems[] style states, each of which is associated with its own page-level currPos state. But that's not rocket science. (I also don't think that mark/restore support is all that hard.)

The current way in which _bt_kill_batch() is called from _bt_steppage() by the patch seems weird to me. You're copying what you actually know to be the current page's kill items such that _bt_steppage() will magically do what it does already when the amgettuple/btgettuple interface is in use, just as we're stepping off the page. It seems to be working at the wrong level.

Notice that the current way of doing things in your patch means that your new batching interface tacitly knows about the nbtree batching interface, and that it too works along page boundaries -- that's the only reason why it can hook into _bt_steppage like this in the first place. Things are way too tightly coupled, and the old and new ways of doing things are hopelessly intertwined. What's being abstracted away here, really?

I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at all -- nor should it even call _bt_killitems(). Things need to be broken down into smaller units of work that can be reordered, instead. The first half of the current _bt_steppage() function, which deals with finishing off the current leaf page, should be moved to some other function -- let's call it _bt_finishpage (sketched below). A new callback should be called as part of the new API when the time comes to tell nbtree that we're now done with a given leaf page -- that's what this new _bt_finishpage function is for. All that remains of _bt_steppage() are the parts that deal with figuring out which page should be visited next -- the second half of _bt_steppage stays put.

That way stepping to the next page and reading multiple pages can be executed as eagerly as makes sense -- we don't need to "coordinate" the heap accesses in lockstep with the leaf page accesses. Maybe you won't take advantage of this flexibility right away, but ISTM that you need nominal support for this kind of reordering to make the new API really make sense.

There are some problems with this scheme, but they seem reasonably tractable to me. We already have strategies for dealing with the risk of concurrent TID recycling when _bt_killitems is called with some maybe-recycled TIDs -- we're already dropping the pin on the leaf page early in many cases. I've pointed this out many times already (again, see _bt_drop_lock_and_maybe_pin).
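To make that _bt_steppage() split concrete, here is a minimal sketch. None of this is taken from the patch; apart from the existing nbtree types and macros, every name here (BTBatchPos, _bt_finishpage's signature, the killed-items helper) is invented for illustration:

```c
/*
 * Minimal sketch only -- not from the patch.  BTBatchPos generalizes
 * BTScanOpaqueData's single killedItems[] array: each leaf-page batch
 * carries its own page-level position and its own killed items, so
 * several leaf pages can be "in flight" on the table AM side at once.
 */
typedef struct BTBatchPos
{
    BTScanPosData currPos;      /* page-level scan position, as today */
    int         killedItems[MaxTIDsPerBTreePage];   /* itemIndex values */
    int         numKilled;
} BTBatchPos;

/*
 * Invoked via a new index AM callback once the table AM reports that it
 * is done with every TID from this leaf page.  This is the first half of
 * today's _bt_steppage(); the second half (deciding which page to visit
 * next) stays behind in _bt_steppage().
 */
static void
_bt_finishpage(IndexScanDesc scan, BTBatchPos *batch)
{
    /* set LP_DEAD bits for any index tuples known to be dead */
    if (batch->numKilled > 0)
        _bt_killitems_batch(scan, batch);   /* hypothetical _bt_killitems variant */

    /* release whatever pin was held to guard against TID recycling */
    BTScanPosUnpinIfPinned(batch->currPos);
}
```

The point of the split is that _bt_finishpage can run whenever the table AM happens to finish with a given page, in whatever order that happens, rather than in lockstep with stepping to the next leaf page.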
It's true that we're still going to have to hold onto a buffer pin on leaf pages whose TIDs haven't all been read from the table AM side yet, unless we know that it's a case where that's safe for other reasons -- otherwise index-only scans might give wrong answers. But that other problem shouldn't be confused with the _bt_killitems problem, just because of the superficial similarity around holding onto a leaf page pin.

To repeat: it is important that you not conflate the problems on the table AM side (TID recycling safety for index scans) with the problems on the index AM side (safely setting LP_DEAD bits in _bt_killitems). They're two separate problems that are currently dealt with as one problem on the nbtree side -- but that isn't fundamental. Teasing them apart seems likely to be helpful here.

> I speculated that with the batching concept it might work better, and I
> think that turned out to be the case. The batching is still the core
> idea, giving the index AM enough control to make kill tuples work (by
> not generating batches spanning multiple leaf pages, or doing something
> smarter). And the read stream leverages that too - the next_block
> callback returns items from the current batch, and the stream is reset
> between batches. This is the same prefetch restriction as with the
> explicit prefetching (done using posix_fadvise), except that the
> prefetching is done by the read stream.

ISTM that the central feature of the new API should be the ability to reorder certain kinds of work. There will have to be certain constraints, of course. Sometimes these will principally be problems for the table AM (e.g., we mustn't allow concurrent TID recycling unless it's for a plain index scan using an MVCC snapshot), other times they're principally problems for the index AM (e.g., the _bt_killitems safety issues).

I get that you're not that excited about multi-page batches; it's not the priority. Fair enough. I just think that the API needs to work in terms of batches that are sized as one or more pages, in order for it to make sense.

BTW, the README changes you made are slightly wrong about pins and locks. We don't actually keep around C pointers to IndexTuples for index-only scans that point into shared memory -- that won't work. We simply copy whatever IndexTuples the scan returns into local state, associated with so->currPos. So that isn't a complicating factor, at all.

That's all I have right now. Hope it helps.
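PS: To make the next_block arrangement quoted above concrete, here is a minimal sketch against the read stream API. Only the read stream interface itself (ReadStream, the block-number callback shape, read_stream_reset) is real; the batch bookkeeping names are invented for illustration and are not taken from the patch:

```c
/*
 * Minimal sketch only.  The next_block callback feeds the stream heap
 * block numbers for the TIDs of the current single-leaf-page batch;
 * once the batch is exhausted it returns InvalidBlockNumber, and the
 * stream is reset when the index AM produces the next batch.
 */
typedef struct IndexBatchState
{
    ItemPointerData *items;     /* TIDs from the current leaf-page batch */
    int         nitems;
    int         nextitem;
} IndexBatchState;

static BlockNumber
index_batch_next_block(ReadStream *stream,
                       void *callback_private_data,
                       void *per_buffer_data)
{
    IndexBatchState *batch = (IndexBatchState *) callback_private_data;

    if (batch->nextitem >= batch->nitems)
        return InvalidBlockNumber;      /* end of batch: stream goes idle */

    return ItemPointerGetBlockNumber(&batch->items[batch->nextitem++]);
}

/*
 * ...and when the index AM loads the next leaf-page batch:
 *
 *     batch->nextitem = 0;
 *     read_stream_reset(stream);      -- allow prefetching to resume
 */
```

Resetting between batches is what keeps the prefetch window from crossing leaf page boundaries -- the same restriction as with the explicit posix_fadvise approach, per the quoted text.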
--
Peter Geoghegan