From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 225323b8-2c93-4784-8616-c58931a776bc@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On 11/7/24 01:38, Peter Geoghegan wrote:
> On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Attached is an updated version of this patch series. The first couple
>> parts (adding batching + updating built-in index AMs) remain the same,
>> the new part is 0007 which switches index scans to read stream API.
> 
> The first thing that I notice about this patch series is that it
> doesn't fully remove amgettuple as a concept. That seems a bit odd to
> me. After all, you've invented a single page batching mechanism, which
> is duplicative of the single page batching mechanism that each
> affected index AM has to use already, just to be able to allow the
> amgettuple interface to iterate backwards and forwards with a
> scrollable cursor (and to make mark/restore work). ISTM that you have
> one too many batching interfaces here.
> 
> I can think of nothing that makes the task of completely replacing
> amgettuple particularly difficult. I don't think that the need to do
> the _bt_killitems stuff actually makes this task all that much harder.
> It will need to be generalized, too, by keeping track of multiple
> BTScanOpaqueData.killedItems[] style states, each of which is
> associated with its own page-level currPos state. But that's not
> rocket science. (Also don't think that mark/restore support is all
> that hard.)
> 

The primary reason why I kept amgettuple() as is, and added a new AM
callback for the "batch" mode, is backwards compatibility. I did not
want to force all AMs to do this; I think it should be optional. Not
only to limit the disruption for out-of-core AMs, but also because I'm
not 100% sure every AM will be able to do batching in a reasonable way.
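
To illustrate what I mean (just a sketch of the idea, not the actual
patch code, and the signature is made up): the callback would simply be
optional in IndexAmRoutine, and indexam.c would fall back to the
classic path when the AM does not provide it.

    /* hypothetical shape of the optional callback */
    typedef struct IndexBatch IndexBatch;   /* opaque, AM-managed */

    typedef IndexBatch *(*amgetbatch_function) (IndexScanDesc scan,
                                                ScanDirection direction);

    /* in IndexAmRoutine; NULL means the AM does not support batching */
    amgetbatch_function amgetbatch;

    /* indexam.c picks the path based on what the AM provides */
    if (scan->indexRelation->rd_indam->amgetbatch != NULL)
    {
        /* batched path: amgetbatch + read stream */
    }
    else
    {
        /* classic path: amgettuple, one TID at a time */
    }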

I do agree that having AM-level batching, and then another layer of
batching in indexam.c, is a bit ... weird. To some extent this is a
leftover from an earlier patch version, but it's also based on some
suggestions by Andres about batching these calls into the AM for
efficiency reasons. To be fair, I was jetlagged and I'm not 100% sure
this is what he meant, or that it makes a difference in practice.

Yes, we could ditch the batching in indexam.c and rely just on the AM
batching, like now. But there are a couple of details that made the
separate batching seem convenient (a rough sketch follows the list):

1) We may need to stash some custom data for each TID (e.g. so that
index-only scans do not need to check the visibility map repeatedly).
But perhaps that could be delegated to the index AM too ...

2) We need to maintain two "positions" in the index. One for the item
the executor is currently processing (and which might end up getting
marked as "killed" etc). And another one for the "read" position, i.e.
items passed to the read stream API / prefetching, etc.

3) It makes it clear when the items are no longer needed, so the AM can
do cleanup, process kill tuples, etc.
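
To make (1) and (2) concrete, the per-batch state in indexam.c might
look roughly like this (illustrative only, the field names are made up):

    typedef struct IndexScanBatchState
    {
        ItemPointerData *items;      /* TIDs produced by the index AM */
        bool            *allVisible; /* (1) cached VM checks, for IOS */
        int              nitems;

        int              readPos;    /* (2) next item for the executor */
        int              streamPos;  /* (2) next item for the read stream */
    } IndexScanBatchState;

    /*
     * (3) once readPos moves past the last item, the AM is told the
     * batch is done, so it can process killed items and release the
     * leaf page.
     */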


> The current way in which _bt_kill_batch() is called from
> _bt_steppage() by the patch seems weird to me. You're copying what you
> actually know to be the current page's kill items such that
> _bt_steppage() will magically do what it does already when the
> amgettuple/btgettuple interface is in use, just as we're stepping off
> the page. It seems to be working at the wrong level.
> 

True, but that's how it worked before; it wasn't my ambition to rework
that.

> Notice that the current way of doing things in your patch means that
> your new batching interface tacitly knows about the nbtree batching
> interface, and that it too works along page boundaries -- that's the
> only reason why it can hook into _bt_steppage like this in the first
> place. Things are way too tightly coupled, and the old and new way of
> doing things are hopelessly intertwined. What's being abstracted away
> here, really?
> 

I'm not sure if by "new batching interface" you mean the indexam.c code
or the code in btgetbatch() etc.

I don't think indexam.c knows all that much about the nbtree internal
batching. It "just" relies on amgetbatch() producing items the AM can
handle later (during killtuples/cleanup etc.). It does not even need to
be a single-leaf-page batch, if the AM knows how to track/deal with that
internally. It was just easier to do it by restricting to a single leaf
page for now. But that's internal to the AM.
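
Roughly speaking, the indexam.c side only needs a loop like this
(batch_has_items / batch_next_item / amfreebatch are placeholders, not
actual names from the patch):

    for (;;)
    {
        IndexBatch *batch = amroutine->amgetbatch(scan, direction);

        if (batch == NULL)
            break;          /* scan exhausted */

        while (batch_has_items(batch))
            process_item(scan, batch_next_item(batch));

        /*
         * Only here does the AM learn we're done with the batch, so
         * this is where it can mark killed items and release whatever
         * it kept around (pins, memory, ...).
         */
        amroutine->amfreebatch(scan, batch);
    }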

Yes, it's true that inside the AM things are more intertwined, and some
of it sets things up so that the existing code does the right thing ...

> I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at
> all -- nor should it even call _bt_killitems(). Things need to be
> broken down into smaller units of work that can be reordered, instead.
> 
> The first half of the current _bt_steppage() function, the part that
> deals with finishing off the current leaf page, should be moved to some other
> function -- let's call it _bt_finishpage. A new callback should be
> called as part of the new API when the time comes to tell nbtree that
> we're now done with a given leaf page -- that's what this new
> _bt_finishpage function is for. All that remains of _bt_steppage() are
> the parts that deal with figuring out which page should be visited
> next -- the second half of _bt_steppage stays put.
> 
> That way stepping to the next page and reading multiple pages can be
> executed as eagerly as makes sense -- we don't need to "coordinate"
> the heap accesses in lockstep with the leaf page accesses. Maybe you
> won't take advantage of this flexibility right away, but ISTM that you
> need nominal support for this kind of reordering to make the new API
> really make sense.
> 

Yes, splitting _bt_steppage() like this makes sense to me, and I agree
that being able to proceed to the next page before we're done with the
current one seems perfectly reasonable for batches spanning multiple
leaf pages.
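
If I understand the proposal correctly, the nbtree side would end up
with something like this (signatures only, to show the split):

    /*
     * First half of today's _bt_steppage(): mark killed items and
     * release the pin on a leaf page we're completely done with. Can
     * be called whenever the table AM finishes with the page's TIDs,
     * not necessarily while stepping to the next page.
     */
    static void _bt_finishpage(IndexScanDesc scan, BTScanPos pos);

    /*
     * What remains of _bt_steppage(): figure out which page to visit
     * next, and read it. Can run eagerly, ahead of the heap accesses.
     */
    static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);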

> There are some problems with this scheme, but they seem reasonably
> tractable to me. We already have strategies for dealing with the risk
> of concurrent TID recycling when _bt_killitems is called with some
> maybe-recycled TIDs -- we're already dropping the pin on the leaf page
> early in many cases. I've pointed this out many times already (again,
> see _bt_drop_lock_and_maybe_pin).
> 
> It's true that we're still going to have to hold onto a buffer pin on
> leaf pages whose TIDs haven't all been read from the table AM side
> yet, unless we know that it's a case where that's safe for other
> reasons -- otherwise index-only scans might give wrong answers. But
> that other problem shouldn't be confused with the _bt_killitems
> problem, just because of the superficial similarity around holding
> onto a leaf page pin.
> 
> To repeat: it is important that you not conflate the problems on the
> table AM side (TID recycle safety for index scans) with the problems
> on the index AM side (safely setting LP_DEAD bits in _bt_killitems).
> They're two separate problems that are currently dealt with as one
> problem on the nbtree side -- but that isn't fundamental. Teasing them
> apart seems likely to be helpful here.
> 

Hmm. I've intentionally tried to ignore these issues, or rather to limit
the scope of the patch so that v1 does not require dealing with them.
Hence the restriction to single-leaf batches, for example.

But I guess I may have to look at this after all ... not great.

>> I speculated that with the batching concept it might work better, and I
>> think that turned out to be the case. The batching is still the core
>> idea, giving the index AM enough control to make kill tuples work (by
>> not generating batches spanning multiple leaf pages, or doing something
>> smarter). And the read stream leverages that too - the next_block
>> callback returns items from the current batch, and the stream is reset
>> between batches. This is the same prefetch restriction as with the
>> explicit prefetching (done using posix_fadvise), except that the
>> prefetching is done by the read stream.
> 
> ISTM that the central feature of the new API should be the ability to
> reorder certain kinds of work. There will have to be certain
> constraints, of course. Sometimes these will principally be problems
> for the table AM (e.g., we mustn't allow concurrent TID recycling
> unless it's for a plain index scan using an MVCC snapshot), other
> times they're principally problems for the index AM (e.g., the
> _bt_killitems safety issues).
> 

Not sure. By "new API" do you mean the read stream API, or the index AM
API that allows batching?

> I get that you're not that excited about multi-page batches; it's not
> the priority. Fair enough. I just think that the API needs to work in
> terms of batches that are sized as one or more pages, in order for it
> to make sense.
> 

True, but isn't that already the case? I mean, what exactly prevents an
index AM from "building" a batch spanning multiple leaf pages? The
current patch does not implement that for any of the AMs, true, but
isn't it already possible if the AM chooses to?

If you were to design the index AM API to support this (instead of
adding the amgetbatch callback etc.), how would it look?

In one of the previous patch versions I tried to rely on amgettuple().
It read a bunch of TIDs ahead, depending on the prefetch distance. Those
TIDs were prefetched / passed to the read stream and stashed in a queue
(in IndexScanDesc). And then indexam.c would take the TIDs from the
queue and pass them to the index scan etc.

Unfortunately, that didn't work because of kill tuples etc.: the index
AM had no idea about the indexam.c queue and has its own concept of a
"current item", so it was confused about which item to mark as killed.
And that old item might even be from an earlier leaf page (not the
"current" currPos).

I was thinking maybe the AM could keep the leaf pages, and then free
them once they're no longer needed. But it wasn't clear to me how to
exchange this information between indexam.c and the index AM, because
right now the AM only knows about a single (current) position.


But imagine we have this:

a) A way to switch the scan into "batch" mode, where the AM keeps the
leaf page (and a way for the AM to indicate it supports this).

b) Some way to track two "positions" in the scan - one for read, one for
prefetch. I'm not sure if this would be internal in each index AM, or at
the indexam.c level.

c) A way to get the index tuple for either of the two positions (and
advance the position). It might be a flag for amgettuple(), or maybe
even a callback for the "prefetch" position.

d) A way to inform the AM that items up to some position are no longer
needed, and thus the leaf pages can be cleaned up and freed. AFAICS it
could always be "up to the current read position".

Does that sound reasonable / better than the current approach, or have I
finally reached the "raving lunatic" stage?


> BTW, the README changes you made are slightly wrong about pins and
> locks. We don't actually keep around C pointers to IndexTuples for
> index-only scans that point into shared memory -- that won't work. We
> simply copy whatever IndexTuples the scan returns into local state,
> associated with so->currPos. So that isn't a complicating factor, at
> all.
> 

Ah, OK. Thanks for the correction.


> That's all I have right now. Hope it helps.
> 

Yes, very interesting insights. Thanks!


regards

-- 
Tomas Vondra



