Re: index prefetching - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: index prefetching
Msg-id CAH2-WznkG-v3ioi8mjc4BERi75GL=6GvbP-etPTU4AT=uKMxdw@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote:
> Attached is an updated version of this patch series. The first couple
> parts (adding batching + updating built-in index AMs) remain the same,
> the new part is 0007 which switches index scans to read stream API.

The first thing that I notice about this patch series is that it
doesn't fully remove amgettuple as a concept. That seems a bit odd to
me. After all, you've invented a single-page batching mechanism, which
duplicates the single-page batching mechanism that each affected index
AM already has to use, just to allow the amgettuple interface to
iterate backwards and forwards with a scrollable cursor (and to make
mark/restore work). ISTM that you have one too many batching
interfaces here.

I can think of nothing that makes the task of completely replacing
amgettuple particularly difficult. I don't think that the need to do
the _bt_killitems stuff actually makes this task all that much harder.
It will need to be generalized, too, by keeping track of multiple
BTScanOpaqueData.killedItems[] style states, each of which is
associated with its own page-level currPos state. But that's not
rocket science. (Also, I don't think that mark/restore support is all
that hard.)
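
To make that more concrete, here's the rough shape of the state I have
in mind. Every name below is invented purely for illustration; nothing
like this exists in nbtree or in your patch, and the details obviously
aren't what matters:

#include "postgres.h"
#include "access/nbtree.h"

#define MAX_INFLIGHT_PAGES 4        /* invented constant, purely illustrative */

/* Per-page state: a page-level position plus its own pending kill items */
typedef struct BTBatchPageState
{
    BTScanPosData currPos;          /* like BTScanOpaqueData.currPos today */
    int          *killedItems;      /* indexes into currPos.items[] */
    int           numKilled;        /* pending kill items for this page */
} BTBatchPageState;

/* Scan-level state: one entry per leaf page whose TIDs are still in flight */
typedef struct BTBatchScanState
{
    BTBatchPageState pages[MAX_INFLIGHT_PAGES];
    int              npages;
} BTBatchScanState;

The point is just that the killedItems state stops being a singleton;
the _bt_killitems-style processing can then happen for each page at
whatever point we're actually done with that page.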

The current way in which _bt_kill_batch() is called from
_bt_steppage() by the patch seems weird to me. You're copying what you
actually know to be the current page's kill items such that
_bt_steppage() will magically do what it does already when the
amgettuple/btgettuple interface is in use, just as we're stepping off
the page. It seems to be working at the wrong level.

Notice that the current way of doing things in your patch means that
your new batching interface tacitly knows about the nbtree batching
interface, and that it too works along page boundaries -- that's the
only reason why it can hook into _bt_steppage like this in the first
place. Things are way too tightly coupled, and the old and new way of
doing things are hopelessly intertwined. What's being abstracted away
here, really?

I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at
all -- nor should it even call _bt_killitems(). Things need to be
broken down into smaller units of work that can be reordered, instead.

The first half of the current _bt_steppage() function, the part that
deals with finishing off the current leaf page, should be moved to some
other function -- let's call it _bt_finishpage. A new callback should be
called as part of the new API when the time comes to tell nbtree that
we're now done with a given leaf page -- that's what this new
_bt_finishpage function is for. All that remains of _bt_steppage() are
the parts that deal with figuring out which page should be visited
next -- the second half of _bt_steppage stays put.
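
In other words, something like the following split. The signatures are
invented and the bodies are elided -- this is only meant to show where
the boundary falls, not to be working code:

#include "postgres.h"
#include "access/nbtree.h"
#include "access/relscan.h"
#include "access/sdir.h"

/*
 * First half of today's _bt_steppage(): finish off one leaf page.
 * Called via a new callback when the caller says it's done with the
 * page: set any LP_DEAD hints collected for it, then release whatever
 * pin/lock is still held on it.
 */
static void
_bt_finishpage(IndexScanDesc scan, BTScanPos pos)
{
    /* body elided -- this only sketches the division of labor */
}

/*
 * Second half of today's _bt_steppage(): decide which leaf page to
 * visit next (sibling link, or saved mark), without any per-page
 * cleanup work.
 */
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    return false;               /* placeholder */
}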

That way, stepping to the next page and reading multiple pages can be
executed as eagerly as makes sense -- we don't need to "coordinate"
the heap accesses in lockstep with the leaf page accesses. Maybe you
won't take advantage of this flexibility right away, but ISTM that you
need nominal support for this kind of reordering to make the new API
really make sense.

There are some problems with this scheme, but they seem reasonably
tractable to me. We already have strategies for dealing with the risk
of concurrent TID recycling when _bt_killitems is called with some
maybe-recycled TIDs -- we're already dropping the pin on the leaf page
early in many cases. I've pointed this out many times already (again,
see _bt_drop_lock_and_maybe_pin).
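
For reference, _bt_drop_lock_and_maybe_pin does roughly the following.
I'm paraphrasing from memory here, so treat the real code as
authoritative:

#include "postgres.h"
#include "access/nbtree.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

static void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
    /* Always give up the lock on the leaf page ... */
    _bt_unlockbuf(scan->indexRelation, sp->buf);

    /* ... and give up the pin too, when that's known to be safe */
    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation) &&
        !scan->xs_want_itup)
    {
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }
}

The interesting part is the condition: a plain index scan using an MVCC
snapshot doesn't need to keep the pin around to defend against
concurrent TID recycling, so we don't.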

It's true that we're still going to have to hold onto a buffer pin on
leaf pages whose TIDs haven't all been read from the table AM side
yet, unless we know that it's a case where that's safe for other
reasons -- otherwise index-only scans might give wrong answers. But
that other problem shouldn't be confused with the _bt_killitems
problem, just because of the superficial similarity around holding
onto a leaf page pin.

To repeat: it is important that you not conflate the problems on the
table AM side (TID recycle safety for index scans) with the problems
on the index AM side (safely setting LP_DEAD bits in _bt_killitems).
They're two separate problems that are currently dealt with as one
problem on the nbtree side -- but that isn't fundamental. Teasing them
apart seems likely to be helpful here.

> I speculated that with the batching concept it might work better, and I
> think that turned out to be the case. The batching is still the core
> idea, giving the index AM enough control to make kill tuples work (by
> not generating batches spanning multiple leaf pages, or doing something
> smarter). And the read stream leverages that too - the next_block
> callback returns items from the current batch, and the stream is reset
> between batches. This is the same prefetch restriction as with the
> explicit prefetching (done using posix_fadvise), except that the
> prefetching is done by the read stream.

ISTM that the central feature of the new API should be the ability to
reorder certain kinds of work. There will have to be certain
constraints, of course. Sometimes these will principally be problems
for the table AM (e.g., we mustn't allow concurrent TID recycling
unless it's for a plain index scan using an MVCC snapshot), other
times they're principally problems for the index AM (e.g., the
_bt_killitems safety issues).

I get that you're not that excited about multi-page batches; it's not
the priority. Fair enough. I just think that the API needs to work in
terms of batches that are sized as one or more pages, in order for it
to make sense.

BTW, the README changes you made are slightly wrong about pins and
locks. We don't actually keep around C pointers to IndexTuples for
index-only scans that point into shared memory -- that won't work. We
simply copy whatever IndexTuples the scan returns into local state,
associated with so->currPos. So that isn't a complicating factor, at
all.
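
What actually happens is closer to this. I'm paraphrasing
_bt_saveitem() from memory, so don't take the details on faith:

#include "postgres.h"
#include "access/itup.h"
#include "access/nbtree.h"

/*
 * Paraphrase of how nbtree saves an item when the scan wants index
 * tuples (xs_want_itup): the tuple bytes are memcpy'd out of the shared
 * buffer into a scan-local workspace (so->currTuples), and the saved
 * item just records an offset into that workspace.  Nothing points into
 * shared memory once we've stepped off the page.
 */
static void
save_item_copy(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
               IndexTuple itup)
{
    BTScanPosItem *currItem = &so->currPos.items[itemIndex];

    currItem->heapTid = itup->t_tid;
    currItem->indexOffset = offnum;
    if (so->currTuples)
    {
        Size        itupsz = IndexTupleSize(itup);

        currItem->tupleOffset = so->currPos.nextTupleOffset;
        memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
        so->currPos.nextTupleOffset += MAXALIGN(itupsz);
    }
}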

That's all I have right now. Hope it helps.

--
Peter Geoghegan


