Re: index prefetching - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: index prefetching
Msg-id CAH2-WznBuxhvsEgX3mYDjxKhQk9GFdF46vMfE2ugU6SUekHp_A@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > Another thing that argues against doing this is that we might not need
> > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > We could end up scanning a whole extra leaf page (including all of its
> > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > (that's not what happens right now, but it isn't really needed at all
> > right now).
> >
>
> I'm not quite sure I understand what is "this" that you argue against.
> Are you saying we should not separate the two scans? If yes, is there a
> better way to do this?

What I'm concerned about is the difficulty and complexity of any
design that requires revising "63.4. Index Locking Considerations",
since that's pretty subtle stuff. In particular, if prefetching
"de-synchronizes" (to use your term) the index leaf page level scan
and the heap page scan, then we'll probably have to totally revise the
basic API.

Maybe that'll actually turn out to be the right thing to do -- it
could just be the only thing that can unleash the full potential of
prefetching. But I'm not aware of any evidence that points in that
direction. Are you? (I might have just missed it.)

> The LIMIT problem is not very clear to me either. Yes, if we get close
> to the end of the leaf page, we may need to visit the next leaf page.
> But that's kinda the whole point of prefetching - reading stuff ahead,
> and reading too far ahead is an inherent risk. Isn't that a problem we
> have even without LIMIT? The prefetch distance ramp up is meant to limit
> the impact.

Right now, the index AM doesn't know anything about LIMIT at all. That
doesn't matter, since the index AM can only read/scan one full leaf
page before returning control back to the executor proper. The
executor proper can just shut down the whole index scan upon finding
that we've already returned N tuples for a LIMIT N.
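To make the overshoot risk concrete, here is a toy model (not PostgreSQL code; all names are made up): with page-at-a-time scanning, a LIMIT n scan reads only the leaf pages it actually needs before the executor shuts the scan down, while a prefetcher that stays a fixed number of pages ahead also reads pages whose tuples may never be returned.

```c
#include <assert.h>

/*
 * Toy model, not PostgreSQL code: assume each leaf page holds
 * tuples_per_page matching tuples.  With page-at-a-time scanning, a
 * LIMIT n scan touches ceil(n / tuples_per_page) leaf pages; a naive
 * prefetcher that stays prefetch_distance pages ahead of the page
 * currently being returned from also reads pages it may never need.
 */
static int
leaf_pages_read(int limit_n, int tuples_per_page, int prefetch_distance)
{
    int pages_needed = (limit_n + tuples_per_page - 1) / tuples_per_page;

    return pages_needed + prefetch_distance;
}
```

For a LIMIT 10 scan over pages holding 100 matching tuples each, page-at-a-time reads one leaf page, while a distance-4 prefetcher reads five -- four of them wasted if the scan stops there.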

We don't do prefetching right now, but we also don't risk reading a
leaf page that'll just never be needed. Those two things are in
tension, but I don't think that that's quite the same thing as the
usual standard prefetching tension/problem. Here there is uncertainty
about whether what we're prefetching will *ever* be required -- not
uncertainty about when exactly it'll be required. (Perhaps this
distinction doesn't mean much to you. I'm just telling you how I think
about it, in case it helps move the discussion forward.)

> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrently TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
> >
>
> Good point.

The main reason why the index AM docs require this interlock is
because we need such an interlock to make non-MVCC snapshot scans
safe. If you remove the interlock (the buffer pin interlock that
protects against TID recycling by VACUUM), you can still avoid the
same race condition by using an MVCC snapshot. This is why using an
MVCC snapshot is a requirement for bitmap index scans. I believe that
it's also a requirement for index-only scans, but the index AM docs
don't spell that out.
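A toy model of why the MVCC snapshot makes the no-pin case safe (entirely hypothetical names; real visibility rules are far more involved): if VACUUM recycles the heap slot a TID points at, the tuple now in that slot was necessarily inserted after our snapshot was taken, so the visibility test rejects it -- we skip the slot rather than return the wrong row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model, not PostgreSQL code.  A tuple is visible to our snapshot
 * only if its inserting transaction committed before the snapshot was
 * taken.  A tuple placed in a recycled slot by a later transaction
 * therefore fails the check, which is what makes dropping the pin safe
 * under an MVCC snapshot.
 */
typedef struct
{
    uint32_t xmin;              /* inserting transaction's XID */
} ToyTuple;

static bool
toy_visible(uint32_t snapshot_xmax, const ToyTuple *tup)
{
    /* visible only if inserted before the snapshot was taken */
    return tup->xmin < snapshot_xmax;
}
```

With a non-MVCC snapshot there is no such backstop, which is why those scans must rely on the buffer pin interlock instead.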

Another factor that complicates things here is mark/restore
processing. The design for that has the idea of processing one page at
a time baked-in. Kinda like with the kill_prior_tuple issue.

It's certainly possible that you could figure out various workarounds
for each of these issues (plus the kill_prior_tuple issue) with a
prefetching design that "de-synchronizes" the index access and the
heap access. But it might well be better to extend the existing design
in a way that just avoids all these problems in the first place. Maybe
"de-synchronization" really can pay for itself (because the benefits
will outweigh these costs), but if you go that way then I'd really
prefer it that way.

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the LIMIT issue I described).
> >
>
> In control how? What would be the control flow - what part would be
> managed by the index AM?

ISTM that prefetching for an index scan is about the index scan
itself, first and foremost. The heap accesses are usually the dominant
cost, of course, but sometimes the index leaf page accesses really do
make up a significant fraction of the overall cost of the index scan.
Especially with an expensive index qual. So if you just assume that
the TIDs returned by the index scan are the only thing that matters,
you might have a model that's basically correct on average, but is
occasionally very wrong. That's one reason for "putting the index AM
in control".

As I said back in June, we should probably be marrying information
from the index scan with information from the heap. This is something
that is arguably a modularity violation. But it might just be that you
really do need to take information from both places to consistently
make the right trade-off.

Perhaps the best arguments for "putting the index AM in control" only
work when you go to fix the problems that "naive de-synchronization"
creates. Thinking about that side of things some more might make
"putting the index AM in control" seem more natural.

Suppose, for example, you try to make a prefetching design based on
"de-synchronization" work with kill_prior_tuple -- suppose you try to
fix that problem. You're likely going to need to make some kind of
trade-off that gets you most of the advantages that that approach
offers (assuming that there really are significant advantages), while
still retaining most of the advantages that we already get from
kill_prior_tuple (basically we want to LP_DEAD-mark index tuples with
almost or exactly the same consistency as we manage today). Maybe your
approach involves tracking multiple LSNs for each prefetch-pending
leaf page, or perhaps you hold on to a pin on some number of leaf
pages instead (right now nbtree does both [1], which I go into more
below). Either way, you're pushing stuff down into the index AM.

Note that we already hang onto more than one pin at a time in rare
cases involving mark/restore processing. For example, it can happen
for a merge join that happens to involve an unlogged index, if the
markpos and curpos are a certain way relative to the current leaf page
(yeah, really). So putting stuff like that under the control of the
index AM (while also applying basic information that comes from the
heap) in order to fix the kill_prior_tuple issue is arguably something
that has a kind of a precedent for us to follow.

Even if you disagree with me here ("precedent" might be overstating
it), perhaps you still get some general sense of why I have an inkling
that putting prefetching in the index AM is the way to go. It's very
hard to provide one really strong justification for all this, and I'm
certainly not expecting you to just agree with me right away. I'm also
not trying to impose any conditions on committing this patch.

Thinking about this some more, "making kill_prior_tuple work with
de-synchronization" is a bit of a misleading way of putting it. The
way that you'd actually work around this is (at a very high level)
*dynamically* making some kind of *trade-off* between synchronization
and desynchronization. Up until now, we've been talking in terms of a
strict dichotomy between the old index AM API design
(index-page-at-a-time synchronization) and a "de-synchronizing"
prefetching design that embraces the opposite extreme -- a design where
we only think in terms of heap TIDs and completely ignore anything that
happens in the index structure (which consequently makes
kill_prior_tuple ineffective). That now seems like a false dichotomy.

> I initially did the prefetching entirely in each index AM, but it was
> suggested doing this in the executor would be better. So I gradually
> moved it to executor. But the idea to combine this with the streaming
> read API seems like a move from the executor back to the lower levels ... and
> now you're suggesting to make the index AM responsible for this again.

I did predict that there'd be lots of difficulties around the layering
back in June.   :-)

> I'm not saying any of those layering options is wrong, but it's not
> clear to me which is the right one.

I don't claim to know what the right trade-off is myself. The fact
that all of these things are in tension doesn't surprise me. It's just
a hard problem.

> Possible. But AFAIK it did fail for Melanie, and I don't have a very
> good explanation for the difference in behavior.

If you take a look at _bt_killitems(), you'll see that it actually has
two fairly different strategies for avoiding TID recycling race
condition issues, applied in each of two different cases:

1. Cases where we really have held onto a buffer pin, per the index AM
API -- the "inde AM orthodox" approach.  (The aforementioned issue
with unlogged indexes exists because with an unlogged index we must
use approach 1, per the nbtree README section [1]).

2. Cases where we drop the pin as an optimization (also per [1]), and
now have to detect the possibility of concurrent modifications by
VACUUM (that could have led to concurrent TID recycling). We
conservatively do nothing (don't mark any index tuples LP_DEAD),
unless the LSN is exactly the same as it was back when the page was
scanned/read by _bt_readpage().
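The two cases can be summarized in a sketch (made-up names; this is not the actual _bt_killitems() code): LP_DEAD marking is safe if we held the pin throughout, or, having dropped it, if the page LSN is exactly what it was when we read the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLSN;        /* stand-in for PostgreSQL's XLogRecPtr */

/*
 * Sketch of the two strategies described above, with hypothetical
 * names (not the actual nbtree code).  Marking index tuples LP_DEAD is
 * safe if (1) we held a buffer pin the whole time, blocking concurrent
 * TID recycling by VACUUM, or (2) we dropped the pin but the page LSN
 * is unchanged since the page was read, proving nobody (VACUUM
 * included) modified the page in between.
 */
static bool
safe_to_kill_items(bool pin_held, ToyLSN lsn_at_read, ToyLSN lsn_now)
{
    if (pin_held)
        return true;            /* case 1: orthodox pin interlock */

    return lsn_at_read == lsn_now;  /* case 2: conservative LSN check */
}
```

Note that case 2 depends on the page LSN advancing on modification, which is exactly what an unlogged index doesn't give you -- hence unlogged indexes being forced onto case 1.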

So some accidental detail with LSNs (like using or not using an
unlogged index) could cause bugs in this area to "accidentally fail to
fail". The nbtree index AM has its own optimizations here, which
probably have a tendency to mask problems/bugs. (I sometimes use
unlogged indexes for some of my nbtree related test cases, just to
reduce certain kinds of variability, including variability in this
area.)

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=52e646c7f759a5d9cfdc32b86f6aff8460891e12;hb=3e8235ba4f9cc3375b061fb5d3f3575434539b5f#l443
--
Peter Geoghegan


