Re: index prefetching - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-Wzkrej9cXjERrA5p8pgD9QfR0LZwCCcgPPu6wiRgFpYVQQ@mail.gmail.com
In response to: Re: index prefetching (Melanie Plageman <melanieplageman@gmail.com>)
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> So, a pin on the index leaf page is sufficient to keep line pointers
> from being reused? If we stick to prefetching heap blocks referred to
> by index tuples in a single index leaf page, and we keep that page
> pinned, will we still have a problem?

That's certainly one way of dealing with it. Obviously, there are
questions about how you do that in a way that consistently avoids
creating new problems.

> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice to come down from the
> planner to prevent over-eager prefetching in all cases.

I think that I'd rather use information at execution time instead, if
at all possible (perhaps in addition to a hint given by the planner).
But it seems a bit premature to discuss this problem now, except to
say that it might indeed be a problem.

> > It's certainly possible that you could figure out various workarounds
> > for each of these issues (plus the kill_prior_tuple issue) with a
> > prefetching design that "de-synchronizes" the index access and the
> > heap access. But it might well be better to extend the existing design
> > in a way that just avoids all these problems in the first place. Maybe
> > "de-synchronization" really can pay for itself (because the benefits
> > will outweigh these costs), but if you go that way then I'd really
> > like to see evidence that they do.
>
> Forcing each index access to be synchronous and interleaved with each
> table access seems like an unprincipled design constraint. While it is
> true that we rely on that in our current implementation (when using
> non-MVCC snapshots), it doesn't seem like a principle inherent to
> accessing indexes and tables.

There is nothing sacred about the way plain index scans work right now
-- especially the part about buffer pins as an interlock.

If buffer pins really were sacred, then we could never have allowed
nbtree to selectively opt out in cases where it's possible to provide
an equivalent correctness guarantee without holding onto buffer pins,
which, as I went into, is how it actually works in nbtree's
_bt_killitems() today (see commit 2ed5b87f96 for full details). And so
in principle I have no problem with the idea of revising the basic
definition of plain index scans -- especially if it's to make the
definition more abstract, without fundamentally changing it. For
example, the definition could stop referencing buffer pins (making
life easier for prefetching) while still implying the same underlying
guarantees, sufficient to allow nbtree to mostly work the same way as
it does today.

All I'm really saying is:

1. The sort of tricks that we can do in nbtree's _bt_killitems() are
quite useful, and ought to be preserved in something like their
current form, even when prefetching is in use.

This seems to push things in the direction of centralizing control of
the process in index scan code. For example, the index scan code has
to be able to rely on _bt_killitems() being called at some regular
cadence that is well defined and sensible from an index point of view.

2. Are you sure that the leaf-page-at-a-time thing is such a huge
hindrance to effective prefetching?

I suppose that it might be much more important than I imagine it is
right now, but it'd be nice to have something a bit more concrete to
go on.

3. Even if it is somewhat important, do you really need to get that
part working in v1?

Tomas' original prototype worked with the leaf-page-at-a-time thing,
and that still seemed like a big improvement to me, while also being
less invasive, in effect. If we can agree that something like that
represents a useful step in the right direction (not an evolutionary
dead end), then we can make good incremental progress within a single
release.
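To spell out what I mean: a leaf-page-at-a-time scheme only needs
something like the following, with all the prefetch state private to
the index scan. This is a hypothetical sketch (not Tomas' actual
patch); the function name and the duplicate-block filtering are made
up for illustration:

    /*
     * Hypothetical sketch: once _bt_readpage() has collected the
     * current leaf page's matching TIDs into so->currPos, issue
     * prefetch requests for the heap blocks they reference, before
     * the scan starts returning items one at a time.
     */
    static void
    prefetch_leaf_heap_blocks(IndexScanDesc scan)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;
        BlockNumber lastblkno = InvalidBlockNumber;

        for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
        {
            BlockNumber blkno =
                ItemPointerGetBlockNumber(&so->currPos.items[i].heapTid);

            /* skip runs of duplicate blocks; heap TIDs are often clustered */
            if (blkno != lastblkno)
                PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, blkno);
            lastblkno = blkno;
        }
    }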

> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching; at the end of the
> day, the table AM needs some TIDs or TID-equivalents whose blocks
> it can go fetch.

I wasn't really thinking of index page prefetching at all. Just the
cost of applying index quals to read leaf pages that might never
actually need to be read, due to the presence of a LIMIT. That is kind
of a new problem created by eagerly reading (without actually
prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think that having so many members of IndexScanDescData
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc.) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error-prone.

But that's kinda how amgettuple is supposed to work -- cursors need it
to work that way. Having some kind of general notion of scan order is
also important to avoid returning duplicate TIDs to the scan. In
contrast, GIN heavily relies on the fact that it only supports bitmap
scans -- that allows it to not have to reason about returning
duplicate TIDs (when dealing with a concurrently merged pending list,
and other stuff like that).

And so nbtree (and basically every other index AM that supports plain
index scans) kinda pretends to process a single tuple at a time, in
some fixed order that's convenient for the scan to work with (that's
how the executor thinks of things). In reality these index AMs
actually process batches consisting of a single leaf page's worth of
tuples.
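Paraphrasing the relevant _bt_next() logic (abridged, forward scan
direction only) to show what I mean:

    /*
     * Abridged from _bt_next()/btgettuple: each amgettuple call just
     * advances an index into the batch of matches that _bt_readpage()
     * already collected from the current leaf page.  Only when the
     * batch is exhausted does the scan step right to the next page.
     */
    if (++so->currPos.itemIndex > so->currPos.lastItem)
    {
        if (!_bt_steppage(scan, dir))
            return false;       /* no more matches */
    }

    currItem = &so->currPos.items[so->currPos.itemIndex];
    scan->xs_heaptid = currItem->heapTid;   /* the "one tuple" returned */
    return true;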

I don't see how the IndexScanDescData side of things makes life any
harder for this patch -- ISTM that you'll always need to pretend to
return one tuple at a time from the index scan, regardless of what
happens under the hood, with pins and whatnot. The page-at-a-time
thing is more or less an implementation detail that's private to index
AMs (albeit in a way that follows certain standard conventions across
index AMs) -- it's a leaky abstraction only due to the interactions
with VACUUM/TID recycle safety.
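(For reference, the members in question are just the handful of
"current item" output fields in IndexScanDescData -- abridged from
src/include/access/relscan.h, with most fields omitted:)

    typedef struct IndexScanDescData
    {
        /* ... relations, snapshot, scan keys, etc. omitted ... */
        bool        kill_prior_tuple;   /* last-returned tuple is dead? */
        /* ... */
        ItemPointerData xs_heaptid;     /* TID of current heap tuple */
        IndexTuple  xs_itup;            /* index tuple returned by AM */
        /* ... remaining output and parallel-scan state omitted ... */
    } IndexScanDescData;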

> Re 2: so the LSN could have been changed by some other process (i.e.
> not vacuum), so how often in practice is the LSN actually the same as
> when the page was scanned/read?

It seems very hard to make generalizations about that sort of thing.

It doesn't help that we now have batching logic inside
_bt_simpledel_pass() that will make up for the problem of not setting
as many LP_DEAD bits as we could in many important cases. (I recall
that that was one factor that allowed the bug that Andres fixed in
commit 90c885cd to go undetected for months. I recall discussing the
issue with Andres around that time.)

> Do you think we would catch a
> meaningful number of kill prior tuple opportunities if we used an LSN
> tracking method like this? Something that let us drop the pin on the
> page would obviously be better.

Quite possibly, yes. But it's hard to say for sure without far more
detailed analysis. Plus you have problems with things like unlogged
indexes not having an LSN to use as a canary condition, which makes it
a bit messy (it's already kind of weird that we treat unlogged indexes
differently here IMV).
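(The unlogged-index special case is visible in the test that decides
whether the pin may be dropped at all -- paraphrased from
_bt_drop_lock_and_maybe_pin(), added by commit 2ed5b87f96:)

    /*
     * Paraphrased: the pin is only dropped when the LSN canary can be
     * trusted later on.  Unlogged indexes fail the RelationNeedsWAL()
     * test, so they always keep the pin, and never rely on the LSN.
     */
    _bt_unlockbuf(scan->indexRelation, sp->buf);

    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation) &&
        !scan->xs_want_itup)
    {
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }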

--
Peter Geoghegan


