From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-Wz==zkbOumyX-M4quqbX9GCLcjW_zXmdsaK37q-55rj_fQ@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas@vondra.me> wrote:
> Not sure I understand, but I think I'm somewhat confused by "index AM"
> vs. indexam. Are you suggesting the individual index AMs should know as
> little about the batching as possible, and instead it should be up to
> indexam.c to orchestrate most of the stuff?

Yes, that's what I'm saying. Knowing "as little as possible" turns out
to be pretty close to knowing nothing at all.

There might be some minor exceptions, such as the way that nbtree
needs to remember the scan's array keys. But that already works in a
way that's very insensitive to the exact position in the scan. For
example, right now if you restore a mark that doesn't just come from
the existing so->currPos batch then we cheat and reset the array keys.

> If yes, then I agree in principle, and I think indexam.c is the right
> place to do that (or at least I can't think of a better one).

Good.

> That's what the current patch aimed to do, more or less. I'm not saying
> it got it perfectly right, and I'm sure there is stuff that can be
> improved (like reworking _steppage to not deal with killed tuples). But
> surely the index AMs need to have some knowledge about batching, because
> how else would they know which leaf pages to keep, etc.?

I think that your new thing can directly track which leaf pages have
pins, as well as the order in which it has to return tuples from among
those leaf page batch subsets.

Your new thing can think about this in very general terms that really
aren't tied to any index AM specifics. It'll have some general notion
of an ordered sequence of pages (in scan/key space order), each of
which contains one or more tuples to return. It needs to track which
pages have tuples that we've already done all the required visibility
checks for, in order to be able to instruct the index AM to drop the
pin.
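
To make that a bit more concrete, here's roughly the shape of the
structure I'm imagining. To be clear, this is just a sketch, and every
name in it is invented:

/* Hypothetical indexam.c-level batch state -- all names made up */
typedef struct IndexScanBatchPage
{
    BlockNumber blkno;          /* leaf page this group came from */
    Buffer      buf;            /* pin held, or InvalidBuffer once dropped */
    int         firstItem;      /* this group's range in the items array */
    int         lastItem;
    bool        visChecksDone;  /* all visibility checks performed, so the
                                 * index AM can be told to drop the pin */
} IndexScanBatchPage;

typedef struct IndexScanBatch
{
    /* pages in key space/scan order, each with >= 1 tuple to return */
    IndexScanBatchPage *pages;
    int         npages;

    /* flat array of TIDs to return, also in scan order */
    ItemPointerData *items;
    int         nitems;
    int         currItem;       /* current position within items */
} IndexScanBatch;

Nothing in there knows or cares which index AM the pages came from.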

Suppose, for example, that we're doing an SAOP index scan, where the
leaf pages that our multi-page batch consists of aren't direct
siblings. That literally doesn't matter at all. The pages still have
to be in the same familiar key space/scan order, regardless. And that
factor shouldn't really need to influence how many pins we're willing
to hold on to (no more than it would when there are large numbers of
index leaf pages with no interesting tuples to return that we must
still scan over).

> >> 3) It makes it clear when the items are no longer needed, and the AM can
> >> do cleanup, process killed tuples, etc.
> >
> > But it doesn't, really. The index AM is still subject to exactly the
> > same constraints in terms of page-at-a-time processing. These existing
> > constraints always came from the table AM side, so it's not as if your
> > patch can remain totally neutral on these questions.
> >
>
> Not sure I understand. Which part of my sentence do you disagree with?
> Or what constraints do you mean?

What I was saying here was something I said more clearly a bit further
down: it's technically possible to do multi-page batches within the
confines of the current index AM API, but that's not true in any
practical sense. And it'll never be true with an API that looks very
much like the current amgettuple API.

> The interface does not require page-at-a-time processing - the index AM
> is perfectly within its rights to produce a batch spanning 10 leaf
> pages, as long as it keeps track of them, and perhaps keeps some mapping
> of items (returned in the batch) to leaf pages, so that when the next
> batch is requested, it can do the cleanup and move to the next batch.

How does an index AM actually do that in a way that's useful? It only
sees a small part of the picture. That's why it's the wrong place for
it.

> > Basically, it looks like you've invented a shadow batching interface
> > that is technically not known to the index AM, but nevertheless
> > coordinates with the existing so->currPos batching interface.
> >
>
> Perhaps, but which part of that do you consider a problem? Are you saying
> this shouldn't use the currPos stuff at all, and instead do stuff in
> some other way?

I think that you should generalize the currPos stuff, and move it to
some other, higher level module.

> Does that mean not having a simple amgetbatch() callback, but some finer
> grained interface? Or maybe one callback that returns the next "AM page"
> (essentially the currPos), and then another callback to release it?
>
> (This is what I mean by "two-callback API" later.)

I'm not sure. Why does the index AM need to care about the batch size
at all? It merely needs to read the next leaf page. The high level
understanding of batches and the leaf pages that constitute batches
lives elsewhere.

The nbtree code will know about buffer pins held, in the sense that
it'll be the one setting the Buffer variables in the new scan
descriptor thing. But it's not going to remember to drop those buffer
pins on its own. It'll need to be told. So it's not ever really in
control.
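
In other words, the index AM side might not need to be much wider than
this pair of callbacks (hypothetical names and signatures, reusing the
invented batch structure from before, so don't hold me to the details):

/*
 * Read the next leaf page in the scan, appending its matching items to
 * "batch" as a new page group. Returns false when no leaf pages remain.
 */
typedef bool (*amgetpage_function) (IndexScanDesc scan,
                                    IndexScanBatch *batch,
                                    ScanDirection dir);

/*
 * Called by indexam.c once it is done with one leaf page's worth of
 * items: drop the pin, process killed tuples, and so on. The index AM
 * never decides this for itself -- it waits to be told.
 */
typedef void (*amreleasepage_function) (IndexScanDesc scan,
                                        IndexScanBatchPage *page);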

> > The index AM itself should no longer know about the current next tuple
> > to return, nor about mark/restore. It is no longer directly in control
> > of the scan's progress. It loses all context that survives across API
> > calls.
> >
>
> I'm lost. How could the index AM not know about mark/restore?

Restoring a mark already works by restoring an earlier so->currPos
batch. Actually, more often it works by storing an offset into the
current so->currPos, without actually copying anything into
so->markPos, and without restoring so->markPos into so->currPos.

In short, there is virtually nothing about how mark/restore works that
really needs to live inside nbtree. It's all just restoring an earlier
batch and/or offset into a batch. The only minor caveat is the stuff
about array keys that I went into already -- that isn't quite a piece
of state that lives in so->currPos, but it's a little bit like that.

You can probably poke one or two more minor holes in some of this --
it's not 100% trivial. But it's doable.
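
Spelled out as code (again just a sketch, with an invented scan state
struct on top of the batch structure from before), the whole thing
could be about this small:

/* Hypothetical high-level scan state kept by indexam.c */
typedef struct BatchScanState
{
    IndexScanBatch *currBatch;  /* batch currently being returned */
    IndexScanBatch *markBatch;  /* retained batch holding the mark */
    int         markItem;       /* offset of the marked item */
} BatchScanState;

static void
batch_markpos(BatchScanState *state)
{
    state->markBatch = state->currBatch;
    state->markItem = state->currBatch->currItem;
}

static void
batch_restrpos(BatchScanState *state)
{
    /*
     * Usually the mark falls within the current batch, so restoring is
     * just resetting an offset. When it doesn't, we swap the retained
     * batch back in -- and that's the one case where the index AM must
     * also be told to reset its array keys.
     */
    state->currBatch = state->markBatch;
    state->currBatch->currItem = state->markItem;
}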

> I don't think the batching in various AMs is particularly unique, that's
> true. But my goal was to wrap that in a single amgetbatch callback,
> because that seemed natural, and that moves some of the responsibilities
> to the AM.

Why is it natural? I mean, all of the index AMs that support amgettuple
copied everything from nbtree already, including all of the
kill_prior_tuple stuff. It's already quite generic.

> I still don't quite understand what API you imagine, but if
> we want to make more of this the responsibility of indexam.c, I guess it
> will require multiple smaller callbacks (I'm not opposed to that, but I
> also don't know if that's what you imagine).

I think that you understood me correctly here.

> > Most individual calls to btgettuple just return the next batched-up
> > so->currPos tuple/TID via another call to _bt_next. Things like the
> > _bt_first-new-primitive-scan case don't really add any complexity --
> > the core concept of processing a page at a time still applies. It
> > really is just a simple batching scheme, with a couple of extra fiddly
> > details attached to it -- but nothing too hairy.
> >
>
> True, although the details (how the batches are represented etc.) are
> often quite different, so did you imagine some shared structure to
> represent that, or wrapping that in a new callback?

In what sense are they sometimes different?

In general batches will consist of one or more groups of tuples, each
of which is associated with a particular leaf page (if the scan
returns no tuples for a given scanned leaf page then it won't form a
part of the final batch). You can do amgettuple style scrolling back
and forth with this structure, across page boundaries. Seems pretty
general to me.
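
The scrolling itself is then just moving an offset through the batch's
flat item array, and noticing when we run off either end (a sketch,
reusing the invented structure from before):

/* Advance one item in either direction within the current batch */
static bool
batch_advance(IndexScanBatch *batch, ScanDirection dir)
{
    int     next = batch->currItem +
        (ScanDirectionIsForward(dir) ? 1 : -1);

    if (next < 0 || next >= batch->nitems)
        return false;   /* caller must read (or re-read) a leaf page */

    batch->currItem = next;
    return true;
}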

> Yes, I understand that. Getting the overall design right is my main
> concern, even if some of the advanced stuff is not implemented until
> later. But with the wrong design, that may turn out to be difficult.
>
> That's the feedback I was hoping for when I kept bugging you, and this
> discussion was already very useful in this regard. Thank you for that.

I don't want to insist on doing all this. But it just seems really
weird to have this shadow batching system for the so->currPos batches.

> > The index AM API says that they need to hold on to a buffer pin to
> > avoid confusing scans due to concurrent TID recycling by VACUUM. The
> > index AM API fails to adequately describe what is expected here. And
> > it provides no useful context for larger batching of index pages.
> > nbtree already does its own thing by dropping leaf page pins
> > selectively.
> >
>
> Not sure I understand. I imagined the index AM would just read a
> sequence of leaf pages, keeping all the same pins etc., just like it
> does for the one leaf page it reads right now.

Right. But it wouldn't necessarily drop those leaf page pins right away. It
might try to coalesce together multiple heap page accesses, for index
tuples that happen to span page boundaries (but are part of the same
higher level batch).
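
For example (purely hypothetical, and only about when I/O gets issued,
never about the order tuples get returned in), upcoming TIDs could be
deduplicated by heap block before prefetching:

/* Prefetch heap pages for the next n TIDs, skipping repeated blocks */
static int
batch_prefetch_heap_blocks(Relation heapRel, IndexScanBatch *batch,
                           int start, int n)
{
    BlockNumber lastblk = InvalidBlockNumber;
    int         nprefetched = 0;

    for (int i = start; i < start + n && i < batch->nitems; i++)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&batch->items[i]);

        /* adjacent TIDs on the same heap page need only one prefetch */
        if (blk != lastblk)
        {
            PrefetchBuffer(heapRel, MAIN_FORKNUM, blk);
            nprefetched++;
        }
        lastblk = blk;
    }

    return nprefetched;
}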

> I'm probably too dumb for that, but I still don't quite understand how
> that's different from just reading and processing that sequence of leaf
> pages by amgettuple without batching.

It's not so much different as just more flexible. It's possible that
v1 would effectively do exactly the same thing in practice. It might
only do the fancier things with holding onto leaf pages in a debug
build, to validate the general approach.

> True, although the necessary context could be passed to the index AM in
> some way. That's what happens in the current patch, where indexam.c
> could size the batch just right for a LIMIT clause, before asking the
> index AM to fill it with items.

What difference does it make where it happens? It might make some
difference, but as I keep saying, the important point is that
*somebody* has to know all of these things at the same time.
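
To illustrate what I mean by that (made-up function, and a few more
made-up fields on the invented scan state from before), the one place
that sees the LIMIT, the pin budget, and the prefetch distance all at
once might size things like this:

/* How many more leaf pages is it worth reading ahead right now? */
static int
batch_pages_to_read(BatchScanState *state)
{
    int     target = state->prefetch_target;    /* ramps up over time */

    /* a nearly-satisfied LIMIT makes reading ahead pointless */
    if (state->limit_known &&
        state->ntuples_left <= state->avg_tuples_per_page)
        target = 1;

    /* and never blow through the pin budget */
    return Min(target, state->max_pinned_pages - state->npinned);
}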

> I need to think about this a bit, but I agree passing this information
> to an index AM through the kill_prior_tuple seems weird.

Right. Because it's a tuple-at-a-time interface, which isn't suitable
for the direction you want to take things in.

> Hmmm, OK. If the idea is to just return a leaf page as an array of items
> (in some fancy way) to indexam.c, then it'd be indexam.c's responsibility
> to track what the current position is (or what the multiple positions
> are), I guess.

Right. It would need some basic idea of the laws of physics underlying
the index scan. It would have to sensibly limit the number
of index page buffer pins held at any given time.
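
Say, something along these lines (a sketch; the policy and the
batch_oldest_finished_page helper are both invented):

/* Keep the number of pinned leaf pages within budget */
static void
batch_enforce_pin_limit(BatchScanState *state)
{
    while (state->npinned > state->max_pinned_pages)
    {
        /* oldest page group whose visibility checks are all done */
        IndexScanBatchPage *oldest = batch_oldest_finished_page(state);

        if (oldest == NULL)
            break;              /* everything still in use; stop reading */

        /* tell the index AM to drop the pin and do its cleanup */
        state->amreleasepage(state->scan, oldest);
        state->npinned--;
    }
}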

> > Why does the index AM need to know anything about the fact that the
> > next tuple has been requested? Why can't it just be 100% ignorant of
> > all that? (Perhaps barring a few special cases, such as KNN-GiST
> > scans, which continue to use the legacy amgettuple interface.)
> >
>
> Well, I was thinking about how it works now, for the "current" position.
> And I was thinking about how it would need to change to handle the
> prefetch position too, in the same way ...
>
> But if you're suggesting moving this logic and context up to the
> indexam.c layer, that changes things, of course.

Yes, I am suggesting that.

> Well, if the AM works with "batches of tuples for a leaf page" (through
> the two callbacks to read / release a page), then positions to exact
> items are no longer needed. It just needs to know which pages are still
> needed, etc. Correct?

Right, correct.

> > I don't think that it'll ultimately be all that hard to schedule when
> > and how index pages are read from outside of the index AM in question.
> > In general all relevant index AMs already work in much the same way
> > here. Maybe we can ultimately invent a way for the index AM to
> > influence that scheduling, but that might never be required.
> >
>
> I haven't thought about scheduling at all. Maybe there's something we
> could improve in the future, but I don't see what it would look like,
> and it seems unrelated to this patch.

It's only related to this patch in the sense that we have to imagine
that it'll be worth having in some form in the future.

It might also be a good exercise architecturally. We don't need to do
the same thing in several slightly different ways in each index AM.

--
Peter Geoghegan


