Re: index prefetching - Mailing list pgsql-hackers

From Andres Freund
Subject Re: index prefetching
Msg-id coodqlwtyhydncq54uljnk5rerrehbvhokhlfx54qfmdlb4pts@hi4jfb2l4vbj
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

On 2025-11-10 18:59:07 -0500, Peter Geoghegan wrote:
> Tomas and I had a meeting on Friday to discuss a way forward with this
> project. Progress has stalled, and we feel that now is a good time to
> pivot by refactoring the patch into smaller, independently
> useful/committable pieces.

Makes sense.


> Phased approach
> ===============
>
> As touched upon at the start of this email, under this new phased
> approach to the project, the short term goal is to make heapam avoid
> repeat buffer locks during index scans where that's clearly avoidable.
> Making that much work shares many of the same problems with I/O
> prefetching (particularly the basics of layering/AM revisions), but
> defers dealing with the thorniest issues with pin resource management.
> That's what I'll talk about here --- what we can defer, and what we
> cannot defer.
>
> But first, on a more positive note, I'll talk about the short term
> benefits. My early prototype of the "only lock heap buffer once per
> group of TIDs that point to the same heap page returned from an index
> scan" optimization has been shown to improve throughput for large-ish
> range scans by quite a bit. Variants of pgbench select with queries
> like "SELECT * FROM pg_bench_accounts WHERE aid BETWEEN 1000 AND 1500"
> show improvements in throughput of up to 20% (and show similar
> reductions in query latency). That's a nice win, all on its own.

Another benefit is that it helps even more when there are multiple queries
running concurrently - the high rate of lock/unlock cycles on the buffer rather
badly hurts scalability.

Besides the locking overhead, it turns out that doing visibility checks
one-by-one is a good bit slower than doing them in batches (or for the whole
page). So that's another performance improvement this would enable.
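To make the shape of that concrete, here's a toy sketch (invented names and
structures, not actual PostgreSQL code): checking a batch of tuples under a
single page latch, versus re-latching the page for every tuple. The counter
stands in for the per-tuple buffer lock/unlock cycle; the real visibility
logic is of course far more involved than a single xmin comparison.

```c
#include <assert.h>

/* Toy model of a heap page: per-tuple inserting-xact ids plus a counter
 * for how often we "latched" the page. All names here are invented. */
#define TUPLES_PER_PAGE 8

typedef struct ToyPage
{
    unsigned    xmin[TUPLES_PER_PAGE];  /* inserting xact id per tuple */
    int         npage_latches;          /* stand-in for buffer lock cycles */
} ToyPage;

/* One-by-one: every check pays a latch acquire/release. */
static int
visible_one_by_one(ToyPage *p, unsigned snapshot_xmax,
                   const int *offsets, int n)
{
    int         nvisible = 0;

    for (int i = 0; i < n; i++)
    {
        p->npage_latches++;     /* lock, check one tuple, unlock */
        if (p->xmin[offsets[i]] < snapshot_xmax)
            nvisible++;
    }
    return nvisible;
}

/* Batched: latch the page once, then check every requested tuple. */
static int
visible_batched(ToyPage *p, unsigned snapshot_xmax,
                const int *offsets, int n)
{
    int         nvisible = 0;

    p->npage_latches++;         /* single lock for the whole batch */
    for (int i = 0; i < n; i++)
    {
        if (p->xmin[offsets[i]] < snapshot_xmax)
            nvisible++;
    }
    return nvisible;
}
```

Both paths return the same answer; the batched one simply does n checks per
latch instead of one.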


> Now back to talking about risks. There's still a lot of complexity
> that cannot be deferred with this phased approach. We must still
> switch over index AMs from amgettuple to the new amgetbatch interface.
> And, we need to make the table AM interface used by index scans higher
> level: index_getnext_slot would directly call a new table-AM-wise
> callback, just passing it its own ScanDirection argument directly --
> we wouldn't be passing TIDs to the table AM anymore.

> The new table AM batch interface would work in terms of "give me the
> next tuple in the current scan direction", not in terms of "give me
> this random TID, which you know nothing else about". The table AM
> becomes directly aware of the fact that it is participating in an
> ordered index scan. This design is amenable to allowing the table AM
> to see which accesses will be required in the near future -- that
> requirement is common to both I/O prefetching and this other heap
> buffer lock optimization.

Yes, I think that's clearly required.  I think one nice bonus of such a change
is that it'd resolve one of the biggest existing layering violations around
tableam - namely that nodeIndexonlyscan.c does VM_ALL_VISIBLE() calls, which
it really has no business doing.
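To sketch the interface direction being described (all names invented, not
the actual tableam API): instead of the executor handing TIDs down one at a
time, the table AM exposes a "next tuple in this scan direction" callback and
walks the index's TID stream itself - which is also what would let the
visibility-map consultation move below the AM boundary.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented names throughout - a shape sketch, not a proposed signature. */
typedef enum ToyScanDirection
{
    ToyBackward = -1,
    ToyForward = 1
} ToyScanDirection;

typedef struct ToyScan ToyScan;

typedef struct ToyTableAm
{
    /* "give me the next tuple in the current scan direction";
     * returns false when the scan is exhausted in that direction */
    bool        (*getnext) (ToyScan *scan, ToyScanDirection dir, int *value_out);
} ToyTableAm;

struct ToyScan
{
    const ToyTableAm *am;
    const int  *index_tids;     /* stand-in for TIDs from the index batch */
    int         ntids;
    int         pos;            /* current position, -1 before the first */
};

/* "heapam" side: the AM, not the executor, advances through the TIDs. */
static bool
toy_heap_getnext(ToyScan *scan, ToyScanDirection dir, int *value_out)
{
    int         next = scan->pos + (int) dir;

    if (next < 0 || next >= scan->ntids)
        return false;
    scan->pos = next;
    *value_out = scan->index_tids[next];
    return true;
}

static const ToyTableAm toy_heap_am = {toy_heap_getnext};
```

The executor-side loop then becomes direction-passing only, with no TID
knowledge at all - which is the point.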


> It's even more complicated than just those changes to the index AM and
> table AM interfaces: we'll also require that the table AM directly
> interfaces with another layer that manages leaf-page-wise batches on
> its behalf. They need to *cooperate* with each other, to a certain
> degree. The executor proper won't call amgetbatch directly under this
> scheme (it'd just provide a library of routines that help table AMs to
> do so on their own).

> That much doesn't seem deferrable. And it's hard. So this phased
> approach certainly doesn't eliminate project risk, by any stretch of
> the imagination. Offhand, I'd estimate that taking this phased
> approach cuts the number of blockers to making an initial commit in
> half.

I wonder if we could actually do part of the redesign in an even more
piecemeal fashion:

1) Move the responsibility for getting the next tid from the index into
   tableam, but do so by basically using index_getnext_tid().

2) Have the new interface get a single batch of tuples from the index, instead
   of doing it on a single tid-by-tid basis.

3) Have heapam not acquire the page lock for each tuple, but do so for all the
   tuples on the same page.

4) Add awareness of multiple batches.

5) Use read stream.
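For step 3, the grouping itself is cheap, since the index hands back TIDs in
order anyway. A toy sketch (invented names, not heapam code): take the buffer
lock once per run of consecutive TIDs sharing a heap block, rather than once
per TID.

```c
#include <assert.h>

/* Invented stand-in for ItemPointerData: just block + offset. */
typedef struct ToyTid
{
    int         block;
    int         offset;
} ToyTid;

/*
 * Given a batch of TIDs in index order, count how many buffer lock
 * acquisitions step 3 would need: one per run of consecutive TIDs that
 * share a heap block, instead of one per TID.
 */
static int
count_group_locks(const ToyTid *tids, int n)
{
    int         locks = 0;

    for (int i = 0; i < n; i++)
    {
        if (i == 0 || tids[i].block != tids[i - 1].block)
            locks++;            /* new heap page: lock it once for the run */
    }
    return locks;
}
```

For a range scan whose TIDs cluster on few heap pages, this is where the
lock/unlock reduction comes from; in the worst (perfectly uncorrelated) case
it degenerates back to one lock per TID.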


Greetings,

Andres Freund


