Re: index prefetching - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-WzmDWjbVAD26HTQNjyY5Vxj74fAZMcDNNt8KJaycpr7dAA@mail.gmail.com
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Fri, Nov 21, 2025 at 5:38 PM Andres Freund <andres@anarazel.de> wrote:
> Another benefit is that it helps even more when there are multiple queries
> running concurrently - the high rate of lock/unlock on the buffer rather
> badly hurts scalability.

I haven't noticed that effect myself. In fact, it seemed to be the
other way around; it looked like it helped most with very low client
count workloads.

It's possible that this had something to do with my hacky approach to
validating the general idea of optimizing heapam buffer
locking/avoiding repeated locking. This was a very rough prototype.

> Besides the locking overhead, it turns out that doing visibility checks
> one-by-one is a good bit slower than doing so in batches (or for the whole
> page). So that's another perf improvement this would enable.

Isn't that just what you automatically get by only locking once per
contiguous group of TIDs that all point to the same heap page?

Or did you mean that we could do the visibility checks in a separate
pass, or something like that? I know that we do something like that
when pruning these days, but that seems quite different.

> Yes, I think that's clearly required.  I think one nice bonus of such a change
> is that it'd resolve one of the biggest existing layering violations around
> tableam - namely that nodeIndexonlyscan.c does VM_ALL_VISIBLE() calls, which
> it really has no business doing.

Right. One relevant artefact of that layering violation is the way
that it forces index I/O prefetching (as implemented in the current
draft patch) to cache visibility lookup info. But with an I/O
prefetching design that puts exactly one place (namely the new table
AM index scan implementation) in charge of everything, that is no
longer necessary.

> I wonder if we could actually do part of the redesign in an even more
> piecemeal fashion:
>
> 1) Move the responsibility for getting the next tid from the index into
>    tableam, but do so by basically using index_getnext_tid().

I would prefer it if the new table AM interface was able to totally
replace the existing one, for all types of index scans that currently
use amgettuple. Individual table AMs would generally be expected to
fully move over to the new interface in one go.

That means that we'll need to have index_getnext_tid() support built
into the heapam implementation of said new interface anyway. We'll need
it so that it is compatible with index AMs that still use amgettuple
(i.e. that haven't switched over to amgetbatch). Switching over to the
amgetbatch interface isn't going to happen with every index AM in a
single release -- that definitely isn't practical.

Anyway, I don't see that much point in doing just step 1 in a single
release. If we don't use amgetbatch in some fashion, then we risk
committing something that solves the wrong problem.

> 2) Have the new interface get a single batch of tuples from the index, instead
>    of doing it on a single tid-by-tid basis.

That was already what I had in mind for this new plan/direction. There
isn't any point in having more than 1 index-AM-wise batch when we just
need index AM batching to implement the heapam buffer locking
optimization. (Actually, we'll need up to 2 such batches to handle
things like mark + restore, much like the way that nbtree uses a
separate CurrPos and markPos today.)
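To make the two-batch arrangement concrete, here's a minimal standalone
sketch loosely modeled on nbtree's pair of positions (every struct and
function name here is hypothetical, not the actual patch's API): mark
saves a copy of the current batch, restore copies it back.

```c
#include <assert.h>

/* Hypothetical fixed-size batch of TIDs from one amgetbatch call */
#define BATCH_CAPACITY 64

typedef struct IndexBatch {
    int  nitems;            /* TIDs in this batch */
    int  itemIndex;         /* next TID to return to the caller */
    long tids[BATCH_CAPACITY];
} IndexBatch;

/*
 * Scan state holding at most two batches: the current one, plus a
 * saved copy for mark/restore -- analogous to the separate CurrPos
 * and markPos that nbtree keeps today.
 */
typedef struct BatchScanState {
    IndexBatch currBatch;
    IndexBatch markBatch;
    int        haveMark;
} BatchScanState;

static void
batch_mark(BatchScanState *scan)
{
    scan->markBatch = scan->currBatch;  /* struct copy saves position */
    scan->haveMark = 1;
}

static void
batch_restore(BatchScanState *scan)
{
    assert(scan->haveMark);
    scan->currBatch = scan->markBatch;  /* rewind to the marked spot */
}
```

The point of the sketch is just that mark/restore bounds the number of
live batches at two, regardless of how far the scan advances after the
mark is taken.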

> 3) Have heapam not acquire the page lock for each tuple, but do so for all the
>    tuples on the same page.

I think that this is very easy compared to 1 and 2. It doesn't really
seem like it makes sense as a separate step/item?

Unlike with I/O prefetching, there'll be nothing speculative about the
way that the heap buffer lock optimization needs to schedule work done
by index AMs. We'll do it only when we can readily see that there's a
group of contiguous TIDs to be returned by the scan that all point to
the same heap page. There's no eager work done by index AMs compared
to what amgettuple does today -- we're just testing if the very next
TID has a matching block number, and then including it when it does.
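As a standalone illustration of that lookahead (hypothetical types and
names, not heapam's actual interface): the scan extends the current
group only while the next TID's block number matches, so each heap
buffer is locked exactly once per contiguous run of same-page TIDs, and
the per-tuple work happens under that single lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical TID: heap block number plus line pointer offset */
typedef struct Tid {
    unsigned blkno;
    unsigned offnum;
} Tid;

/*
 * Process TIDs in their arrival order, grouped into contiguous runs
 * that share a heap block.  Returns the number of "lock" acquisitions,
 * which is one per run rather than one per tuple.
 */
static int
process_tid_groups(const Tid *tids, size_t ntids)
{
    int    nlocks = 0;
    size_t i = 0;

    while (i < ntids)
    {
        unsigned blkno = tids[i].blkno;

        nlocks++;               /* lock the heap buffer once per run */
        do {
            /* visibility check / tuple fetch for tids[i] would go
             * here, done under the single buffer lock */
            i++;
        } while (i < ntids && tids[i].blkno == blkno);
        /* unlock the buffer here */
    }
    return nlocks;
}
```

Note that a block number that reappears after an intervening page
starts a fresh run; only adjacency matters, so no reordering or eager
index work is needed.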

> 4) Add awareness of multiple batches
>
> 5) Use read stream

I think that it makes sense to do these 2 together. But if we were
going to break them up, my guess is that it'd make the most sense to
start with the read stream work, and only then add support for reading
multiple index-AM-wise batches at a time.

I think that it's essential that the design of amgetbatch be able to
accommodate reading leaf pages that are ahead of the current leaf page,
to maintain heap I/O prefetch distance with certain workloads. But I
don't think it has to do it in the first committed version.
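For illustration, a toy version of what reading ahead buys (everything
here is hypothetical, not the read stream or amgetbatch API): when the
consumer asks for the next heap block to prefetch and the current
leaf-page batch is exhausted, the stream reads the next leaf batch, so
the achievable prefetch distance isn't capped by the number of matching
TIDs on any one leaf page.

```c
#include <assert.h>

enum { LEAF_BATCH_SIZE = 4 };   /* toy "TIDs per leaf page" */

/* Hypothetical stream over successive leaf-page batches */
typedef struct BlockStream {
    unsigned blknos[LEAF_BATCH_SIZE];
    int      nitems;      /* valid entries in blknos */
    int      pos;         /* next unread entry */
    unsigned nextLeaf;    /* stand-in for "next leaf page to read" */
} BlockStream;

/* Pretend to read the next leaf page, refilling the batch */
static void
stream_read_leaf(BlockStream *s)
{
    for (int i = 0; i < LEAF_BATCH_SIZE; i++)
        s->blknos[i] = s->nextLeaf * 100 + (unsigned) i;
    s->nitems = LEAF_BATCH_SIZE;
    s->pos = 0;
    s->nextLeaf++;
}

/*
 * Return the next heap block number to prefetch, transparently
 * crossing leaf-page boundaries.  A callback shaped like this can
 * keep handing out blocks well ahead of the current leaf page.
 */
static unsigned
stream_next_block(BlockStream *s)
{
    if (s->pos >= s->nitems)
        stream_read_leaf(s);
    return s->blknos[s->pos++];
}
```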

--
Peter Geoghegan


