Re: index prefetching - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: index prefetching
Msg-id CA+hUKGKmka4FSJhKpf3tbEwtGZoKtPcZC-vDzTyBJ7bdys=V+A@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Fri, Aug 15, 2025 at 11:21 AM Tomas Vondra <tomas@vondra.me> wrote:
> I don't recall all the details, but IIRC my impression was it'd be best
> to do this "caching" entirely in the read_stream.c (so the next_block
> callbacks would probably not need to worry about lastBlock at all),
> enabled when creating the stream. And then there would be something like
> read_stream_release_buffer() that'd do the right thing to release the buffer
> when it's not needed.

I've thought about this problem quite a bit.  xlogprefetcher.c was
designed to use read_stream.c, as the comment above LsnReadQueue
vaguely promises, and I have mostly working patches to finish that job
(more soon).  The WAL is naturally full of repetition with
interleaving patterns, so there are many opportunities to avoid buffer
mapping table traffic, pinning, content locking and more.

I'm not sure that read_stream.c is necessarily the right place,
though.  I have experimented with that a bit, using a small window of
recently accessed blocks, with various designs.

One of my experiments did it further down.  I shoved a cache line of
blocknum->buffernum mappings into SMgrRelation so you can skip the
buffer mapping table and find repeat accesses.  I tried FIFO
replacement, vectorised CLOCK (!) and some harebrained things for this
nano-buffer map.  At various times I had goals including remembering
where to find the internal pages in a high frequency repeated btree
search (eg inserting with monotonically increasing keys or nested loop
with increasing or repeated keys), and, well, lots of other stuff.
That was somewhat promising (you can see a variant of that in one of
the patches in the ReadRecentBuffer() thread that I will shortly be
rehydrating), but I wasn't entirely satisfied because it still had to
look up the local pin count, if there is one, so I had plans to
investigate a tighter integration with that stuff too.  Coming back to
the WAL, I want something that can cheaply find the buffer and bump
the local pin count (rather than introducing a secondary reference
counting scheme in the WAL that I think you might be describing?), and
I want it to work even if it's not in the read ahead window because
the distance is very low, ie fully cached replay.
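
To make the nano-buffer map idea concrete, here is a minimal sketch of a
small blocknum->buffernum cache with FIFO replacement.  It is only an
illustration, not code from any of the patches mentioned: every name in
it (NanoBufferMap, nano_map_lookup, and so on) is hypothetical, the slot
count is an assumption chosen so that the two arrays fill 64 bytes, and
it ignores pinning entirely.  A hit should be treated as a hint only;
the caller would still have to pin the buffer and verify that it still
holds the expected block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical and exist only for illustration. */
#define NANO_MAP_SLOTS      8           /* 8 * (4 + 4) bytes = 64 bytes of mappings */
#define NANO_INVALID_BLOCK  UINT32_MAX  /* marks an empty slot */

typedef struct NanoBufferMap
{
    uint32_t    blocks[NANO_MAP_SLOTS];     /* cached block numbers */
    int32_t     buffers[NANO_MAP_SLOTS];    /* buffer ids for those blocks */
    uint32_t    next;                       /* FIFO cursor: next slot to overwrite */
} NanoBufferMap;

/* Mark every slot empty. */
static void
nano_map_init(NanoBufferMap *map)
{
    for (int i = 0; i < NANO_MAP_SLOTS; i++)
    {
        map->blocks[i] = NANO_INVALID_BLOCK;
        map->buffers[i] = -1;
    }
    map->next = 0;
}

/*
 * Linear scan of the small array.  On a hit, store the remembered buffer id
 * in *buffer and return true.  The result is only a hint: the buffer may
 * have been recycled since it was recorded.
 */
static bool
nano_map_lookup(const NanoBufferMap *map, uint32_t block, int32_t *buffer)
{
    if (block == NANO_INVALID_BLOCK)
        return false;
    for (int i = 0; i < NANO_MAP_SLOTS; i++)
    {
        if (map->blocks[i] == block)
        {
            *buffer = map->buffers[i];
            return true;
        }
    }
    return false;
}

/* Remember block -> buffer, overwriting the oldest entry when full (FIFO). */
static void
nano_map_insert(NanoBufferMap *map, uint32_t block, int32_t buffer)
{
    assert(block != NANO_INVALID_BLOCK && buffer >= 0);

    /* Refresh in place if the block is already cached. */
    for (int i = 0; i < NANO_MAP_SLOTS; i++)
    {
        if (map->blocks[i] == block)
        {
            map->buffers[i] = buffer;
            return;
        }
    }

    map->blocks[map->next] = block;
    map->buffers[map->next] = buffer;
    map->next = (map->next + 1) % NANO_MAP_SLOTS;
}
```

A caller would try nano_map_lookup() first and fall back to the regular
buffer mapping table on a miss, recording the result with
nano_map_insert().  Swapping FIFO for CLOCK would only change how the
victim slot is chosen in nano_map_insert().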

Anyway, that was all about microscopic stuff that I want to do to speed
up CPU bound replay with little or no I/O.

This stall on repeated access to a block with IO already in progress
is a different beast, and I look forward to checking out the patch
that Andres just described.  By funny coincidence I was just studying
that phenomenon and code path last week in the context of my
io_method=posix_aio patch.  There, completing other processes' IOs is
a bit more expensive and I was thinking about ways to give the
submitting backend more time to handle it if this backend is only
looking ahead and doesn't strictly need the IO to be completed right
now to make progress.  I was studying competing synchronized_scans, ie
other backends' IOs, not repeat access in this backend, but the
solution he just described sounds like a way to kill two birds with
one stone, and makes a pretty good trade-off: the other guy's IO
almost certainly won't fail, and we almost certainly aren't
deadlocked, and if that bet is wrong we can deal with it later.
