Re: index prefetching - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: index prefetching
Date
Msg-id CA+hUKGJOstoEipeKc5naLFjK2TQK5=hKvJ-1pzoQN4U-Mt1ztQ@mail.gmail.com
Whole thread Raw
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Aug 7, 2025 at 8:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Aug 5, 2025 at 7:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > There must be a similar opportunity for parallel index scans.  It has
> > that "seize the scan" concept where parallel workers do one-at-a-time
> > locked linked list leapfrog.
>
> True. More generally, flexibility to reorder work would be useful there.
>
> The structure of parallel B-tree scans is one where each worker
> performs its own "independent" index scan. The workers each only
> return tuples from those leaf pages that they themselves manage to
> read. That isn't particularly efficient, since we'll usually have to
> merge the "independent" index scan tuples together once again using a
> GatherMerge.

Yeah.  This all sails close to the stuff I wrote about in the
post-mortem of my failed attempt to teach parallel bitmap heapscan not
to throw I/O combining opportunities out the window for v18, after
Melanie streamified BHS.  Basically you have competing goals:

 * preserve natural ranges of blocks up to io_combine_limit
 * make workers run out of work at the same time
 * avoiding I/O stalls using lookahead and concurrency

You can't have all three right now: I/O streams are elastic so
allocation decisions made at the producer end don't control work
*finishing* time, so we need something new.  I wrote about an idea
based on work stealing when data runs out.  Read streams would work
independently, but cooperate at end of stream, avoiding interlocking
almost all the time.  That was basically a refinement of an earlier
"shared" read stream that seems too locky.  (Obviously the
seize-the-scan block producer is a total lockfest navitaging a
link-at-a-time data structure, but let's call that a use-case specific
problem.)

Other architectures are surely possible too, including admitting that
precise streams are not right for that problem, and using something
like the PredictBlock() approach mentioned below for prefetching and
then sticking to block-at-a-time work distribution.  Or we could go
the other way and admit that block-at-a-time is also not ideal -- what
if some blocks are 10,000 times more expensive to process than others?
-- and do work stealing that degrades ultimately to tuple granularity,
a logical extreme position.

> > Basically, the stuff that we can't fix with "precise" I/O
> > streaming as I like to call it, where it might still be interesting to
> > think about opportunities to do fuzzier speculative lookahead.  I'll
> > start a new thread.
>
> That sounds interesting. I worry that we won't ever be able to get
> away without some fallback that behaves roughly like OS readahead.

Yeah.  I might write about some of these things properly but here is
an unfiltered brain dump of assorted theories of varying
crackpottedness and some illustrative patches that I'm *not*
proposing:

 * you can make a dumb speculative sequential readahead stream pretty
easily, but it's not entirely satisfying: here's one of the toy
patches I mentioned in Athens, that shows btree leaf scans (but not
parallel ones) doing that, producing nice I/O combining and
concurrency that ramps up in the usual way if it happens to be
sequential (well I just rebased this and didn't test it, but it should
still work); I will describe some other approaches to try to place
this in the space of possibilities I'm aware of...

 * you could make a stream that pulls leaf pages from higher level
internal pages on demand (if you want to avoid the flow control
problems that come from trying to choose a batch size up front before
you know you'll even need it by using a pull approach), or just notice
that it looks sequential and install a block range producer, and if
that doesn't match the next page pointers by the time you get there
then you destroy it and switch strategies, or something

 * you could just pretend it's always sequential and reset the stream
every time you're wrong or some only slightly smarter scheme than
that, but it's still hard to know what's going on in cooperating
processes...

 * you could put sequential extent information in meta blocks or
somehow scatter hints around...

 * you could instead give up on explicit streams for fuzzy problems,
and teach the buffer pool to do the same tricks as the kernel, with a
scheme that lets go of the pins and reacquires them later (hopefully
cheaply with ReadRecentBuffer(), by leaving a trail of breadcrumbs in
SMgrRelation or shared memory, similar to what I already proposed for
btree root pages and another related patch that speeds up seq scans,
which I plan to repost soon): SMgrRelation could hold the state
necessary for the buffer pool to notice when you keep calling
ReadBuffer() for sequential blocks and begin to prefetch them
speculatively with growing distance heuristics so it doesn't overdo
it, but somehow not hold pins on your behalf (this was one of the
driving concerns that made me originally think that I needed an
explicit stream as an explicit opt-in and scoping for extra pins,
which an AM might not want at certain times, truncation or cleanup or
something, who knows)

 * you could steal Linux's BM_READAHEAD concept, where speculatively
loaded pages carry a special marker so they can be recognized by later
ReadBuffer() calls to encourage more magic readahead, because it's
measurably fruitful; this will be seen also by other backends, eg
parallel workers working on the same problem, though there is a bit of
an interleaving edge problem (you probably want to know if adjacent
pages have the flag in some window, and I have an idea for that that
doesn't involve the buffer mapping table); in other words the state
tracked in SMgrRelation is only used to ignite readahead, but shared
flags in or parallel with the buffer pool apply fuel

 * from 30,000 feet, the question is what scope you do the detection
at; you can find examples of OSes that only look at one fd for
sequential detection and only consider strict-next-block (old Unixen,
I suspect maybe Windows but IDK), systems that have a tolerance
windows (Linux), systems that search a small table of active streams
no matter which fds they're coming through (ZFS does this I think,
sort of like our own synchronized_scan detector), and systems that use
per-page accounting to measure and amplify success, and we have
analogies in our architecture as candidate scopes: explicit stream
objects, the per-backend SMgrRelation, the proposed system-wide
SharedSMgrRelation, the buffer pool itself, and perhaps a per-buffer
hint array with relaxed access (this last is something I've
experimented with, both as a way to store relaxed navigation
information for sequential scans skipping the buffer mapping table and
as a place to accumulate prefetch-driving statistics)

 * so far that's just talking about sequential heuristics, but we have
many clues the kernel doesn't, it's just that they're not always
reliable enough for a "precise" read streams and we might not want to
use the approach to speculation I mentioned above where you have a
read stream but as soon as it gives you something you didn't expect
you have to give up completely or reset it and start feeding it again;
presumably you could code around that with a fuzzy speculation buffer
that tolerates a bit of disorder

 * you could make a better version of PrefetchBuffer() for guided
prefetching, let's call it PredictBuffer(), that is initially lazy but
if the predictions turn out to be correct it starts looking further
ahead in the stream of predictions you made and eventually becomes
quite eager, like PrefetchBuffer(), but just lazy enough to perform
I/O combining; note that I'm now talking about "pushing" rather than
"pulling" predictions, another central question with implications, and
one of the nice things about read streams

 * for a totally different line of attack that goes back to precise
pull-based streams, you could imagine a read stream that lets you
'peek' at data coming down the pipe as soon as it is ready (ie it's
already in cache, or it isn't but the IO finishes before the stream
consumer gets to it), so you can get a head start on a jump requiring
I/O in a self-referential data structure like a linked list (with some
obvious limitations); here is a toy patch that allows you to install
such a callback, which could feed next-block information to the main
block number callback, so now we have three times of interest in the
I/O stream: block numbers are pulled into the producer end, valid
pages are pushed out to you as soon as possible somewhere in the
middle or probably often just a step ahead the producer and can feed
block numbers back to it, and pages are eventually pulled out of the
consumer end for processing; BUT NOTE: this idea is not entirely
compatible with the lazy I/O completion draining of io_method=io_uring
(or the posix_aio patch I dumped on the list the other day, and the
Windows equivalent could plausibly go either way), and works much
better with io_method=worker whose completions are advertised eagerly,
so this implementation of the idea is  a dead end, if even the goal
itself is interesting, not sure

 * the same effect could be achieved with chained streams where the
consumer-facing stream is a simple elastic queue of buffers that is
fed by the real I/O stream, with the peeking in between; that would
suit those I/O methods much better; it might need a new
read_stream_next_buffer_conditional() that calls that
WaitReadBuffersWouldStall() function, unless the consumer queue is
empty and it has to call read_stream_next_buffer() which might block;
the point being to periodically pump the peeking mechanism

 * the peek concept is pretty weak on its own because it's hard to
reach a state where you have enough lookahead window that it can
follow a navigational jump in time to save you from a stall but ...
maybe there are streams that contain a lot of either sequential or
well cached blocks with occasional jumps to random I/O; if you could
somehow combine the advanced vapourware of several of these magic
bullet points, then perhaps you can avoid some stalls

Please take all of that with an absolutely massive grain of salt, it's
just very raw ideas...

Attachment

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: event trigger support for PL/Python
Next
From: Bertrand Drouvot
Date:
Subject: Re: BF mamba failure