Re: index prefetching - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: index prefetching
Msg-id CAH2-WzkX2fwtiNOX4RrBR8=XKta999NM_5+ghTUnwUBkeyfcHQ@mail.gmail.com
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Tue, Aug 5, 2025 at 10:52 AM Tomas Vondra <tomas@vondra.me> wrote:
> I ran some more tests, comparing the two patches, using data sets
> generated in a way to have a more gradual transition between correlated
> and random cases.

Cool.

> So this fuzz is the primary knob. Obviously, incrementing fuzz by "1"
> would produce way too many tests, with very little change between
> them. Instead, I used a doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64,
> ... $rows. This way it takes only about ~25 steps for the $fuzz to
> exceed $rows=10M.

I think that it probably makes sense to standardize on fewer distinct
"fuzz" settings going forward. It's useful to test more things at
first, but the focus will soon shift to quickly measuring the
performance impact of each new patch revision.
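For reference, the doubling schedule described above can be sketched as follows (a hypothetical helper for illustration, not taken from the actual test scripts):

```python
def fuzz_schedule(rows):
    """Return the fuzz values 0, 1, 2, 4, ... up to the first
    power of two that exceeds rows, mirroring the doubling
    strategy described in the test setup."""
    steps = [0]
    fuzz = 1
    while fuzz <= rows:
        steps.append(fuzz)
        fuzz *= 2
    steps.append(fuzz)  # first value exceeding rows
    return steps

# For rows=10M this gives ~25 steps before fuzz exceeds rows,
# matching the "about ~25 steps" estimate above.
```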

> I don't claim those data sets are perfect, or a great representation of
> particular (real-world) data sets. It seems like a much nicer transition
> between random and correlated data sets.

That makes sense to me.

A test suite that is representative of real-world usage patterns isn't
so important. But it is important that we have at least one test for
each interesting variation of an index scan. What exactly that means
is subject to interpretation, and will likely evolve over time. But
the general idea is that we should choose tests that experience has
shown to be particularly good at highlighting the advantages or
disadvantages of one approach over another (e.g., simple vs complex).

It's just as important that we cut tests that don't seem to tell us
anything we can't get from some other tests. I suspect that many $fuzz
values aren't at all interesting. We double to get each increment, but
that probably isn't all that informative, outside of the extremes.

It'd also be good to just stop testing "sync" at some point. And maybe
to standardize on testing either "worker" or "io_uring" for most
individual tests. There are just too many tests right now.

> In most cases the two patches perform fairly close - the green and red
> data series mostly overlap. But there are cases where the complex patch
> performs much better - especially for low fuzz values. Which is not
> surprising, because those cases require higher prefetch distance, and
> the complex patch can do that.

Right.

> It surprised me a bit that the complex patch can actually help even in
> cases where I'd not expect prefetching to help very much - e.g. fuzz=0
> is perfectly correlated, so I'd expect read-ahead to work just fine.
> Yet the complex patch can help ~2x (at least when scanning a larger
> fraction of the data).

Maybe it has something to do with reading multiple leaf pages together
leading to fewer icache misses.

Andres recently told me that he isn't expecting to be able to simulate
read-ahead with direct I/O. It seems possible that read-ahead
eventually won't be used at all, which argues for the complex patch.

BTW, I experimented with using READ_STREAM_USE_BATCHING (not
READ_STREAM_DEFAULT) in the complex patch. That's probably
deadlock-prone, but I suspect that it works well enough to get a good
sense of what is possible. What I saw (with that same TPC-C test
query) was that "I/O Timings" was about 10x lower, even though the
query runtime didn't change at all. This suggests to me that "I/O
Timings" is an independently interesting measure: getting it lower
might not visibly help when only one query runs, but it'll likely
still lead to more efficient use of available I/O bandwidth in the
aggregate (when many queries run at the same time).
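For readers unfamiliar with the flag, the experiment amounts to something like the following sketch against PostgreSQL's read stream API (illustrative only, not code from the patch; the callback name and state variable are placeholders, and real code must guarantee the callback is deadlock-safe):

```c
/*
 * Sketch: switching a read stream from the default flags to batched
 * AIO submission.  With READ_STREAM_USE_BATCHING, the stream may call
 * the block-number callback while an I/O batch is being assembled, so
 * the callback must not itself block on I/O -- that's the deadlock
 * risk mentioned above.
 */
ReadStream *stream;

stream = read_stream_begin_relation(READ_STREAM_USE_BATCHING, /* was READ_STREAM_DEFAULT */
                                    NULL,          /* default buffer access strategy */
                                    heapRel,       /* relation being scanned */
                                    MAIN_FORKNUM,
                                    next_block_cb, /* placeholder callback */
                                    callback_state,
                                    0);            /* no per-buffer data */
```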

> There are also a couple cases where "simple" performs better than
> "complex". But most of the time this is only for the "sync" iomethod,
> and when scanning a significant fraction of the data (10%+). So that
> doesn't seem like a great argument in favor of the simple patch:
> "sync" is not a proper AIO method, I've been arguing against using it
> as a default, and with methods like "worker" the "complex" patch often
> performs better ...

I suspect that this is just a case of "sync" making aggressive
prefetching a bad idea in general.

> Let's say the complex patch is the way to go. What are the open problems
> / missing parts we need to address to make it committable?

I think that what you're interested in here is mostly project risk --
things that come with a notable risk of blocking commit/significantly
undermining our general approach.

> I can think of these issues. I'm sure the list is incomplete and there
> are many "smaller issues" and things I haven't even thought about:

I have a list of issues to solve in my personal notes. Most of them
aren't particularly important.

> 1) Making sure the interface can work for other index AMs (both in core
> and out-of-core), including cases like GiST etc.

What would put your mind at ease here? Maybe you'd feel better about
this if we also implemented prefetching for at least one other index
AM. Maybe GiST, since it's likely both the next-hardest and next most
important index AM (after nbtree).

Right now, I'm not motivated to work on the patch at all, since it's
still not clear that any of it has buy-in from you. I'm willing to do
more work to try to convince you, but it's not clear what it would
take/where your doubts are. I'm starting to be concerned about that
just never happening, quite honestly. Getting a feature of this
complexity into committable shape requires laser focus.

> 2) Proper layering between index AM and table AM (like the TID issue
> pointed out by Andres some time ago).
>
> 3) Allowing more flexible management of prefetch distance (this might
> involve something like the "scan manager" idea suggested by Peter),
> various improvements to ReadStream heuristics, etc.

The definition of "scan manager" is quite fuzzy right now. I think
that the "complex" patch already implements a very basic version of
that idea.

To me, the important point was always that the general design/API of
index prefetching be structured in a way that would allow us to
accommodate more sophisticated strategies. As I've said many times,
somebody needs to see all of the costs and all of the benefits --
that's what's needed to make optimal choices.

> 4) More testing to minimize the risk of regressions.
>
> 5) Figuring out how to make this work for IOS (the simple patch has some
> special logic in the callback, which may not be great; not sure what
> the right solution is in the complex patch).

I agree that all these items are probably the biggest risks to the
project. I'm not sure that I can attribute this to the use of the
"complex" approach over the "simple" approach.

> 6) ????

I guess that this means "unknown unknowns", which are another significant risk.

--
Peter Geoghegan


