Re: index prefetching - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 1a4536f9-e4cb-419d-84ab-7ed6cfc62fa5@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On 7/23/25 03:31, Peter Geoghegan wrote:
> On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra <tomas@vondra.me> wrote:
>>> I happen to think that that's a very unrealistic assumption. Most
>>> standard benchmarks have indexes that almost all look fairly similar
>>> to pgbench_accounts_pkey, from the point of view of "heap page blocks
>>> per leaf page". There are exceptions, of course (e.g., the TPC-C order
>>> table's primary key suffers from fragmentation).
>>>
>>
>> I agree with all of this.
> 
> Cool.
> 
>> I assume you mean results for the "linear" data set, because for every
>> other data set the patches perform almost exactly the same (when
>> restoring the distance after stream reset):
>>
>>
>> https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf
> 
> Right.
> 
>> And it's a very good point. I was puzzled by this too for a while, and
>> it took me a while to understand how/why this happens. It pretty much
>> boils down to the "duplicate block" detection and how it interacts with
>> the stream resets (again!).
> 
> I think that you slightly misunderstand where I'm coming from here: it
> *doesn't* puzzle me. What puzzled me was that it puzzled you.
> 
> Andres' test query is very simple, and not entirely sympathetic
> towards the complex patch (by design). And yet it *also* gets quite a
> decent improvement from the complex patch. It doesn't speed things up
> by another order of magnitude or anything, but it's a very decent
> improvement -- one well worth having.
> 
> I'm also unsurprised at the fact that all the other tests that you ran
> were more or less a draw between simple and complex. At least not now
> that I've drilled down and understood what the indexes from those
> other test cases actually look like, in practice.
> 
>> So you're right the complex patch prefetches far ahead. I thought the
>> distance will quickly decrease because of the duplicate blocks, but I
>> missed the fact that the read stream will not see them at all.
> 
> FWIW I wasn't thinking about it at anything like that level of
> sophistication. Everything I've said about it was based on intuitions
> about how the prefetching was bound to work, for each different kind
> of index. I just looked at individual leaf pages (or small groups of
> them) from each index/test, and considered their TIDs, and imagined
> how that was likely to affect the scan.
> 
> It just seems obvious to me that all the tests (except for "linear")
> couldn't possibly be helped by eagerly reading multiple leaf pages. It
> seemed equally obvious that it's quite possible to come up with a
> suite of tests that have several tests that could benefit in the same
> way (not just 1). Although your "linear_1"/"linear_N" tests aren't
> actually like that, many cases will be -- and not just those that are
> perfectly correlated ala pgbench.
> 
>> I'm not sure it's desirable to "hide" blocks from the read stream like
>> this - it'll never see the misses. How could it make good decisions,
>> when we skew the data used by the heuristics like this?
> 
> I don't think that I fully understand what's desirable here myself.
> 
>>> Doing *no* prefetching will usually be the right thing to do. Does
>>> that make index prefetching pointless in general?
>>>
>>
>> I don't think so. Why would it? There's plenty of queries that can
>> benefit from it a lot, and as long as it doesn't cause harm to other
>> queries it's a win.
> 
> I was being sarcastic. That wasn't a useful thing for me to do. Apologies.
> 
>> This is not resetting the stream, though. This is resetting the position
>> that tracks how far the stream got.
> 
> My main point is that there's stuff going on here that nobody quite
> understands just yet. And so it probably makes sense to defensively
> assume that the prefetch distance resetting stuff might matter with
> either the complex or simple patch.
> 
>> Sorry, got distracted and forgot to complete the sentence. I think I
>> wanted to write "mostly from not resetting the distance to 1". Which is
>> true, but the earlier "linear" example also shows there are cases where
>> the page boundaries are significant.
> 
> Of course that's true. But that was just a temporary defect of the
> "simple" patch (and perhaps even for the "complex" patch, albeit to a
> much lesser degree). It isn't really relevant to the important
> question of whether the simple or complex design should be pursued --
> we know that now.
> 
> As I said, I don't think that the test suite is particularly well
> suited to evaluating simple vs complex. Because there's only one test
> ("linear") that has any hope of being better with the complex patch.
> And because having only 1 such test isn't representative.
> 
>> That's actually intentional. I wanted to model tables with wider tuples,
>> without having to generate all the data etc. Maybe 25% is too much, and
>> real tables have more than 20 tuples per page. It's true 400B is fairly large.
> 
> My point about fill factor isn't particularly important.
> 

Yeah, the randomness of the TIDs matters much more than the fill factor.

>> I'm not against testing with other parameters, of course. The test was
>> not originally written for comparing different prefetching patches, so
>> it may not be quite fair (and I'm not sure how to define "fair").
> 
> I'd like to see more than 1 test where eagerly reading leaf pages has
> any hope of helping. That's my only important concern.
> 

Agreed.

>> It's not uniformly random - as I wrote, it uses a normal distribution. The
>> query in the SQL script does this:
>>
>>   select x + random_normal(0, 1000) from ...
>>
>> It is a synthetic test data set, of course. It's meant to be simple to
>> generate and reason about, and somewhere in between the "linear" and
>> "uniform" data sets.
> 
> I always start by looking at the index leaf pages, and imagining how
> an index scan can/will deal with that.
> 
> Just because it's not truly uniformly random doesn't mean that that's
> apparent when you just look at one leaf page -- heap blocks might very
> well *appear* to be uniformly random (or close to it) when you drill
> down like that. Or even when you look at (say) 50 neighboring leaf
> pages.
> 

Yeah, the number of heap blocks per leaf page is a useful measure. I
should have thought about that.

The other thing worth tracking is probably how the number of distinct heap
blocks grows as the scan moves through consecutive leaf pages, to measure
the "hit ratio".

I should have thought about this more when creating the data sets ...
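
Something like this might work for the per-leaf-page measure (an untested
sketch - it assumes pageinspect, "idx_test" is a placeholder index name,
and it only counts the first heap TID of posting-list tuples, so the
numbers are approximate):

  SELECT s.blkno,
         count(i.htid) AS nhtids,
         count(DISTINCT (i.htid::text::point)[0]::int) AS nhblks
    FROM generate_series(1, (pg_relation_size('idx_test') / 8192)::int - 1) AS blkno,
         LATERAL bt_page_stats('idx_test', blkno) AS s,
         LATERAL bt_page_items('idx_test', blkno) AS i
   WHERE s.type = 'l'          -- leaf pages only
   GROUP BY s.blkno
   ORDER BY s.blkno;

Grouping by (s.blkno / 16) or so instead would show how the number of
distinct heap blocks grows across groups of neighboring leaf pages, which
is roughly the "hit ratio" thing - at least for freshly built indexes,
where the physical and logical leaf order mostly match.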

>> But it also has realistic motivation - real tables are usually not as
>> clean as "linear", nor as random as the "uniform" data sets (not for all
>> columns, at least). If you're looking at data sets like "orders" or
>> whatever, there's usually a bit of noise even for columns like "date"
>> etc. People modify the orders, or fill-in data from a couple days ago,
>> etc. Perfect correlation for one column implies slightly worse
>> correlation for another column (order date vs. delivery date).
> 
> I agree.
> 
>> Right. I don't see a problem with this. I'm not saying parameters for
>> this particular data set are "perfect", but the intent is to have a
>> range of data sets from "perfectly clean" to "random" and see how the
>> patch(es) behave on all of them.
> 
> Obviously none of your test cases are invalid -- they're all basically
> reasonable, when considered in isolation. But the "linear_1" test is
> *far* closer to the "uniform" test than it is to the "linear" test. At
> least as far as the simple vs complex question is concerned.
> 

Perhaps not invalid, but it also does not cover the space of possible
data sets the way I intended. It seems all the data sets are much more
random than I expected.
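
For the next round I'll try to generate a wider range of data sets, with
the noise increasing gradually, roughly like this (just a sketch - it
assumes random_normal() from PG 16+, the names are made up, and the actual
scripts may end up looking different):

  -- stddev 0 gives the "linear" case; larger values wash out more and
  -- more of the correlation between index order and heap order
  CREATE TABLE t_noise AS
  SELECT i + random_normal(0, 100) AS a,    -- try 0, 100, 10000, 1000000, ...
         md5(i::text) AS filler
    FROM generate_series(1, 10000000) AS i;

  CREATE INDEX ON t_noise (a);

That should make the number of distinct heap blocks per leaf page grow
gradually from ~1 towards "every TID on a different page", instead of
jumping straight to the random end.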

>> If you have a suggestion for different data sets, or how to tweak the
>> parameters to make it more realistic, I'm happy to try those.
> 
> I'll get back to you on this soon. There are plenty of indexes that
> are not perfectly correlated (like pgbench_accounts_pkey is) that'll
> nevertheless benefit significantly from the approach taken by the
> complex patch. I'm sure of this because I've been using the query I
> posted earlier for many years now -- I've thought about and directly
> instrumented the "nhtids:nhblks" of an index of interest many times in
> the past.
> 

Thanks!


regards

-- 
Tomas Vondra