Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id 03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: index prefetching
List pgsql-hackers

On 7/16/25 15:36, Peter Geoghegan wrote:
> On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra <tomas@vondra.me> wrote:
>> But the thing I don't really understand is the "cyclic" dataset (for
>> example). And the "simple" patch performs really badly here. This data
>> set is designed to not work for prefetching, it's pretty much an
>> adversary case. There's ~100 TIDs from 100 pages for each key value, and
>> once you read the 100 pages you'll hit them many times for following
>> values. Prefetching is pointless, and skipping duplicate blocks can't
>> help, because the blocks are not effective.
>>
>> But how come the "complex" patch does so much better? It can't really
>> benefit from prefetching TID from the next leaf - not this much. Yet it
>> does a bit better than master. I'm looking at this since yesterday, and
>> it makes no sense to me. Per "perf trace" it actually does 2x as many
>> fadvise calls compared to the "simple" patch (which is strange on its
>> own, I think), yet it's apparently so much faster?
> 
> The "simple" patch has _bt_readpage reset the read stream. That
> doesn't make any sense to me. Though it does explain why the "complex"
> patch does so many more fadvise calls.
> 

Why doesn't it make sense? The read_stream_reset() call restarts the
stream after it got "terminated" on the preceding leaf page (by the
callback returning InvalidBlockNumber). It'd be better to "pause" the
stream somehow, but there's nothing like that yet. We have to terminate
it and start again.
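To make the terminate-and-restart dance concrete, here's a minimal
self-contained sketch. These are mock types, not the actual read stream
API: MockStream, mock_stream_next(), and the distance field are my
illustrative assumptions about the shape of the mechanism.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's InvalidBlockNumber sentinel. */
#define INVALID_BLOCK ((uint32_t) 0xFFFFFFFF)

/* Mock "stream": hands out heap block numbers for one leaf page and
 * terminates by returning INVALID_BLOCK, mimicking how a read stream
 * callback signals end-of-stream. */
typedef struct MockStream
{
    const uint32_t *blocks;     /* heap blocks for the current leaf page */
    size_t          nblocks;
    size_t          next;       /* position of next block to hand out */
    int             distance;   /* look-ahead distance heuristic */
} MockStream;

static uint32_t
mock_stream_next(MockStream *stream)
{
    if (stream->next >= stream->nblocks)
        return INVALID_BLOCK;   /* terminate: there is no "pause" API */
    return stream->blocks[stream->next++];
}

/* Restart after termination, once per leaf page. Note the look-ahead
 * distance collapses back to 1 on every reset, which is one way the
 * reset could keep defeating the prefetch heuristics. */
static void
mock_stream_reset(MockStream *stream, const uint32_t *blocks, size_t nblocks)
{
    stream->blocks = blocks;
    stream->nblocks = nblocks;
    stream->next = 0;
    stream->distance = 1;       /* heuristics start over from scratch */
}
```

The point of the sketch: every leaf page boundary forces the stream back
to its initial state, so any look-ahead it had built up is lost.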

But why would it explain the increase in fadvise calls?

FWIW the pattern of fadvise call is quite different. For the simple
patch we end up doing just this:

fadvise block 1
read block 1
fadvise block 2
read block 2
fadvise block 3
read block 3
...

while for the complex patch we do a small batch (~10) of fadvise calls,
followed by fadvise/read pairs for the same set of blocks:

fadvise block 1
fadvise block 2
...
fadvise block 10
read block 1
fadvise block 2
read block 2
...
fadvise block 10
read block 10

This might explain the advantage of the "complex" patch: it can actually
do some prefetching every now and then (if my calculation is right, only
about 5% of the blocks need prefetching).
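If my reading of the traces is right, the call counts work out as
follows. This is just a toy model of the two schedules above; the
function names and the batch size of 10 are assumptions taken from the
trace, not anything in the patches.

```c
/* "simple" patch: one fadvise issued immediately before each read,
 * so there is never any head start. */
static int
simple_fadvise_count(int nblocks)
{
    return nblocks;
}

/* "complex" patch: an initial batch of `batch` fadvise calls, then a
 * fadvise/read pair per block. Every block after the first is advised
 * twice (once in the batch, once right before its read). */
static int
complex_fadvise_count(int nblocks, int batch)
{
    int initial = (nblocks < batch) ? nblocks : batch;

    return initial + (nblocks - 1);
}
```

For 10 blocks and a batch of 10 this gives 10 vs. 19 calls, which lines
up with the roughly 2x fadvise count seen in "perf trace", while only the
initial batch provides any real head start.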

The pattern of fadvise+pread for the same block seems a bit silly. And
this is not just about the "sync" method; the other I/O methods will have
a similar issue with not starting the I/O earlier. The fadvise calls are
just easier to trace/inspect.
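To spell out why the pattern is silly, here's a self-contained sketch of
it using the real posix_fadvise()/pread() calls (the scratch-file path
and the helper name are illustrative, not from the patches): the
POSIX_FADV_WILLNEED hint is wasted because the pread for the same block
is issued before the kernel readahead has any chance to complete.

```c
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one 8kB "block" to a scratch file, then advise and immediately
 * read it back, i.e. the fadvise+pread pattern from the trace.
 * Returns the number of bytes read back, or -1 on error. */
long
advise_then_read(const char *path)
{
    char    block[8192];
    char    out[8192];
    int     fd;
    long    n;

    memset(block, 'x', sizeof(block));

    fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
    {
        close(fd);
        return -1;
    }

    /* The silly pattern: advise the block, then read it right away.
     * The advice buys nothing, since the read starts immediately. */
    posix_fadvise(fd, 0, (off_t) sizeof(block), POSIX_FADV_WILLNEED);
    n = (long) pread(fd, out, sizeof(out), 0);

    close(fd);
    unlink(path);
    return n;
}
```

The read succeeds either way, of course; the cost is that the fadvise is
pure overhead whenever no other work happens between the two calls.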

I suspect this might be an unintended consequence of the stream reset.
AFAIK it wasn't quite meant to be used this way, so maybe it confuses
the built-in heuristics deciding what to prefetch?

If that's the case, I'm afraid the "complex" patch will have this issue
too, because it will also need to "pause" the prefetching in some cases
(e.g. for index-only scans, or when the leaf pages contain very few
index tuples). It will be less common, of course.


> Another issue with the "simple" patch: it adds 2 bool fields to
> "BTScanPosItem". That increases its size considerably. We're very
> sensitive to the size of this struct (I think that you know about this
> already). Bloating it like this will blow up our memory usage, since
> right now we allocate MaxTIDsPerBTreePage/1358 such structs for
> so->currPos (and so->markPos). Wasting all that memory on alignment
> padding is probably going to have consequences beyond memory bloat.
> 

True, no argument here.


regards

-- 
Tomas Vondra



