Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id aaa14b36-0740-46db-a914-5dbfca129abd@vondra.me
In response to Re: index prefetching  (Konstantin Knizhnik <knizhnik@garret.ru>)
List pgsql-hackers
On 12/29/25 14:37, Konstantin Knizhnik wrote:
> 
> On 29/12/2025 1:53 AM, Tomas Vondra wrote:
>> It seems this is due to sending an extra SET (for the new GUC) in the
>> pgbench script, which is recognized only on the v5+threshold build.
>>
>> That's a thinko on my side, I should have realized the extra command
>> might affect this. It doesn't really affect the behavior, because 10 is
>> the default value for read_stream_threshold. I've fixed the script, will
>> check fresh results tomorrow.
>>
>> Still, I think most of what I said about heuristics when to initialize
>> the read stream, and the risk/benefit tradeoff, still applies.
> 
> I did a lot of experiments this morning, but could not find any
> noticeable difference in any configuration when the whole working set
> fits in shared buffers.

I haven't found any significant differences either, in my tests. It's
possible this is thanks to some of the recent improvements. I also
recall some of the regressions (that made us add this heuristic) were
tied to specific data patterns, and entirely invisible on other data
sets.

> And frankly speaking, after more thinking I do not see a good reason
> that could explain such a difference.
> Just initializing the read stream should not add much overhead - it
> does not seem to be an expensive operation.

Even a small fixed overhead can be quite visible for extremely short
queries that can't possibly benefit from prefetching (like the LIMIT 1
case).

That's what we were concerned about when we invented this heuristic
(which only prefetches from the 2nd batch). I agree it's crude, no
argument there. I'm not against having something better, as long as we
can make it reliable.
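
For reference, the heuristic has roughly this shape (xs_readstream,
xs_nitems and the callback name are illustrative, not the actual patch;
read_stream_begin_relation and read_stream_threshold are real):

```
if (scan->xs_readstream == NULL &&
    ++scan->xs_nitems >= read_stream_threshold)
{
    /* past the threshold, start prefetching through a read stream */
    scan->xs_readstream =
        read_stream_begin_relation(READ_STREAM_DEFAULT,
                                   NULL,    /* no access strategy */
                                   scan->heapRelation,
                                   MAIN_FORKNUM,
                                   index_scan_stream_read_next,
                                   scan,
                                   0);      /* no per-buffer data */
}
```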

One of the problems with read_stream is that it may look ahead very
far, much further than the query will need. Imagine an index-only
scan, with 99.99% pages being all-visible. At some point the scan will
hit an item that requires reading the heap page. Which triggers the
look-ahead, and a search for the *next* heap page.

But that next index entry may be many leaf pages ahead. And maybe the scan
never even gets to actually need that, which means the work will be
wasted. If the query only needs a couple tuples, this may cause a bad
regression.
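
To make that a bit more concrete, the look-ahead callback for an
index-only scan has roughly this shape (IOSStreamState and
ios_next_index_tid are hypothetical, untested; the callback signature
and VM_ALL_VISIBLE are the real ones):

```
static BlockNumber
ios_stream_next_block(ReadStream *stream, void *callback_private_data,
                      void *per_buffer_data)
{
    IOSStreamState *ios = (IOSStreamState *) callback_private_data;
    ItemPointer tid;

    /* keep consuming index items until one needs a heap fetch */
    while ((tid = ios_next_index_tid(ios)) != NULL)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(tid);

        if (!VM_ALL_VISIBLE(ios->heapRel, blkno, &ios->vmBuffer))
            return blkno;   /* heap page that actually needs a read */

        /* all-visible, answered from the index alone - keep going */
    }

    return InvalidBlockNumber;  /* nothing left to read */
}
```

With 99.99% of pages all-visible, satisfying a single request from the
stream can walk through many leaf pages' worth of items before the
loop finds a block to return.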

IIRC the same thing can happen for plain index scans. You may need to
try with correlated indexes (I think the read_stream callback skips
duplicate TIDs).

> What actually matters is async IO. Without the read stream, Postgres
> reads heap pages using sync operations: the backend just calls pread.
> With the read stream, AIO is used. By default the "worker" AIO mode is
> used, meaning the backend sends a request to one of the workers and
> waits for its completion. The worker receives the request, performs
> the IO and notifies the backend.
> Such interprocess communication adds significant overhead, and this is
> why, if we initialize the read stream from the very beginning, we get
> about ~4x worse performance with LIMIT 1.
> 

Yes, the IPC (with worker) can be quite costly. Especially with data
patterns that force one roundtrip per block.

> Please correct me if I am wrong (or if it is Mac-specific), but it is
> not caused by any overhead related to the read_stream itself, but by
> AIO.
> I have not run such an experiment, but it seems to me that if we make
> the read stream perform sync calls, there will be almost no difference
> in performance.

Why would it matter if the overhead is caused by AIO or the read_stream
itself? What matters is the impact on queries, no?
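
FWIW, isolating the AIO part is easy enough to test, assuming the
io_method GUC (set in postgresql.conf, requires a restart):

```
# postgresql.conf (PG 18, restart required)
io_method = sync        # synchronous reads, no worker IPC
#io_method = worker     # the default: I/O handed off to worker processes
#io_workers = 3         # number of AIO workers in "worker" mode
```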

Anyway, as I explained earlier, I believe some of the overhead is really
about the read_stream, which can be "forced" to do a lot of work which
is then wasted.

A couple days ago Thomas posted a patch with read_stream yield, allowing
the callback to "give up" if it gets too far ahead. That might mitigate
this particular issue.

> 
> When all data is cached in shared buffers, we do not perform IO at
> all.
> It means it doesn't matter whether and when we initialize read_stream.
> We can do it after processing 10 items (current default), or
> immediately - it should not affect performance.
> And this is what I have tested: performance actually does not depend
> on `read_stream_threshold` (if data fits in shared buffers).
> At least the difference is within a few percent, and it may be just
> random fluctuation.
> Obviously there is no 25% degradation.
> 

I'm not sure it's this simple. Even if we perform no physical I/O, the
read_stream callback will need to go through the whole batch and pass
all the items to the read_stream. The data may be cached, but not in
shared buffers, in which case the read_stream will do the IPC etc.

> 
> It definitely doesn't mean that it is not possible to find a scenario
> where this approach of enabling prefetch after processing N items will
> show worse performance than master or v5. We just need to properly
> choose the cache hit rate. But the same is true IMHO for v5 itself: it
> is possible to find a workload where it will show the same degradation
> compared with master.
> 
I'm sure such "adversarial" data sets exist, even if we don't have a
good example at hand. The question is how plausible / realistic such
data sets are, though.

I'm not sure about v5, though. The assumption was the overhead is most
visible for short queries, and once we get to the 2nd batch the query is
expensive enough to not be affected too much.

When you say "we just need to properly choose cache hit rate", how would
we do that? And how would we know/determine the optimal threshold?

> 
> A more precise heuristic should IMHO take into account the actual
> number of performed disk reads.

I'm not sure "disk reads" is the right term. The data may be in page
cache, and we just need to copy them into shared buffers.

> Please notice that I do not want to predict the number of disk reads -
> i.e. check whether candidates for prefetch are present in shared
> buffers. That would really add significant overhead. I think it is
> better to use the number of actually performed reads as the threshold.
> 

Not sure I understand what this says. Can you elaborate?

> Unfortunately it looks like it is not possible to accumulate such
> information without changing other Postgres code.
> For example, if `ReadBuffer` could somehow inform the caller that it
> actually performed a read, it would be easy to calculate the number of
> reads in `heapam_index_fetch_tuple`:
> 
> ```
> static pg_attribute_always_inline Buffer
> ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
>                   ForkNumber forkNum,
>                   BlockNumber blockNum, ReadBufferMode mode,
>                   BufferAccessStrategy strategy,
>                   bool *fast_path)
> {
>     ...
>     if (StartReadBuffer(&operation,
>                         &buffer,
>                         blockNum,
>                         flags))
>     {
>         /* buffer was not already valid, an actual read happened */
>         WaitReadBuffers(&operation);
>         *fast_path = false;
>     }
>     else
>         *fast_path = true;
> 
>     return buffer;
> }
> ```
> 
> It can certainly be achieved without changing the ReadBuffer* family,
> just by directly calling StartReadBuffer from
> `heapam_index_fetch_tuple` instead of `ReadBuffer`.
> Not so nice, because we have to duplicate some bufmgr code. Not much -
> the check for a local relation and filling the `ReadBuffersOperation`
> structure - but it would be better to avoid it.
> 
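
If I understand correctly, the caller side in heapam_index_fetch_tuple()
would then do something like this (untested sketch; ReadBufferCounted,
xs_nreads, read_stream_read_threshold and start_read_stream are
hypothetical names):

```
bool    did_read;

/* hypothetical ReadBuffer variant with the fast_path/did_read
 * out-parameter from the snippet above */
hscan->xs_cbuf = ReadBufferCounted(hscan->xs_base.rel,
                                   ItemPointerGetBlockNumber(tid),
                                   &did_read);
if (did_read)
{
    /* count only actual reads, not shared-buffer hits */
    if (++hscan->xs_nreads >= read_stream_read_threshold)
        start_read_stream(hscan);   /* switch over to prefetching */
}
```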

Yeah, this is not great. There must be a better way to do this.


regards

-- 
Tomas Vondra



