Re: index prefetching - Mailing list pgsql-hackers
| From | Konstantin Knizhnik |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | dcbc802c-97c8-48ea-a40f-358482b9def6@garret.ru |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |
> Even a small fixed overhead can be quite visible for extremely short queries that can't possibly benefit from prefetching (like LIMIT 1). That's what we were concerned about when we invented this heuristic (which only prefetches from the 2nd batch). I agree it's crude, no argument there. I'm not against having something better, as long as we can make it reliable.
I completely agree that we should not create the read stream from the very beginning, despite the small overhead.
My concern is that the second batch is not a good criterion: depending on the key size and the position of the first key on the page, the first batch can be too short.
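To make the two criteria concrete, here is a minimal sketch (the scan fields and `heap_block_cb` are invented names, not the actual patch code; `read_stream_threshold` is the knob discussed below):

```c
#include "postgres.h"
#include "storage/read_stream.h"

/* ... inside the scan code, before fetching the next heap tuple ... */
if (scan->rs == NULL)
{
    /*
     * v5 crude heuristic: prefetch only from the 2nd batch, i.e. once
     * the scan moves past the first leaf page:
     *
     *     enable = (scan->batchno >= 1);
     *
     * item-count alternative: count returned items instead of leaf
     * pages, so a short first leaf no longer matters:
     */
    bool enable = (scan->nitems >= read_stream_threshold);

    if (enable)
        scan->rs = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                              NULL,
                                              scan->heapRelation,
                                              MAIN_FORKNUM,
                                              heap_block_cb,
                                              scan,
                                              0);
}
```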
> One of the problems with read_stream is that it may look very far ahead, much further than the query will need. Imagine an index-only scan with 99.99% of pages all-visible. At some point the scan hits an item that requires reading the heap page, which triggers the look-ahead and a search for the *next* heap page. But that next index entry may be many leaf pages ahead, and maybe the scan never even gets to actually need it, which means the work will be wasted. If the query only needs a couple of tuples, this may cause a bad regression.
Sorry, I do not completely understand how this can happen.
The read stream can only fetch heap pages referenced by TIDs on leaf pages (leaf = batch).
So the read stream cannot advance beyond the TIDs of the currently processed leaf page, can it?
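To illustrate, a toy sketch of such a batch-driven callback (hypothetical names and structures, not the patch code): it can only return heap blocks for TIDs already taken from the current leaf, and stops when the batch is exhausted:

```c
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/read_stream.h"

/* invented per-scan state; not the actual patch structures */
typedef struct IndexPrefetchState
{
    ItemPointerData *batch_tids;   /* TIDs copied from the current leaf */
    int              batch_nitems; /* number of TIDs in the batch */
    int              next_item;    /* next TID to hand to the stream */
    BlockNumber      last_block;   /* last heap block returned */
} IndexPrefetchState;

static BlockNumber
index_prefetch_next_block(ReadStream *stream,
                          void *callback_private_data,
                          void *per_buffer_data)
{
    IndexPrefetchState *ps = (IndexPrefetchState *) callback_private_data;

    while (ps->next_item < ps->batch_nitems)
    {
        BlockNumber blkno =
            ItemPointerGetBlockNumber(&ps->batch_tids[ps->next_item++]);

        /* skip consecutive TIDs pointing at the same heap page */
        if (blkno == ps->last_block)
            continue;

        ps->last_block = blkno;
        return blkno;
    }

    /* batch exhausted: the stream cannot run ahead into later leaves */
    return InvalidBlockNumber;
}
```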
> IIRC the same thing can happen for plain index scans. You may need to try with correlated indexes (I think the read_stream callback skips duplicate TIDs).
Another problem is that different TIDs can refer to the same heap page (which happens with a clustered index or a table populated in key-ascending order).
But the number-of-SMGR-reads criterion should also work well in this case.
>> When all data is cached in shared buffers, we do not perform any I/O at all. That means it doesn't matter whether and when we initialize the read_stream: we can do it after processing 10 items (the current default) or immediately - it should not affect performance. And this is what I have tested: performance does not actually depend on `read_stream_threshold` (if the data fits in shared buffers). At least it is within a few percent, and maybe that is just random fluctuation. There is obviously no 25% degradation.

> I'm not sure it's this simple. Even if we perform no physical I/O, the read_stream callback will need to go through the whole batch and pass all the items to the read_stream. The data may be cached, but not in shared buffers, in which case the read_stream will do the IPC etc.
You mean that the data is present in the OS file cache, so reading it is fast and will not benefit from AIO?
That can certainly happen, but it seems to be the well-known double buffering problem.
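For completeness, a sketch of the consuming side, assuming the standard read stream API: even on a full shared-buffers hit, each block still goes through the block-number callback and a buffer lookup/pin, which is the fixed overhead in question (error handling omitted):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/* ... "stream" was created by read_stream_begin_relation() ... */
Buffer  buf;

while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* ... process the heap tuples on this page ... */
    ReleaseBuffer(buf);
}
read_stream_end(stream);
```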
>> It definitely doesn't mean that it is impossible to find a scenario where this approach (enabling prefetch after processing N items) shows worse performance than master or v5. We just need to properly choose the cache hit rate. But the same is true IMHO for v5 itself: it is possible to find a workload where it shows the same degradation compared with master.

> I'm sure such "adversarial" data sets exist, even if we don't have a good example at hand. The question is how plausible/realistic such data sets are, though. I'm not sure about v5, though. The assumption was that the overhead is most visible for short queries, and once we get to the 2nd batch the query is expensive enough to not be affected too much. When you say "we just need to properly choose cache hit rate", how would we do that? And how would we know/determine the optimal threshold?
Sorry, maybe I was unclear.
By "choosing the cache hit rate" I did not mean using it to determine the proper threshold.
I was thinking about the best and worst scenarios for index prefetch.
Originally, in my "benchmark" I considered the case where no heap pages are present in shared buffers.
In your scenario, shared buffers are larger than the table, so all pages are present in shared buffers.
The first case is certainly the best scenario for index prefetch, where we can expect the largest improvement.
In your case there is certainly no improvement, but as we can see, the overhead is also not that large (simply because we do not read any pages).
I expect the worst result in some intermediate case - when some pages are present in shared buffers and some are not.
This is especially true for the "second batch" criterion, because it can happen that we need to read just a single heap page for the whole batch, and using the read stream for it just adds extra overhead.
>> A more precise heuristic should IMHO take into account the actual number of performed disk reads.

> I'm not sure "disk reads" is the right term. The data may be in the page cache, and we just need to copy it into shared buffers.
I agree, but there seems to be no way to determine whether a page was actually read from disk or from the OS cache.
But as I wrote above, that is the double buffering problem; if we can avoid it (i.e. by using direct I/O), the problem disappears.
>> Please note that I do not want to predict the number of disk reads - i.e. to check whether the prefetch candidates are present in shared buffers. That would really add significant overhead. I think it is better to use the number of already performed reads as the threshold.

> Not sure I understand what this says. Can you elaborate?
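As a rough sketch of the idea (using the existing `pgBufferUsage` counter that EXPLAIN (BUFFERS) relies on; the scan fields are invented names): do not probe the buffer pool for future blocks, just watch how many reads the scan has already caused, and start the stream once that count crosses the threshold.

```c
#include "postgres.h"
#include "executor/instrument.h"   /* pgBufferUsage */
#include "storage/read_stream.h"

/* ... scan->reads_at_start was snapshotted from
 * pgBufferUsage.shared_blks_read when the scan began ... */
if (scan->rs == NULL)
{
    /* buffer reads this scan has already performed; no prediction,
     * no probing of shared buffers for future blocks */
    int64 reads_done =
        pgBufferUsage.shared_blks_read - scan->reads_at_start;

    if (reads_done >= read_stream_threshold)
        scan->rs = read_stream_begin_relation(...);  /* as sketched above */
}
```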
I sent the proposed patch in the previous mail.
Did you have a chance to look at it?