Re: index prefetching - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 5119a10a-4006-4f9c-a2de-9b366173295c@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: index prefetching
List: pgsql-hackers
On 8/25/25 19:57, Peter Geoghegan wrote:
> On Mon, Aug 25, 2025 at 10:18 AM Tomas Vondra <tomas@vondra.me> wrote:
>> Almost all regressions (at least the top ones) now look like this, i.e.
>> distance collapses to ~2.0, which essentially disables prefetching.
>
> Good to know.
>
>> But I no longer think it's caused by the "priorbatch" optimization,
>> which delays read stream creation until after the first batch. I still
>> think we may need to rethink that (e.g. if the first batch is huge), but
>> the distance can "collapse" even without it. The optimization just makes
>> it easier to happen.
>
> That shouldn't count against the "priorbatch" optimization. I still
> think that this issue should be treated as 100% unrelated to the
> "priorbatch" optimization.
>
> You might very well be right that the "priorbatch" optimization is too
> naive about index scans whose first/possibly only leaf page has TIDs
> that point to many distinct heap blocks (hundreds, say). But there's
> no reason to think that that's truly relevant to the problem at hand.
> If there was such a problem, then it wouldn't look like a regression
> against enable_indexscan_prefetch = off/master. We'd likely require a
> targeted approach to even notice such a problem; so far, most/all of
> our index scan test cases have read hundreds/thousands of index pages
> -- so any problem that's limited to the first leaf page read is likely
> to go unnoticed.
>
> I think that the "priorbatch" optimization at least takes
> *approximately* the right approach, which is good enough for now. It
> at least shouldn't ever do completely the wrong thing. It even seems
> possible that sufficient testing will actually show its naive
> approach to be the best one, on balance, once the cost of adding
> mitigations (costs for all queries, not just ones like the one you
> looked at recently) is taken into account.
>
> I suggest that we not even think about "priorbatch" until the problem
> on the read stream side is fixed. IMV we should at least have a
> prototype patch for the read stream that we're reasonably happy with
> before looking at "priorbatch" in further detail. I don't think we
> have that right now.
>

Right. I might have expressed it more clearly, but this is what I meant
when I said priorbatch is not causing this.

As for priorbatch, I'd still like to know where the overhead comes from.
I mean, what's the expensive part of creating a read stream? Maybe that
can be fixed, instead of delaying the creation, etc. Maybe the delay
could happen within read_stream?

>> AFAICS the distance collapse is "inherent" to how the distance gets
>> increased/decreased after hits/misses.
>
> Right. (I think that you'll probably agree with me about addressing
> this problem before even thinking about limitations in the
> "priorbatch" optimization, but I thought it best to be clear about
> that.)
>

Agreed.

>> I find this distance heuristic a bit strange, for a couple reasons:
>>
>> * It doesn't seem right to get stuck at distance=2 with 50% misses.
>>   Surely that would benefit from prefetching a bit more?
>
> Maybe, but at what cost? It doesn't necessarily make sense to continue
> to read additional leaf pages, regardless of the number of heap buffer
> hits in the recent past. At some point it likely makes more sense to
> just give up and do actual query processing/return rows to the scan.
> Even without a LIMIT. I have low confidence here, though.
>

Yes, it doesn't make sense to continue forever.
That was the point of distance_max in my patch - if we don't get enough
I/Os by that distance, we give up.

I'm not saying we should do whatever it takes to meet
effective_io_concurrency. It just seems a bit strange to ignore it like
this, because right now it has absolutely no impact on the read stream.
If the query gets into the "collapsed distance", it'll happen with any
effective_io_concurrency.

>> * It mostly ignores effective_io_concurrency, which I think about as
>>   "Keep this number of I/Os in the queue." But we don't try doing that.
>
> As I said, I might just be wrong about "just giving up at some point"
> making sense. I just don't necessarily think it makes sense to go from
> ignoring effective_io_concurrency to *only* caring about
> effective_io_concurrency. It's likely true that keeping
> effective_io_concurrency-many I/Os in flight is the single most
> important thing -- but I doubt it's the only thing that ever matters
> (again, even assuming that there's no LIMIT involved).
>

I'm not saying we should only care about effective_io_concurrency. But
it seems like a reasonable goal to issue the I/Os early, if we're going
to issue them at some point.

>> Attached is an example table/query, found by my script. Without the
>> read_stream patch (i.e. just with the current index prefetching), it
>> looks like this:
>
>> So it's more a case of "mitigating a regression" (finding regressions
>> like this is the purpose of my script). Still, I believe the questions
>> about the distance heuristics are valid.
>>
>> (Another interesting detail is that the regression happens only with
>> io_method=worker, not with io_uring. I'm not sure why.)
>
> I find that the regression happens with io_uring. I also find that
> your patch doesn't fix it. I have no idea why.
>

That's weird. Did you see an increase in the prefetch distance? What
does the EXPLAIN ANALYZE say about that?

regards

--
Tomas Vondra
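For context on the "stuck at distance=2 with 50% misses" behavior, here is a
minimal C sketch of the kind of hit/miss-driven distance heuristic being
discussed. It assumes a decay-by-one-on-hit, double-on-miss rule with a cap
standing in for the effective_io_concurrency-derived limit; it is an
illustration only, not the actual read_stream.c code, and MAX_DISTANCE and
adjust_distance() are hypothetical names.

/*
 * Toy model of the look-ahead distance heuristic discussed above: a
 * sketch, not the actual read_stream.c code.  Assumed rule: the distance
 * decays by one on a buffer hit and doubles on a miss, capped by a
 * maximum standing in for the effective_io_concurrency-derived limit.
 */
#include <stdio.h>

#define MAX_DISTANCE 16			/* hypothetical e_i_c-derived cap */

static int
adjust_distance(int distance, int hit)
{
	if (hit)
		return (distance > 1) ? distance - 1 : 1;	/* decay slowly on hits */

	distance *= 2;				/* ramp up when an I/O was needed */
	return (distance < MAX_DISTANCE) ? distance : MAX_DISTANCE;
}

int
main(void)
{
	int		distance = 1;

	/* Alternate miss/hit, i.e. a 50% miss rate. */
	for (int i = 0; i < 20; i++)
	{
		distance = adjust_distance(distance, i % 2);
		printf("step %2d: %s, distance = %d\n",
			   i, (i % 2) ? "hit " : "miss", distance);
	}

	return 0;
}

Under that assumed rule, an alternating hit/miss pattern just makes the
distance oscillate between 1 and 2, so the cap (and hence
effective_io_concurrency) never comes into play.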