Re: index prefetching - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: ec4d8c7b-bbc7-4805-919f-8fdeb11b3d34@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On 7/13/25 01:50, Peter Geoghegan wrote:
> On Thu, May 1, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>> There are two "fix" patches trying to make this work - it does not crash,
>> and almost all the "incorrect" query results are actually stats about
>> buffer hits etc. And that is expected to change with prefetching, not a
>> bug. But then there are a bunch of explains where the number of index
>> scans changed, e.g. like
>>
>> -         Index Searches: 5
>> +         Index Searches: 4
>>
>> And that is almost certainly a bug.
>>
>> I haven't figured this out yet, and I feel a bit lost again :-(
> 
> For the benefit of other people reading this thread: I sent Tomas a
> revised version of this "complex" patch this week, fixing all these
> bugs. It only took me a few hours, and I regret not doing that work
> sooner.
> 
> I also cleaned up nbtree aspects of the "complex" patch considerably.
> The nbtree footprint was massively reduced:
> 
> 17 files changed, 422 insertions(+), 685 deletions(-)
> 
> So there's a net negative nbtree code footprint. We're effectively
> just moving things out of nbtree that are already completely
> index-AM-generic. I think that the amount of code that can be removed
> from nbtree (and other AMs that currently use amgettuple) will be even
> higher if we go this way.
> 

Thank you! I'll take a look next week, but these numbers suggest you
simplified it a lot.

>> The one real limitation of the simpler approach is that prefetching is
>> limited to a single leaf page - we can't prefetch from the next one,
>> until the scan advances to it. But based on experiments comparing this
>> simpler and the "complex" approach, I don't think that really matters
>> that much. I haven't seen any difference for regular queries.
> 
> Did you model/benchmark it?
> 

Yes. I did benchmark the simple and complex versions I had at the time.
But you know how it is with benchmarking - I'm sure it's possible to
pick queries where it'd make a (significant) difference.

For example, making the index tuples "fat" would make the prefetching
less efficient - with fewer tuples per leaf page, the simple approach
runs out of things to prefetch sooner (e.g. an index with wide INCLUDE
columns might leave only a handful of TIDs per leaf).

Another thing is hardware. I've been testing on local NVMe drives, and
those don't seem to need very long queues (the returns diminish
quickly). Maybe the results would be different on systems with more I/O
latency (e.g. because the storage is not local).
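
To put rough numbers on that (back-of-the-envelope, via Little's law,
and assuming the scan is purely latency-bound): sustained read rate is
roughly queue depth / per-read latency. At ~100us per random read on
local NVMe, a queue depth of 8 already gives ~80k reads/s, so a longer
queue buys little. At ~1ms (networked storage, say), you'd need a depth
of ~80 to sustain the same rate - and that's where being able to
prefetch further ahead should start to pay off.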


>> The one case where I think it might matter is queries with array keys,
>> where each array key matches a single tuple on a different leaf page.
>> The complex patch might prefetch tuples for later array values, while
>> the simpler patch won't be able to do that. If an array key matches
>> multiple tuples, the simple patch can prefetch those just fine, of
>> course. I don't know which case is more likely.
> 
> We discussed this in Montreal, but I'd like to respond to this point
> again on list:
> 
> I don't think that array keys are in any way relevant to the design of
> this patch. Nothing I've said about this project has anything to do
> with array keys, except when I was concerned about specific bugs in
> the patch. (Bugs that I've now fixed in a way that is wholly confined
> to nbtree.)
> 
> The overarching goal of my work on nbtree array scans was to make them
> work just like other scans to the maximum extent possible. Array scans
> "where each array key matches a single tuple on a different leaf page"
> are virtually identical to any other scan that'll return only one or
> two tuples from each neighboring page. You could see a similar pattern
> with literally any kind of key.
> 
> Again, what I'm concerned about is coming up with a design that gives
> scans maximum freedom to reorder work (not necessarily in the first
> committed version), so that we can keep the read stream busy by giving
> it sufficiently many heap pages to read: a truly adaptive design, that
> weighs all relevant costs. Sometimes that'll necessitate eagerly
> reading leaf pages. There is nothing fundamentally complicated about
> that idea. Nothing in index AMs cares about how or when heap accesses
> take place.
> 
> Again, it just *makes sense* to centralize the code that controls the
> progress of ordered/amgettuple scans. Every affected index AM is
> already doing virtually the same thing as each other. They're all
> following the rules around index locking/pinning for amgettuple [1].
> Individual index AMs are *already* required to read leaf pages a
> certain way, in a certain order *relative to the heap accesses*. All
> for the benefit of scan correctness (to avoid breaking things in a way
> that relates to heapam implementation details).
> 
> Why wouldn't we want to relieve all AMs of that responsibility?
> Leaving it up to index AMs has resulted in subtle bugs [2][3], and
> AFAICT has no redeeming quality. If affected index AMs were *forced*
> to do *exactly* the same thing as each other (not just *obliged*
> do *almost* the same thing), it would make life easier for everybody.
> 
> [1] https://www.postgresql.org/docs/current/index-locking.html
> [2] https://commitfest.postgresql.org/patch/5721/
> [3] https://commitfest.postgresql.org/patch/5542/

Thanks.

I don't remember the array key details; I'll need to swap the context
back in. But I think the thing I've been concerned about the most is the
coordination between advancing to the next leaf page and advancing to
the next array key (and then perhaps having to go back when the scan
direction changes). A rough sketch of the decision I mean is below.
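
To illustrate, here's a sketch of that decision point. All the function
names are made up for illustration - this is not code from either patch:

    /*
     * Hypothetical sketch (names invented, not from either patch):
     * at the end of a leaf page the scan must decide whether to keep
     * the current array key and step right, or advance the array key
     * and re-descend the tree (a new "index search").
     */
    while (have_more_tuples(scan))
    {
        if (reached_end_of_leaf_page(scan))
        {
            if (current_key_may_match_on_next_page(scan))
                step_to_next_leaf_page(scan);       /* same array key */
            else if (advance_to_next_array_key(scan))
                restart_descent_for_new_key(scan);  /* re-descend */
            else
                break;                              /* scan exhausted */
        }

        /*
         * A direction change (e.g. FETCH BACKWARD on a cursor) has to
         * undo both decisions in sync: step back to the previous leaf
         * page *and* back to the previous array key.
         */
        process_next_tuple(scan);
    }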


regards

-- 
Tomas Vondra



