Re: index prefetching - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: index prefetching
Date:
Msg-id: bpdeohyqvltb77viyft4bza4xc4peed3jcoep74d2ih6ynqlke@wbnhcwmq3ril
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
Responses: Re: index prefetching
List: pgsql-hackers
Hi,

I spent a fair bit more time analyzing this issue.

On 2025-08-28 21:10:48 -0400, Andres Freund wrote:
> On 2025-08-28 19:57:17 -0400, Peter Geoghegan wrote:
> > On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> > I'm not sure that Thomas'/your patch to ameliorate the problem on the
> > read stream side is essential here. Perhaps Andres can just take a
> > look at the test case + feature branch, without the extra patches.
> > That way he'll be able to see whatever the immediate problem is, which
> > might be all we need.
>
> It seems caused to a significant degree by waiting at low queue depths. If I
> comment out the stream->distance-- in read_stream_start_pending_read() the
> regression is reduced greatly.
>
> As far as I can tell, after that the process is CPU bound, i.e. IO waits don't
> play a role.

Indeed, the actual AIO subsystem is unrelated, from what I can tell: I hacked up read_stream.c/bufmgr.c to do readahead even if the buffer is in shared_buffers. With that, the negative performance impact of enable_indexscan_prefetch=1 is of a similar magnitude even if the table is already entirely in shared buffers. I.e. actual IO is unrelated.

I compared perf stat -ddd output for enable_indexscan_prefetch=0 with enable_indexscan_prefetch=1. The only real difference is a substantial (~3x) increase in branch misses. I then took a perf profile to see where all those misses come from. The first source is:

> I see a variety of reasons for increased CPU usage:
>
> 1) The private ref count infrastructure in bufmgr.c gets a bit slower once
> more buffers are pinned

The problem mainly seems to be that the branches in the loop at the start of GetPrivateRefCountEntry() are entirely unpredictable in this workload. I had an old patch that tried to make it possible to use SIMD for the search, by using a separate array for the Buffer ids - with that, gcc generates fairly crappy code, but it does make the search branchless. Here that substantially reduces the overhead of doing prefetching. Afterwards it's not a meaningful source of misses anymore. (A rough sketch of the idea is at the end of this mail.)

> 3) same issue with the resowner tracking

This one is much harder to address:

a) The "key" we are searching for is much wider (16 bytes), making
   vectorization of the search less helpful

b) Because we search up to owner->narr instead of a fixed length, the
   compiler wouldn't be able to auto-vectorize anyway

c) The branch misses are partially caused by ResourceOwnerForget()
   "scrambling" the order in the array when forgetting an element
   (a sketch of that is at the end of this mail as well)

I don't know how to fix this right now. I nevertheless wanted to see how big the impact is, so I just neutered ResourceOwner{Remember,Forget}{Buffer,BufferIO} - that's obviously not correct, but it suffices to show that the performance difference shrinks substantially. But not completely, unfortunately.

> But there's some additional difference in performance I don't yet
> understand...

I still don't think I fully understand why the impact of this is so large. The branch misses appear to be the only thing differentiating the two cases, but with resowners neutralized, the remaining difference in branch misses seems too large - it's not like the sequence of block numbers is more predictable without prefetching... The main increase in branch misses is in index_scan_stream_read_next...

Greetings,

Andres Freund
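
PS: For readers not following along in the code, here is a rough sketch of the look-ahead heuristic that the stream->distance-- quoted above is part of, as I understand it. This is a simplified stand-in, not the actual read_stream.c logic, and the struct/function names are made up:

#include <stdbool.h>

/*
 * Simplified model of the read stream look-ahead heuristic: buffers that
 * turn out to be cached decay the look-ahead distance, buffers that need
 * IO ramp it back up. On a mostly cached workload the distance therefore
 * collapses to a very low queue depth - the stream pays the per-buffer
 * bookkeeping cost without getting much IO overlap in return.
 */
typedef struct SketchReadStream
{
    int distance;       /* current look-ahead distance, in blocks */
    int max_distance;   /* upper bound on the look-ahead distance */
} SketchReadStream;

static void
sketch_adjust_distance(SketchReadStream *stream, bool buffer_was_cached)
{
    if (buffer_was_cached)
    {
        /* no IO was necessary: decay - this is the stream->distance-- above */
        if (stream->distance > 1)
            stream->distance--;
    }
    else
    {
        /* IO was necessary: double the look-ahead distance, up to the cap */
        stream->distance *= 2;
        if (stream->distance > stream->max_distance)
            stream->distance = stream->max_distance;
    }
}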
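
Similarly, a rough sketch of the "separate array for the Buffer ids" idea for GetPrivateRefCountEntry() - not the actual old patch, just an illustration of why a fixed-length scan without an early exit avoids the unpredictable branches (the constants mirror bufmgr.c/buf.h as far as I remember; the function name is made up):

typedef int Buffer;

#define REFCOUNT_ARRAY_ENTRIES 8

/* buffer ids kept in their own densely packed array, separate from the entries */
static Buffer PrivateRefCountIds[REFCOUNT_ARRAY_ENTRIES];

static int
sketch_find_refcount_slot(Buffer buffer)
{
    int match = -1;

    /*
     * Fixed trip count and no early exit: the compiler can turn this into
     * conditional moves (or SIMD compares) instead of a data-dependent
     * branch per element, so an unpredictable hit position doesn't cost
     * a branch mispredict.
     */
    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
        match = (PrivateRefCountIds[i] == buffer) ? i : match;

    return match;   /* -1 means: fall back to the overflow hash table */
}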
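
And finally, a sketch of point c) above - why forgetting an element "scrambles" the resowner array. Again a simplified stand-in rather than the resowner.c code; the 16-byte element just mirrors the value + kind pair mentioned above, and all names are made up:

#include <stdint.h>

/*
 * Swap-with-last removal keeps the array dense and removal O(1), but it
 * destroys insertion order. Later linear searches therefore find their
 * match at essentially random positions, which is exactly the kind of
 * pattern a branch predictor cannot learn.
 */
typedef struct SketchResourceElem
{
    uintptr_t   value;  /* the resource itself, e.g. a buffer id */
    const void *kind;   /* which kind of resource this is */
} SketchResourceElem;   /* 16 bytes on 64-bit platforms */

static void
sketch_forget(SketchResourceElem *arr, uint32_t *narr, uint32_t idx)
{
    /* move the last element into the hole; order is no longer insertion order */
    arr[idx] = arr[*narr - 1];
    (*narr)--;
}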