Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Date:
Msg-id: 293a4735-79a4-499c-9a36-870ee9286281@vondra.me
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On 8/26/25 17:06, Tomas Vondra wrote:
>
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>> Thanks. Based on the testing so far, the patch seems to be a substantial
>>> improvement. What's needed to make this prototype committable?
>>
>> Mainly some testing infrastructure that can trigger this kind of stream.
>> The logic is too finicky for me to commit it without that.
>>
>
> So, what would that look like? The "naive" approach to testing is to
> simply generate a table/index, producing the right sequence of blocks.
> That shouldn't be too hard, it'd be enough to have an index that
>
> - has ~2-3 rows per value, on different heap pages
> - the values "overlap", e.g. like this (value, page):
>
>   (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...
>
> Another approach would be to test this at the C level, sidestepping the
> query execution entirely. We'd have a "stream generator" that just
> generates a sequence of blocks of our own choosing (could be hard-coded,
> some pattern, read from a file ...), and feed it into a read stream.
>
> But how would we measure success for these tests? I don't think we want
> to look at query duration, that's very volatile.
>
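Just to make the "naive" dataset above concrete, here's an untested
sketch of what it could look like (the table/column names and sizes are
made up; the point is that the padding is wide enough that only ~3 rows
fit on a heap page, so each value spans a few pages and neighboring
values overlap in the pages they touch):

  CREATE TABLE prefetch_test (val int, pad text);

  -- keep the padding inline and uncompressed, so the rows stay wide
  ALTER TABLE prefetch_test ALTER COLUMN pad SET STORAGE PLAIN;

  -- row i gets value (i / 3) + (i % 3), so each value ends up with ~3 rows
  -- on consecutive heap pages, and neighboring values share pages
  INSERT INTO prefetch_test
  SELECT (i / 3) + (i % 3), repeat(md5(i::text), 70)
  FROM generate_series(0, 29999) AS i;

  CREATE INDEX ON prefetch_test (val);
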
>>
>>> I assume this is a PG19+ improvement, right? It probably affects PG18
>>> too, but it's harder to hit / the impact is not as bad as on PG19.
>>
>> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
>> where it's a real issue. I can repro a slowdown when using many parallel
>> seqscans with debug_io_direct=data - but that's even slower in 17...
>>
>
> Makes sense.
>
>>
>>> On a related note, my test that generates random datasets / queries, and
>>> compares index prefetching with different io_method values found a
>>> pretty massive difference between worker and io_uring. I wonder if this
>>> might be some issue in io_method=worker.
>>
>>> while with index prefetching (with the aio prototype patch), it looks
>>> like this:
>>>
>>>                               QUERY PLAN
>>> ----------------------------------------------------------------------
>>>  Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>>    Index Cond: ((a >= 16150) AND (a <= 4540437))
>>>    Index Searches: 1
>>>    Prefetch Distance: 2.032
>>>    Prefetch Count: 868165
>>>    Prefetch Stalls: 2140228
>>>    Prefetch Skips: 6039906
>>>    Prefetch Resets: 0
>>>    Stream Ungets: 0
>>>    Stream Forwarded: 4
>>>    Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>>    Buffers: shared hit=2577599 read=455610
>>>  Planning:
>>>    Buffers: shared hit=78 read=26 dirtied=1
>>>  Planning Time: 1.032 ms
>>>  Execution Time: 3150.578 ms
>>> (16 rows)
>>>
>>> So it's about 2x slower. The prefetch distance collapses, because
>>> there's a lot of cache hits (about 50% of requests seem to be hits of
>>> already visited blocks). I think that's a problem with how we adjust
>>> the distance, but I'll post about that separately.
>>>
>>> Let's try to simply set io_method=io_uring:
>>>
>>>                               QUERY PLAN
>>> ----------------------------------------------------------------------
>>>  Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>>    Index Cond: ((a >= 16150) AND (a <= 4540437))
>>>    Index Searches: 1
>>>    Prefetch Distance: 2.032
>>>    Prefetch Count: 868165
>>>    Prefetch Stalls: 2140228
>>>    Prefetch Skips: 6039906
>>>    Prefetch Resets: 0
>>>    Stream Ungets: 0
>>>    Stream Forwarded: 4
>>>    Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>>    Buffers: shared hit=2577599 read=455610
>>>  Planning:
>>>    Buffers: shared hit=78 read=26
>>>  Planning Time: 2.212 ms
>>>  Execution Time: 1837.615 ms
>>> (16 rows)
>>>
>>> That's much closer to master (and the difference could be mostly noise).
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced if
>> the workers are idle (and the CPU they get scheduled on is clocked down).
>> The latency impact of that is small, but if you never actually get to do
>> much readahead it can be visible.
>>
>
> Yeah, that's quite possible. If I understand the mechanics of this, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
>
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted I/O to get picked up by a worker, and what % of time a worker
> spent handling I/O.
>

After investigating this a bit more, I'm not sure it's due to workers
getting idle / CPUs clocked down, etc. I did an experiment with booting
with idle=poll, which AFAICS should prevent cores from idling, and it
made pretty much no difference - timings didn't change. It can still be
about IPC, but it does not seem to be about clocked-down cores or stuff
like that. Maybe.

I ran a more extensive set of tests, varying additional parameters:

- io_method: io_uring / worker (3 or 12 workers)
- shared_buffers: 512MB / 16GB (the table is ~3GB)
- checksums: on / off
- effective_io_concurrency: 16 / 100
- different SSD devices

and comparing master vs. builds with different variants of the patches:

- master
- patched (index prefetching)
- no-explain (EXPLAIN ANALYZE reverted)
- munro / vondra (WIP patches preventing distance collapse)
- munro-no-explain / vondra-no-explain (should be obvious)

We've been speculating (me and Peter) that maybe the extra read_stream
stats add a lot of overhead, hence the "no-explain" builds to test that.
All of this is with the recent "aio" patch eliminating I/O waits.

Attached are results from my "ryzen" machine (the xeon is very similar),
sliced/colored to show the patterns. It's for this query:

  SELECT * FROM (
    SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC
  ) OFFSET 1000000000;

which is the same query as before, except that it's not EXPLAIN ANALYZE,
and it has an OFFSET so that it does not send any data back. It's a bit
of an adversarial query - it doesn't seem to benefit from prefetching.
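(Aside: the generated dataset itself isn't included in this message, so
the following is purely a hypothetical reconstruction of its shape based
on the plans above - a table t of a couple of GB, with an integer column
a whose order doesn't match the heap order, and an index idx on it. The
value range, row count and padding width are guesses; the row count is
picked so that the BETWEEN range above selects roughly the ~9M rows seen
in the plans.)

  -- hypothetical generator, not the actual test data
  CREATE TABLE t AS
  SELECT (random() * 10000000)::int AS a,         -- index order != heap order
         md5(g::text) || md5((g + 1)::text)
                      || md5((g + 2)::text) AS filler  -- width is a guess
  FROM generate_series(1, 20000000) AS g;

  CREATE INDEX idx ON t (a);
  VACUUM ANALYZE t;
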
There are some very clear patterns in the results. In the "cold"
(uncached) runs:

* io_uring does much better, with limited regressions (not negligible,
  but limited compared to io_method=worker). A hint this may really be
  about IPC?

* With worker, there's a massive regression with the basic prefetching
  patch (when the distance collapses to 2.0). But then it mostly recovers
  with the increased distance, and even does a bit better than master
  (or on par with io_uring).

In the "warm" runs (with everything cached in the page cache, possibly
even in shared buffers):

* With 16GB shared buffers, the regressions are about the same as for
  cold runs, both for io_uring and worker. Roughly ~5%, give or take.
  The extra read_stream stats seem to add ~3%.

* With 512MB it's much more complicated. io_uring regresses much more
  (relative to master), for some reason. For cold runs it was ~30%, now
  it's ~50%. Seems weird, but I guess there's a fixed overhead and it's
  more visible with data in cache.

* For worker (with shared_buffers=512MB), the basic patch clearly causes
  a massive regression - it's about 2x slower. I don't really understand
  why. The assumption was that this is because of idling, but is it, if
  it happens with idle=poll? In top, I see the backend takes ~60% and the
  I/O worker ~40% (so they clearly ping-pong the work). 40% utilization
  does not seem particularly low (and with idle=poll it should not idle
  anyway).

  I realize there's IPC with worker, and it's going to be more visible
  for cases that end up doing no prefetching. But isn't a 2x regression
  a bit too high? I wouldn't have expected that. Any good way to measure
  how expensive the IPC is?

* With the increased prefetch distance, the regression drops to ~25%
  (for worker). And in top I see the backend takes ~100% and the single
  worker uses ~60%. But the 25% is without checksums. With checksums,
  the regression is roughly the same ~5%.

I'm not sure what to think about this.


--
Tomas Vondra