Re: index prefetching - Mailing list pgsql-hackers

From:           Tomas Vondra
Subject:        Re: index prefetching
Msg-id:         8f5d66cf-44e9-40e0-8349-d5590ba8efb4@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List:           pgsql-hackers
On 8/20/25 00:27, Peter Geoghegan wrote:
> On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> That definitely seems like a problem. I think that you're saying that
>> this problem happens because we have extra buffer hits earlier on,
>> which is enough to completely change the ramp-up behavior. This seems
>> to be all it takes to dramatically decrease the effectiveness of
>> prefetching. Does that summary sound correct?
>
> That summary is correct, yes.

I kept thinking about this while looking at more regressions found by my
script (which generates data sets with different data distributions,
etc.). Almost all regressions (at least the top ones) now look like
this, i.e. the distance collapses to ~2.0, which essentially disables
prefetching.

But I no longer think it's caused by the "priorbatch" optimization,
which delays read stream creation until after the first batch. I still
think we may need to rethink that (e.g. if the first batch is huge), but
the distance can "collapse" even without it. The optimization just makes
it easier to happen.

AFAICS the distance collapse is "inherent" to how the distance gets
increased/decreased after hits/misses.

Let's start with distance=1, and let's assume 50% of buffers are hits,
in a regular pattern: hit-miss-hit-miss-hit-miss-... In this case, the
distance will never increase beyond 2, because we'll
double-decrement-double-decrement-... so it'll flip between 1 and 2, no
matter how you set effective_io_concurrency. Of course, this can happen
even with other hit ratios; there's nothing special about 50%.

With fewer hits it's fine - there's asymmetry, because the distance
grows by doubling and decreases by decrementing 1. So once we have a bit
more misses, it keeps growing. But with more hits, the hit/miss ratio
simply determines the "stable" distance. Let's say there's 80% hits, so
4 hits to 1 miss. Then the stable distance is ~4, because we get a miss,
double to 8, and then 4 hits, so the distance drops back to 4.
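To make the collapse concrete, here's a toy model of the
double-on-miss / decrement-on-hit behavior (my simplification for
illustration, not the actual read_stream.c logic):

```python
# Toy model: on a miss the lookahead distance doubles (capped at
# max_distance), on a hit it decreases by 1 (floored at 1). This is a
# simplification for illustration, not the actual read_stream.c code.
def simulate(hits, start=1, max_distance=16):
    d = start
    trace = []
    for hit in hits:
        d = max(d - 1, 1) if hit else min(d * 2, max_distance)
        trace.append(d)
    return trace

# Strict hit-miss-hit-miss pattern (50% hits), starting at distance=1:
# the distance just flips between 1 and 2, no matter how large
# max_distance (i.e. effective_io_concurrency) is.
trace = simulate([True, False] * 5000, start=1, max_distance=256)
print(max(trace))   # -> 2
```

Raising max_distance in this model changes nothing for the alternating
pattern; only a run of consecutive misses lets the doubling outpace the
decrements.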
And so on. Similarly for other hit/miss ratios (it's easier to think
about if you keep the number of hits 2^n). It's worth noticing that
effective_io_concurrency has almost no impact on what distance we end up
with; it merely limits the maximum distance.

I find this distance heuristic a bit strange, for a couple of reasons:

* It doesn't seem right to get stuck at distance=2 with 50% misses.
  Surely that would benefit from prefetching a bit more?

* It mostly ignores effective_io_concurrency, which I think of as "keep
  this number of I/Os in the queue". But we don't try doing that.

I understand the current heuristic is trying not to prefetch for cached
data sets, but does that actually make sense? With fadvise it made
sense, because the prefetched data could get evicted if we prefetched
too far ahead. But with worker/io_uring the buffers get pinned, so this
shouldn't happen. Of course, that doesn't mean we should prefetch too
far ahead - there are LIMIT queries, the limit on buffer pins, etc.

What if the distance heuristic asked this question instead:

    How far do we need to look ahead to generate
    effective_io_concurrency I/Os?

The attached patch is a PoC implementing this. The core idea is that if
we measure the "miss probability" for a chunk of requests, we can use
that to estimate the distance needed to generate e_i_c I/Os. So while
the current heuristic looks at individual hits/misses, the patch looks
at groups of requests.

The other idea is that the patch maintains a "distance range", with
min/max of allowed distances. The min/max values gradually grow after a
miss; the "min" value stops at max_ios, while "max" grows further. This
ensures a gradual ramp-up, helping LIMIT queries etc. And even if there
are a lot of hits, the distance is not allowed to drop below the current
"min". Because what would be the benefit of that?

- If the read is a hit, we might read it later - but the cost is about
  the same, we're not really saving much by delaying the read.
- If the read is a miss, it's clearly better to issue the I/O sooner.
  This may not be true if it's a LIMIT query and it terminates early,
  but if the distance_min is not too high, this should be negligible.

Attached is an example table/query, found by my script. Without the
read_stream patch (i.e. just with the current index prefetching), it
looks like this:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 2.032
   Prefetch Count: 868165
   Prefetch Stalls: 2140228
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 0
   Stream Forwarded: 4
   Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26 dirtied=1
 Planning Time: 1.032 ms
 Execution Time: 3150.578 ms
(16 rows)

and with the attached patch:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Prefetch Distance: 36.321
   Prefetch Count: 3730750
   Prefetch Stalls: 3
   Prefetch Skips: 6039906
   Prefetch Resets: 0
   Stream Ungets: 722353
   Stream Forwarded: 305265
   Prefetch Histogram: [2,4) => 10, [4,8) => 11, [8,16) => 6,
                       [16,32) => 316890, [32,64) => 3413833
   Buffers: shared hit=2574776 read=455610
 Planning:
   Buffers: shared hit=78 read=26
 Planning Time: 2.249 ms
 Execution Time: 1651.826 ms
(16 rows)

The example is not entirely perfect, because the index prefetching does
not actually beat master:

                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx on t (actual rows=9048576.00 loops=1)
   Index Cond: ((a >= 16150) AND (a <= 4540437))
   Index Searches: 1
   Buffers: shared hit=2577599 read=455610
 Planning:
   Buffers: shared hit=78 read=26
 Planning Time: 3.688 ms
 Execution Time: 1656.790 ms
(8 rows)

So it's more a case of "mitigating a regression" (finding regressions
like this is the purpose of my script). Still, I believe the questions
about the distance heuristics are valid.

(Another interesting detail is that the regression happens only with
io_method=worker, not with io_uring. I'm not sure why.)

regards

--
Tomas Vondra
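For what it's worth, the miss-probability idea described above can be
sketched as a toy model (the names, window handling, and clamping
details here are invented for illustration; the actual logic is in the
attached PoC patch):

```python
# Toy sketch of the PoC idea (invented names, not the actual patch):
# measure the miss probability over a chunk of recent requests, then
# pick the distance so that the lookahead window is expected to contain
# ~eic I/Os, clamped to a [dist_min, dist_max] range.
def estimate_distance(recent_hits, eic, dist_min, dist_max):
    misses = sum(1 for h in recent_hits if not h)
    miss_prob = misses / len(recent_hits)
    if miss_prob == 0:
        target = dist_min             # fully cached: stay at the floor
    else:
        target = eic / miss_prob      # lookahead for ~eic in-flight I/Os
    return max(dist_min, min(dist_max, int(target)))

# 50% misses with eic=16: look ~32 requests ahead, instead of getting
# stuck at distance ~2 like the per-request double/decrement heuristic.
print(estimate_distance([True, False] * 50, eic=16, dist_min=4, dist_max=128))  # -> 32
```

In this model the hit/miss ratio scales the distance up (more hits mean
looking further ahead for the same I/O depth), which is the opposite of
the collapse behavior described earlier in the thread.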
Attachment