Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id 8f5d66cf-44e9-40e0-8349-d5590ba8efb4@vondra.me
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On 8/20/25 00:27, Peter Geoghegan wrote:
> On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> That definitely seems like a problem. I think that you're saying that
>> this problem happens because we have extra buffer hits earlier on,
>> which is enough to completely change the ramp-up behavior. This seems
>> to be all it takes to dramatically decrease the effectiveness of
>> prefetching. Does that summary sound correct?
> 

That summary is correct, yes. I kept thinking about this, while looking
at more regressions found by my script (that generates data sets with
different data distributions, etc.).

Almost all regressions (at least the top ones) now look like this, i.e.
the distance collapses to ~2.0, which essentially disables prefetching.

But I no longer think it's caused by the "priorbatch" optimization,
which delays read stream creation until after the first batch. I still
think we may need to rethink that (e.g. if the first batch is huge), but
the distance can "collapse" even without it. The optimization just makes
the collapse more likely to happen.

AFAICS the distance collapse is "inherent" to how the distance gets
increased/decreased after hits/misses.

Let's start with distance=1, and let's assume 50% of buffers are hits,
in a regular pattern - hit-miss-hit-miss-hit-miss-...

In this case, the distance will never increase beyond 2, because we'll
double-decrement-double-decrement-... so it'll flip between 1 and 2, no
matter how you set effective_io_concurrency.

Of course, this can happen even with other hit ratios, there's nothing
special about 50%.
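To double-check, this is easy to simulate with a toy model of the
heuristic (a Python sketch of my reading of the behavior - miss doubles
the distance, hit decrements it by 1 with a floor of 1 - not the actual
read_stream.c code):

```python
# Toy model of the current distance heuristic, as described above:
# a miss doubles the distance (up to a cap standing in for
# effective_io_concurrency), a hit decrements it by 1 (floor 1).

def simulate(pattern, max_distance, start=1):
    """Return the distance after each request; True = hit, False = miss."""
    distance = start
    history = []
    for is_hit in pattern:
        if is_hit:
            distance = max(1, distance - 1)
        else:
            distance = min(max_distance, distance * 2)
        history.append(distance)
    return history

# Strict hit-miss alternation: the distance just flips between 1 and 2,
# no matter how high the cap is set.
hist = simulate([True, False] * 50, max_distance=64)
print(max(hist))
```

Even with max_distance=64 the distance never exceeds 2.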

With fewer hits, it's fine - there's asymmetry, because the distance
grows by doubling and decreases by decrementing 1. So once we have a bit
more misses, it keeps growing.

But with more hits, the hit/miss ratio simply determines the "stable"
distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the
stable distance is ~4, because we get a miss, double to 8, and then 4
hits, so the distance drops back to 4. And again.

Similarly for other hit/miss ratios (it's easier to reason about if you
keep the number of hits a power of two).
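The 80% case can be checked with the same toy model (again a sketch of
my reading of the heuristic, not the real code). Starting at the fixed
point, the miss doubles 4 to 8 and the four hits walk it back down to 4:

```python
# Same toy heuristic: miss doubles the distance (capped), hit
# decrements by 1 (floor 1). Pattern: one miss followed by four hits,
# i.e. an 80% hit ratio, starting at the stable distance of 4.

def simulate(pattern, max_distance, start):
    distance = start
    history = []
    for is_hit in pattern:
        if is_hit:
            distance = max(1, distance - 1)
        else:
            distance = min(max_distance, distance * 2)
        history.append(distance)
    return history

hist = simulate(([False] + [True] * 4) * 20, max_distance=64, start=4)
print(sorted(set(hist)))
```

The distance stays in [4, 8] forever; the cap of 64 is never reached.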

It's worth noting that effective_io_concurrency has almost no impact on
the distance we end up with; it merely caps the maximum distance.

I find this distance heuristic a bit strange, for a couple of reasons:

* It doesn't seem right to get stuck at distance=2 with 50% misses.
Surely that would benefit from prefetching a bit more?

* It mostly ignores effective_io_concurrency, which I think of as "keep
this number of I/Os in the queue". But we don't actually try to do that.

I understand the current heuristic is trying not to prefetch for cached
data sets, but does that actually make sense? With fadvise it made
sense, because the prefetched data could get evicted if we prefetched
too far ahead. But with worker/io_uring the buffers get pinned, so this
shouldn't happen. Of course, that doesn't mean we should prefetch
arbitrarily far ahead - there are LIMIT queries, the limit on buffer
pins, etc.

What about if the distance heuristics asks this question:

  How far do we need to look to generate effective_io_concurrency IOs?

The attached patch is a PoC implementing this. The core idea is that if
we measure "miss probability" for a chunk of requests, we can use that
to estimate the distance needed to generate e_i_c IOs.

So while the current heuristic looks at individual hits/misses, the
patch looks at groups of requests.
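The core calculation might look roughly like this (a hypothetical
sketch - the function name and details are mine, not the actual PoC
patch): if a fraction p of requests in a chunk were misses, then
generating effective_io_concurrency I/Os requires looking about
effective_io_concurrency / p requests ahead.

```python
# Hypothetical sketch of the idea: derive the lookahead distance from
# the measured miss probability of a recent chunk of requests, so the
# window is expected to contain effective_io_concurrency misses.

def estimate_distance(hits, misses, eff_io_concurrency, max_distance):
    total = hits + misses
    if misses == 0 or total == 0:
        return 1  # fully cached chunk: no reason to look far ahead
    miss_prob = misses / total
    # Look far enough ahead to generate eff_io_concurrency I/Os on average.
    distance = eff_io_concurrency / miss_prob
    return min(max_distance, max(1, round(distance)))

# 50% misses with effective_io_concurrency=16: look ~32 requests ahead,
# instead of being stuck at distance 2 like the hit/miss heuristic.
print(estimate_distance(hits=50, misses=50, eff_io_concurrency=16,
                        max_distance=256))
```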

The other idea is that the patch maintains a "distance range", with
min/max of allowed distances. The min/max values gradually grow after a
miss, the "min" value "stops" at max_ios, while "max" grows further.

This ensures gradual ramp up, helping LIMIT queries etc.

And even if there are a lot of hits, the distance is not allowed to drop
below the current "min". Because what would be the benefit of that?

- If the read is a hit, we might read it later - but the cost is about
the same, we're not really saving much by delaying the read.

- If the read is a miss, it's clearly better to issue the I/O sooner.

This may not be true if it's a LIMIT query, and it terminates early. But
if the distance_min is not too high, this should be negligible.
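The distance-range mechanism might be sketched like this (field names
are my guesses, not the patch's actual code): both bounds grow after a
miss, the lower bound saturates at max_ios while the upper bound keeps
growing, and the working distance is always clamped into the range, so
a run of hits can no longer collapse it back to 1.

```python
# Rough sketch of the "distance range" idea: min/max bounds grow
# gradually after a miss; min stops at max_ios, max grows further
# (up to a hard cap, e.g. from buffer pin limits). The distance is
# clamped into [dist_min, dist_max].

class DistanceRange:
    def __init__(self, max_ios, hard_max):
        self.max_ios = max_ios      # e.g. effective_io_concurrency
        self.hard_max = hard_max    # absolute cap on the distance
        self.dist_min = 1
        self.dist_max = 2

    def on_miss(self):
        self.dist_min = min(self.max_ios, self.dist_min * 2)
        self.dist_max = min(self.hard_max, self.dist_max * 2)

    def clamp(self, distance):
        return max(self.dist_min, min(self.dist_max, distance))

r = DistanceRange(max_ios=16, hard_max=256)
for _ in range(6):          # six misses: min saturates at max_ios
    r.on_miss()
print(r.dist_min, r.dist_max)   # min stuck at 16, max still growing
print(r.clamp(1))               # hits can't drag the distance below min
```

This keeps the ramp-up gradual (good for LIMIT queries) while fixing
the collapse: after enough misses, the distance can't drop below
max_ios regardless of the hit ratio.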

Attached is an example table/query, found by my script. Without the
read_stream patch (i.e. just with the current index prefetching), it
looks like this:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 2.032
     Prefetch Count: 868165
     Prefetch Stalls: 2140228
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 0
     Stream Forwarded: 4
     Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26 dirtied=1
   Planning Time: 1.032 ms
   Execution Time: 3150.578 ms
  (16 rows)

and with the attached patch:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t  (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Prefetch Distance: 36.321
     Prefetch Count: 3730750
     Prefetch Stalls: 3
     Prefetch Skips: 6039906
     Prefetch Resets: 0
     Stream Ungets: 722353
     Stream Forwarded: 305265
     Prefetch Histogram: [2,4) => 10, [4,8) => 11, [8,16) => 6,
                         [16,32) => 316890, [32,64) => 3413833
     Buffers: shared hit=2574776 read=455610
   Planning:
     Buffers: shared hit=78 read=26
   Planning Time: 2.249 ms
   Execution Time: 1651.826 ms
  (16 rows)

The example is not entirely perfect, because the index prefetching does
not actually beat master:

                                QUERY PLAN
  ----------------------------------------------------------------------
   Index Scan using idx on t   (actual rows=9048576.00 loops=1)
     Index Cond: ((a >= 16150) AND (a <= 4540437))
     Index Searches: 1
     Buffers: shared hit=2577599 read=455610
   Planning:
     Buffers: shared hit=78 read=26
   Planning Time: 3.688 ms
   Execution Time: 1656.790 ms
  (8 rows)

So it's more a case of "mitigating a regression" (finding regressions
like this is the purpose of my script). Still, I believe the questions
about the distance heuristics are valid.

(Another interesting detail is that the regression happens only with
io_method=worker, not with io_uring. I'm not sure why.)


regards

-- 
Tomas Vondra
