
From Tomas Vondra
Subject Re: index prefetching
Msg-id 293a4735-79a4-499c-9a36-870ee9286281@vondra.me
In response to Re: index prefetching  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers

On 8/26/25 17:06, Tomas Vondra wrote:
> 
> 
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>> Thanks. Based on the testing so far, the patch seems to be a substantial
>>> improvement. What's needed to make this prototype committable?
>>
>> Mainly some testing infrastructure that can trigger this kind of stream. The
>> logic is too finicky for me to commit it without that.
>>
> 
> So, what would that look like? The "naive" approach to testing is to
> simply generate a table/index, producing the right sequence of blocks.
> That shouldn't be too hard, it'd be enough to have an index that
> 
> - has ~2-3 rows per value, on different heap pages
> - the values "overlap", e.g. like this (value,page)
> 
>    (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...
> 
> Another approach would be to test this at C level, sidestepping the
> query execution entirely. We'd have a "stream generator" that just
> generates a sequence of blocks of our own choosing (could be hard-coded,
> some pattern, read from a file ...), and feed it into a read stream.
> 
> But how would we measure success for these tests? I don't think we want
> to look at query duration, that's very volatile.
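
FWIW generating that kind of data set shouldn't be hard. A minimal sketch
(the table/column names, row counts and padding are just illustrative; the
low fillfactor is only there to spread the rows over more heap pages):

    -- ~3 rows per value, inserted in an order that makes consecutive values
    -- overlap in the heap pages they touch, i.e. roughly the
    -- (A,1) (A,2) (A,3) (B,2) (B,3) (B,4) pattern above
    CREATE TABLE t (a int, pad text) WITH (fillfactor = 10);

    INSERT INTO t
    SELECT v, repeat('x', 200)
      FROM generate_series(1, 100000) AS v,
           generate_series(0, 2) AS k
     ORDER BY (v + k), k;

    CREATE INDEX idx ON t (a);
    VACUUM ANALYZE t;

That still leaves the question of how to verify the read stream actually
saw the expected sequence of blocks, of course.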
> 
>>
>>> I assume this is a PG19+ improvement, right? It probably affects PG18 too,
>>> but it's harder to hit / the impact is not as bad as on PG19.
>>
>> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
>> where it's a real issue. I can repro a slowdown when using many parallel
>> seqscans with debug_io_direct=data - but that's even slower in 17...
>>
> 
> Makes sense.
> 
>>
>>> On a related note, my test that generates random datasets / queries, and
>>> compares index prefetching with different io_method values found a
>>> pretty massive difference between worker and io_uring. I wonder if this
>>> might be some issue in io_method=worker.
>>
>>> while with index prefetching (with the aio prototype patch), it looks
>>> like this:
>>>
>>>                                 QUERY PLAN
>>>   ----------------------------------------------------------------------
>>>    Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>>      Index Searches: 1
>>>      Prefetch Distance: 2.032
>>>      Prefetch Count: 868165
>>>      Prefetch Stalls: 2140228
>>>      Prefetch Skips: 6039906
>>>      Prefetch Resets: 0
>>>      Stream Ungets: 0
>>>      Stream Forwarded: 4
>>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>>      Buffers: shared hit=2577599 read=455610
>>>    Planning:
>>>      Buffers: shared hit=78 read=26 dirtied=1
>>>    Planning Time: 1.032 ms
>>>    Execution Time: 3150.578 ms
>>>   (16 rows)
>>>
>>> So it's about 2x slower. The prefetch distance collapses, because
>>> there's a lot of cache hits (about 50% of requests seem to be hits of
>>> already visited blocks). I think that's a problem with how we adjust the
>>> distance, but I'll post about that separately.
>>>
>>> Let's try to simply set io_method=io_uring:
>>>
>>>                                 QUERY PLAN
>>>   ----------------------------------------------------------------------
>>>    Index Scan using idx on t  (actual rows=9048576.00 loops=1)
>>>      Index Cond: ((a >= 16150) AND (a <= 4540437))
>>>      Index Searches: 1
>>>      Prefetch Distance: 2.032
>>>      Prefetch Count: 868165
>>>      Prefetch Stalls: 2140228
>>>      Prefetch Skips: 6039906
>>>      Prefetch Resets: 0
>>>      Stream Ungets: 0
>>>      Stream Forwarded: 4
>>>      Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>>      Buffers: shared hit=2577599 read=455610
>>>    Planning:
>>>      Buffers: shared hit=78 read=26
>>>    Planning Time: 2.212 ms
>>>    Execution Time: 1837.615 ms
>>>   (16 rows)
>>>
>>> That's much closer to master (and the difference could be mostly noise).
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced if the
>> workers are idle (and the CPU they get scheduled on is clocked down). The
>> latency impact of that is small, but if you never actually get to do much
>> readahead it can be visible.
>>
> 
> Yeah, that's quite possible. If I understand the mechanics correctly, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
> 
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted process to get picked up by a worker. And % of time a worker
> spent handling I/O.
> 

After investigating this a bit more, I'm not sure it's due to the workers
getting idle / the CPU clocking down, etc. I did an experiment with booting
with idle=poll, which AFAICS should prevent cores from idling.

And it made pretty much no difference - the timings didn't change. It may
still be about IPC, but it does not seem to be about clocked-down cores or
anything like that. Maybe.

I ran a more extensive set of tests, varying additional parameters:

- io_method: io_uring / worker (3 or 12 workers)
- shared_buffers: 512MB / 16GB (the table is ~3GB)
- checksums: on / off
- eic (effective_io_concurrency): 16 / 100
- different SSD devices

and comparing master vs. builds with different variants of the patches:

- master
- patched (index prefetching)
- no-explain (the extra EXPLAIN ANALYZE read_stream stats reverted)
- munro / vondra (WIP patches preventing distance collapse)
- munro-no-explain / vondra-no-explain (should be obvious)

Peter and I have been speculating that maybe the extra read_stream stats
add a lot of overhead, hence the "no-explain" builds to test that. All of
this is with the recent "aio" patch eliminating I/O waits.
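
In GUC terms, that settings matrix maps to roughly the following (io_method
and shared_buffers only take effect after a restart; checksums are an
initdb / pg_checksums decision):

    ALTER SYSTEM SET io_method = 'worker';             -- or 'io_uring'
    ALTER SYSTEM SET io_workers = 3;                    -- or 12
    ALTER SYSTEM SET shared_buffers = '512MB';          -- or '16GB'
    ALTER SYSTEM SET effective_io_concurrency = 16;     -- or 100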

Attached are results from my "ryzen" machine (xeon is very similar),
sliced/colored to show the patterns. It's for this query:

    SELECT * FROM (
        SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437
        ORDER BY a ASC
    ) OFFSET 1000000000;

It's the same query as before, except that it's not run under EXPLAIN
ANALYZE, and it has an OFFSET so that it does not send any data back. It's
a bit of an adversarial query - it doesn't seem to benefit from prefetching.

There are some very clear patterns in the results.

In the "cold" (uncached) runs:

* io_uring does much better, with limited regressions (not negligible,
but limited compared to io_method=worker). A hint this may really be
about IPC?

* With worker, there's a massive regression with the basic prefetching
patch (when the distance collapses to 2.0). But then it mostly recovers
with the increased distance, and even does a bit better than master (or
on par with io_uring).

In the "warm" runs (with everything cached in page cache, possibly even
in shared buffers):

* With 16GB shared buffers, the regressions are about the same as for
cold runs, both for io_uring and worker. Roughly 5%, give or take. The
extra read_stream stats seem to add ~3%.

* With 512MB it's much more complicated. io_uring regresses much more
(relative to master), for some reason. For cold runs it was ~30%, now
it's ~50%. Seems weird, but I guess there's fixed overhead and it's more
visible with data in cache.

* For worker (with shared_buffers=512MB), the basic patch clearly causes a
massive regression - it's about 2x slower. I don't really understand why;
the assumption was that this is because of idling, but can that be the
case if it happens even with idle=poll?

In top, I see the backend takes ~60%, and the io worker ~40% (so they
clearly ping-pong the work). 40% utilization does not seem particularly
low (and with idle=poll it should not idle anyway).

I realize there's IPC overhead with worker, and it's going to be more
visible for cases that end up doing no prefetching. But isn't a 2x
regression a bit too high? I wouldn't have expected that. Is there a good
way to measure how expensive the IPC is? (A crude sampling idea is
sketched below.)

* With the increased prefetch distance, the regression drops to ~25% (for
worker). And in top I see the backend taking ~100%, and the single worker
using ~60%. But the 25% is without checksums. With checksums, the
regression is roughly 5%.
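
As for the IPC question above, one crude thing to try would be sampling
from another session, assuming the io workers show up in pg_stat_activity
with backend_type 'io worker' (e.g. run this under \watch in psql):

    -- sample what the io workers are doing while the query runs
    SELECT pid, backend_type, wait_event_type, wait_event
      FROM pg_stat_activity
     WHERE backend_type = 'io worker';

That would at least show how much of the time the workers spend waiting
vs. handling I/O, but it says nothing about the per-request handoff latency.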

I'm not sure what to think about this.

-- 
Tomas Vondra

