Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 1c9302da-c834-4773-a527-1c1a7029c5a3@vondra.me
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
Responses: Re: index prefetching
List: pgsql-hackers

On 8/26/25 17:06, Tomas Vondra wrote:
>
>
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>>
>>> ...
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced
>> if the workers are idle (and the CPU they get scheduled on is clocked
>> down). The latency impact of that is small, but if you never actually
>> get to do much readahead it can be visible.
>>
>
> Yeah, that's quite possible. If I understand the mechanics of this, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
>
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted process to get picked up by a worker. And % of time a worker
> spent handling I/O.
>

I kept thinking about this, and in the end I decided to try to measure
this IPC overhead. The backend/ioworker communicate by sending signals,
so I wrote a simple C program that does a "signal echo" with two
processes (one fork). It works like this:

1) fork a child process

2) send a signal to the child

3) the child notices the signal and sends a response signal back

4) after receiving the response, go back to (2)

This repeats until the requested number of signals has been sent, and
then the program prints stats like signals/second etc. The C file is
attached; I'm sure it's imperfect, but it does the trick.

And the results mostly agree with the benchmark results from yesterday.
Which makes sense, because once the distance collapses to ~1, AIO with
io_method=worker starts doing about the same thing for every block.

If I run the signal test on the ryzen machine, I get this:

-----------------------------------------------------------------------
root@ryzen:~# ./signal-echo 1000000
nmm_signals = 1000000
parent: sent 100000 signals in 196909 us (1.97)
...
parent: sent 1000000 signals in 1924263 us (1.92 us)
signals / sec = 519679.48
-----------------------------------------------------------------------

So it can do about 500k signals / second. This means that when
requesting blocks one by one (with distance=1), a single worker can do
about 4GB/s, assuming there's no other work (no actual I/O, no checksum
checks, ...).

Consider the warm runs with 512MB shared buffers, which means there's no
actual I/O, but the data still needs to be copied from the page cache
(by the worker). An explain analyze for the query says this:

Buffers: shared hit=2573018 read=455610

That's 455610 blocks to read, mostly one by one. So a bit less than 1
second just for the IPC, but there's also the memcpy etc.

An example result from the benchmark looks like this:

  master:   967ms
  patched: 2353ms

So that's a ~1400ms difference. A bit more than the estimate, but in the
right ballpark, and the extra overhead could be due to AIO being more
complex than sync I/O, etc. Not sure.

The xeon can do ~190k signals/second, i.e. about 1/3 of the ryzen, so
the index scan would spend ~3 seconds on the IPC. Timings for the same
test look like this:

  master:  3049ms
  patched: 9636ms

So that's about 2x the expected difference. Not sure where the extra
overhead comes from; it might be due to NUMA (which the ryzen does not
have).
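(To make the back-of-envelope numbers above easy to check, here is a
throwaway sketch, not part of the original mail; it assumes the default
8kB block size and exactly one signal round trip per block, and ignores
the memcpy and any other per-block work.)

-----------------------------------------------------------------------
#include <stdio.h>

int
main(void)
{
	double	signals_per_sec = 519679.48;	/* measured above on the ryzen */
	double	block_size = 8192;				/* default 8kB block size */
	double	blocks_read = 455610;			/* "shared read" from EXPLAIN */

	/* one block per signal round trip => rough upper bound on throughput */
	printf("max throughput = %.2f GB/s\n",
		   signals_per_sec * block_size / 1e9);

	/* IPC time alone for the warm run, ignoring the memcpy etc. */
	printf("IPC time       = %.2f s\n", blocks_read / signals_per_sec);

	return 0;
}
-----------------------------------------------------------------------

That prints roughly 4 GB/s and ~0.9 s, matching the "about 4GB/s" and
"a bit less than 1 second" figures above.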
So I think the IPC overhead with "worker" can be quite significant,
especially for cases with distance=1. I don't think it's a major issue
for PG18, because seq/bitmap scans are unlikely to collapse the distance
like this, and with larger distances the cost amortizes. It seems to be
a much bigger issue for the index prefetching.

This is for the "warm" runs with 512MB, with the basic prefetch patch.
I'm not sure it explains the overhead with the patches that increase the
prefetch distance (be it mine or Thomas' patch), or the cold runs. The
regressions seem to be smaller in those cases, though.

regards

--
Tomas Vondra
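(The attached C file is not reproduced in the archive text. As a rough
illustration of the scheme described in the message - fork a child, ping
it with a signal, wait for its reply, repeat - a minimal sketch might
look like the following. The choice of SIGUSR1/SIGUSR2, the sigwait()
loop, and the progress reporting are assumptions, not the actual
attachment.)

-----------------------------------------------------------------------
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	long		nsignals = (argc > 1) ? atol(argv[1]) : 1000000;
	sigset_t	set;
	int			sig;
	pid_t		child;
	struct timeval start, now;
	long		elapsed_us;

	/* block both signals before forking, so neither process loses one */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigaddset(&set, SIGUSR2);
	sigprocmask(SIG_BLOCK, &set, NULL);

	child = fork();
	if (child < 0)
	{
		perror("fork");
		return 1;
	}

	if (child == 0)
	{
		/* child: wait for a ping, answer it, forever */
		pid_t		parent = getppid();

		for (;;)
		{
			sigwait(&set, &sig);
			kill(parent, SIGUSR2);
		}
	}

	/* parent: drive the echo loop and measure the round-trip cost */
	gettimeofday(&start, NULL);

	for (long i = 1; i <= nsignals; i++)
	{
		kill(child, SIGUSR1);
		sigwait(&set, &sig);	/* wait for the child's reply */

		if (i % 100000 == 0)
		{
			gettimeofday(&now, NULL);
			elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L
				+ (now.tv_usec - start.tv_usec);
			printf("parent: sent %ld signals in %ld us (%.2f us)\n",
				   i, elapsed_us, (double) elapsed_us / i);
		}
	}

	gettimeofday(&now, NULL);
	elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L
		+ (now.tv_usec - start.tv_usec);
	printf("signals / sec = %.2f\n", nsignals * 1000000.0 / elapsed_us);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}
-----------------------------------------------------------------------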