Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 1c9302da-c834-4773-a527-1c1a7029c5a3@vondra.me
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
Responses: Re: index prefetching
List: pgsql-hackers

On 8/26/25 17:06, Tomas Vondra wrote:
>
>
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>>
>>> ...
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced
>> if the workers are idle (and the CPU they get scheduled on is clocked
>> down). The latency impact of that is small, but if you never actually
>> get to do much readahead it can be visible.
>>
>
> Yeah, that's quite possible. If I understand the mechanics of this, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
>
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted process to get picked up by a worker. And % of time a worker
> spent handling I/O.
>

I kept thinking about this, and in the end I decided to try to measure
this IPC overhead. The backend/ioworker communicate by sending signals,
so I wrote a simple C program that does a "signal echo" with two
processes (one fork). It works like this:

1) fork a child process

2) send a signal to the child

3) the child notices the signal and sends a response signal back

4) after receiving the response, go back to (2)

This repeats until the requested number of signals has been sent, and
then the program prints stats like signals/second etc. The C file is
attached; I'm sure it's imperfect, but it does the trick.

And the results mostly agree with the benchmark results from yesterday.
Which makes sense, because once the distance collapses to ~1, AIO with
io_method=worker starts doing about the same thing for every block.

If I run the signal test on the ryzen machine, I get this:

-----------------------------------------------------------------------
root@ryzen:~# ./signal-echo 1000000
nmm_signals = 1000000
parent: sent 100000 signals in 196909 us (1.97)
...
parent: sent 1000000 signals in 1924263 us (1.92 us)
signals / sec = 519679.48
-----------------------------------------------------------------------

So it can do about 500k signals / second. This means that when
requesting blocks one by one (with distance=1), a single worker can do
about 4GB/s, assuming there's no other work (no actual I/O, no checksum
checks, ...).

Consider the warm runs with 512MB shared buffers, which means there's no
actual I/O, but the data still needs to be copied from the page cache
(by the worker). An explain analyze for the query says this:

Buffers: shared hit=2573018 read=455610

That's 455610 blocks to read, mostly one by one. So a bit less than 1
second just for the IPC, but there's also the memcpy etc.

An example result from the benchmark looks like this:

  master:   967ms
  patched: 2353ms

So that's a ~1400ms difference. A bit more than the estimate, but in the
right ballpark, and the extra overhead could be due to AIO being more
complex than sync I/O, etc. Not sure.

The xeon can do ~190k signals/second, i.e. about 1/3 of the ryzen, so
the index scan would spend ~3 seconds on the IPC. Timings for the same
test look like this:

  master:  3049ms
  patched: 9636ms

So that's about 2x the expected difference. Not sure where the extra
overhead comes from; it might be due to NUMA (which the ryzen does not
have).
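(To make the back-of-envelope numbers above easy to check, here is a
throwaway sketch, not part of the original mail; it assumes the default
8kB block size and exactly one signal round trip per block, and ignores
the memcpy and any other per-block work.)

-----------------------------------------------------------------------
#include <stdio.h>

int
main(void)
{
	double	signals_per_sec = 519679.48;	/* measured above on the ryzen */
	double	block_size = 8192;				/* default 8kB block size */
	double	blocks_read = 455610;			/* "shared read" from EXPLAIN */

	/* one block per signal round trip => rough upper bound on throughput */
	printf("max throughput = %.2f GB/s\n",
		   signals_per_sec * block_size / 1e9);

	/* IPC time alone for the warm run, ignoring the memcpy etc. */
	printf("IPC time       = %.2f s\n", blocks_read / signals_per_sec);

	return 0;
}
-----------------------------------------------------------------------

That prints roughly 4 GB/s and ~0.9 s, matching the "about 4GB/s" and
"a bit less than 1 second" figures above.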
So I think the IPC overhead with "worker" can be quite significant,
especially for cases with distance=1. I don't think it's a major issue
for PG18, because seq/bitmap scans are unlikely to collapse the distance
like this, and with larger distances the cost amortizes. It seems to be
a much bigger issue for the index prefetching.

This is for the "warm" runs with 512MB, with the basic prefetch patch.
I'm not sure it explains the overhead with the patches that increase the
prefetch distance (be it mine or Thomas' patch), or the cold runs. The
regressions seem to be smaller in those cases, though.

regards

--
Tomas Vondra
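(The attached C file is not reproduced in the archive text. As a rough
illustration of the scheme described in the message - fork a child, ping
it with a signal, wait for its reply, repeat - a minimal sketch might
look like the following. The choice of SIGUSR1/SIGUSR2, the sigwait()
loop, and the progress reporting are assumptions, not the actual
attachment.)

-----------------------------------------------------------------------
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	long		nsignals = (argc > 1) ? atol(argv[1]) : 1000000;
	sigset_t	set;
	int			sig;
	pid_t		child;
	struct timeval start, now;
	long		elapsed_us;

	/* block both signals before forking, so neither process loses one */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigaddset(&set, SIGUSR2);
	sigprocmask(SIG_BLOCK, &set, NULL);

	child = fork();
	if (child < 0)
	{
		perror("fork");
		return 1;
	}

	if (child == 0)
	{
		/* child: wait for a ping, answer it, forever */
		pid_t		parent = getppid();

		for (;;)
		{
			sigwait(&set, &sig);
			kill(parent, SIGUSR2);
		}
	}

	/* parent: drive the echo loop and measure the round-trip cost */
	gettimeofday(&start, NULL);

	for (long i = 1; i <= nsignals; i++)
	{
		kill(child, SIGUSR1);
		sigwait(&set, &sig);	/* wait for the child's reply */

		if (i % 100000 == 0)
		{
			gettimeofday(&now, NULL);
			elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L
				+ (now.tv_usec - start.tv_usec);
			printf("parent: sent %ld signals in %ld us (%.2f us)\n",
				   i, elapsed_us, (double) elapsed_us / i);
		}
	}

	gettimeofday(&now, NULL);
	elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L
		+ (now.tv_usec - start.tv_usec);
	printf("signals / sec = %.2f\n", nsignals * 1000000.0 / elapsed_us);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}
-----------------------------------------------------------------------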