Re: AIO / read stream heuristics adjustments for index prefetching - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO / read stream heuristics adjustments for index prefetching
Msg-id 3gkuvs3lz3u3skuaxfkxnsysfqslf2srigl6546vhesekve6v2@va3r5esummvg
In response to Re: AIO / read stream heuristics adjustments for index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

There are a bunch of heuristics mentioned in the following proposed commit:

On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async.  Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.

> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important.  In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.

Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.

Previously I'd just manually run fio with different options; I made that a bit
more systematic with the attached (only halfway hand-written) script.

I attached two sets of results: one allowing access to multiple cores, and one
restricted to a single core (simulating a very busy machine).

(nblocks is in multiples of 8KB)

Multi-core:

nblocks    iod    async    bw_gib_s    lat_usec
1    1    0    4.2075    1.5802
1    1    1    1.0462    6.9652
1    2    0    4.1362    3.4533
1    2    1    1.9284    7.6040
1    4    0    4.0030    7.3720
1    4    1    4.2713    6.9086
1    8    0    4.1653    14.4072
1    8    1    4.3301    13.8365
1    16    0    4.1829    28.9216
1    16    1    4.3006    28.1261
1    32    0    4.0735    59.6232
1    32    1    4.3248    56.1614

I.e. at nblocks=1, there's pretty much no gain from async; latency is markedly
worse at the low end and just about catches up at the high end.

Around iodepth 4, the loss from async is nonexistent or minimal.


2    1    0    5.7289    2.4261
2    1    1    1.8708    7.7466
2    2    0    5.7964    5.0144
2    2    1    3.3749    8.7417
2    4    0    5.8434    10.2023
2    4    1    7.9783    7.3977
2    8    0    5.8166    20.7226
2    8    1    8.2545    14.5431
2    16    0    5.8215    41.6613
2    16    1    8.2354    29.3879
2    32    0    5.6530    86.0286
2    32    1    8.3218    58.3826

With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited.  Latency already starts to be better at iodepth 4.


4    1    0    7.4131    3.8807
4    1    1    3.2133    9.1827
4    2    0    7.3150    8.0854
4    2    1    5.4983    10.8039
4    4    0    7.2784    16.5097
4    4    1    11.2717    10.5699
4    8    0    7.2873    33.2331
4    8    1    16.6299    14.4164
4    16    0    7.1606    67.8777
4    16    1    16.9794    28.4981
4    32    0    6.2954    154.6834
4    32    1    16.3686    59.3610

With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.


8    1    0    8.0403    7.3503
8    1    1    4.6038    12.7202
8    2    0    8.0052    14.9161
8    2    1    8.5176    13.9987
8    4    0    8.1519    29.6698
8    4    1    14.8211    16.1640
8    8    0    7.8525    61.8612
8    8    1    27.5860    17.4434
8    16    0    6.8887    141.3268
8    16    1    34.1307    28.3463
8    32    0    6.9031    282.2350
8    32    1    38.2430    50.7700

With nblocks=8, async is faster already at iodepth 2.


64    1    0    9.1983    52.6768
64    1    1    8.1505    59.5486

128    1    0    7.5442    128.8704
128    1    1    7.3481    132.2355

Somewhere between nblocks=64 and 128, we reach the point where there's
basically no loss even at iodepth 1.


This seems to validate setting IOSQE_ASYNC at a block count of >= 4 and at a
queue depth of > 4. I guess it could make sense to reduce the depth threshold
from > 4 to >= 4 based on these numbers, but I don't think it matters terribly.



Single-core:

Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.

At iodepth 1 (worst case, a context switch for every IO):

nblocks    iod    async    bw_gib_s    lat_usec
1    1    0    4.2324    1.5692
1    1    1    1.7883    3.9574
2.36x bw regression

2    1    0    5.7914    2.4004
2    1    1    2.9585    4.8417
1.96x bw regression

4    1    0    7.3171    3.9242
4    1    1    4.2450    6.8171
1.7x bw regression

8    1    0    8.1162    7.2674
8    1    1    5.7536    10.2948
1.4x bw regression

16    1    0    8.8559    13.5212
16    1    1    7.1163    16.8277
1.6x bw regression


But with the proposed changes, the IO depth in postgres would not stay at 1;
it would ramp up due to needing to wait for the kernel to complete those IOs
asynchronously.

Therefore, here's the same comparison at a deeper IO depth:

nblocks    iod    async    bw_gib_s    lat_usec
1    16    0    4.1094    29.4339
1    16    1    3.3922    35.7044
1.21x bw regression

2    16    0    5.8381    41.5402
2    16    1    4.8104    50.4571
1.21x bw regression

4    16    0    7.1204    68.2612
4    16    1    5.6479    86.0973
1.26x bw regression

8    16    0    7.0780    137.5520
8    16    1    6.1687    157.8805
1.14x bw regression

16    16    0    7.4523    261.4281
16    16    1    6.7192    290.0837
1.10x bw regression


This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.
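For clarity, the "Nx bw regression" figures quoted between the rows are just
the ratio of sync to async bandwidth for the same (nblocks, iod) row; a
trivial helper (name mine, not from the attached script) reproduces them:

```c
/*
 * "Nx bw regression" = sync bandwidth / async bandwidth for the same row.
 * Helper name is illustrative, not from the attached script.
 */
static double
bw_regression(double sync_bw_gib_s, double async_bw_gib_s)
{
	return sync_bw_gib_s / async_bw_gib_s;
}

/* e.g. nblocks=1: 4.1094 / 3.3922 ~ 1.21x; nblocks=4: 7.1204 / 5.6479 ~ 1.26x */
```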


I don't think there are very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single-block IOs, but we
aren't.  With the defaults io_combine_limit=16 and effective_io_concurrency=16,
we'd end up with a 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.

Seems ok to me.


I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).

Zen 4 Laptop:
nblocks    iod    async    bw_gib_s    lat_usec
1    1    0    6.0989    1.1408
1    1    1    1.4477    5.1246
1    2    0    6.9600    2.0827
1    2    1    2.8750    5.1711
1    4    0    7.0283    4.2307
1    4    1    8.9174    3.3169

Surprisingly, there's a bigger difference between sync/async at iod=1, but
it's again similar around iod=4.


4    1    0    14.5638    1.9616
4    1    1    5.1245    5.8016
4    2    0    14.8867    3.9607
4    2    1    12.1841    4.8662
4    4    0    14.8678    8.0764
4    4    1    21.5077    5.5417

Similar.


16    1    0    21.0754    5.5891
16    1    1    12.6180    9.4753
16    2    0    20.2770    11.8353
16    2    1    24.3277    9.8172

At nblocks=16, async already starts to be faster at iod=2.



Greetings,

Andres Freund
