Re: Allow io_combine_limit up to 1MB - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Allow io_combine_limit up to 1MB
Date
Msg-id CAKZiRmymVanRcmmbtE+WHn2YJ89CkeYbQHgjJ3DSD53SBd5BkA@mail.gmail.com
In response to Re: Allow io_combine_limit up to 1MB  (Andres Freund <andres@anarazel.de>)
Responses Re: Allow io_combine_limit up to 1MB
List pgsql-hackers
On Wed, Feb 12, 2025 at 1:03 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:
> > Tomas queried[1] the limit of 256kB (or really 32 blocks) for
> > io_combine_limit.  Yeah, I think we should increase it and allow
> > experimentation with larger numbers.  Note that real hardware and
> > protocols have segment and size limits that can force the kernel to
> > split your I/Os, so it's not at all a given that it'll help much or at
> > all to use very large sizes, but YMMV.

+0.02 to the initiative. I've always wondered why the IOs were
so capped, I know :)

> FWIW, I see substantial performance *regressions* with *big* IO sizes using
> fio. Just looking at cached buffered IO.
>
> for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192;do echo -ne "$s\t\t"; numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB --overwrite 1 --time_based=0 --runtime=10 --name test --rw read --buffered 0 --ioengine psync --buffered 1 --invalidate 0 --output-format json --bs=$((1024*${s})) |jq '.jobs[] | .read.bw_mean';done
>
> io size kB      throughput in MB/s
[..]
> 256             16864
> 512             19114
> 1024            12874
[..]

> It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
> *vastly* better performance:
>
> io size kB      throughput in MB/s
[..]
> 128             23133
> 256             23317
> 512             25829
> 1024            15912
[..]
> Most of the gain isn't due to mitigations=off but clearcpuid=smap. Apparently
> SMAP, which requires explicit code to allow kernel space to access userspace
> memory, to make exploitation harder, reacts badly to copying lots of memory.
>
> This seems absolutely bonkers to me.

There are two bizarre things there: a +35% perf boost just like that due
to security drama, and io_size=512kB being so special that it gives a
10-13% boost in your case. Any ideas why? I've got results from that Lsv2
(individual MS NVMe under Hyper-V, on ext4), which seems to be a much more
real-world, average Joe situation. It is much slower, and it shows no
advantage for block sizes beyond, let's say, 128kB:

io size kB      throughput in MB/s
4               1070
8               1117
16              1231
32              1264
64              1249
128             1313
256             1323
512             1257
1024            1216
2048            1271
4096            1304
8192            1214

The top hitters are of course stuff like clear_page_rep [k] and
rep_movs_alternative [k] (that was with mitigations=on).
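
As a quick sanity check of the percentages mentioned above, here is a minimal Python sketch that recomputes them from the throughput figures quoted in this thread. The dictionary and function names are illustrative assumptions, not anything from fio or PostgreSQL, and the elided ([..]) rows are simply left out.

```python
# Throughput in MB/s keyed by IO size in kB, copied from the figures
# quoted in this thread. Rows elided with [..] are not filled in.
default_boot = {256: 16864, 512: 19114, 1024: 12874}
smap_cleared = {128: 23133, 256: 23317, 512: 25829, 1024: 15912}
hyperv_ext4 = {
    4: 1070, 8: 1117, 16: 1231, 32: 1264, 64: 1249, 128: 1313,
    256: 1323, 512: 1257, 1024: 1216, 2048: 1271, 4096: 1304, 8192: 1214,
}


def pct_gain(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


# Gain of 512kB over 256kB in each of the two quoted fio runs.
gain_512_default = pct_gain(default_boot[512], default_boot[256])
gain_512_smap = pct_gain(smap_cleared[512], smap_cleared[256])

# Gain at 512kB from booting with mitigations=off clearcpuid=smap.
gain_boot_flags = pct_gain(smap_cleared[512], default_boot[512])

# On the Hyper-V/ext4 box: best result at or above 128kB vs. 128kB itself.
best_large = max(v for k, v in hyperv_ext4.items() if k >= 128)
gain_beyond_128 = pct_gain(best_large, hyperv_ext4[128])

print(f"512kB vs 256kB (default boot):  {gain_512_default:.1f}%")
print(f"512kB vs 256kB (SMAP cleared):  {gain_512_smap:.1f}%")
print(f"SMAP cleared vs default @512kB: {gain_boot_flags:.1f}%")
print(f"best >=128kB vs 128kB (Hyper-V): {gain_beyond_128:.1f}%")
```

On the Hyper-V numbers, the best result at or above 128kB is within about 1% of the 128kB result, which is what "no advantage beyond 128" refers to.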

-J.


