Re: BAS_BULKREAD vs read stream - Mailing list pgsql-hackers

From:            Andres Freund
Subject:         Re: BAS_BULKREAD vs read stream
Msg-id:          dr4rjc4xewy5uf2dzywuq2fh6fnaydiwxexumjx3b6hkefatcn@kibyxaztit2i
In response to:  Re: BAS_BULKREAD vs read stream (Melanie Plageman <melanieplageman@gmail.com>)
List:            pgsql-hackers

Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > I think we should consider increasing BAS_BULKREAD to something like
> > Min(256, io_combine_limit * (effective_io_concurrency + 1))
>
> Do you mean Max? If so, this basically makes sense to me.

Err, yes. I was wondering whether we should add a
Max(SYNC_SCAN_REPORT_INTERVAL, ...), but it's a private value, and the
proposed formula doesn't really change anything for
SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.

> Overall, I think even though the ring is about reusing buffers, we
> have to think about how many IOs that reasonably is -- which this
> formula does.

Right - the prior limit kinda somewhat made sense before we had IO combining,
but after that *and* having AIO it is clearly obsoleted.

(A rough sketch of the proposed sizing is included at the end of this mail.)

> You mentioned testing with 8MB, did you see some sort of cliff anywhere
> between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64 and 128kB, then gets worse
between 128kB and 256kB (the current value), and then seems to gradually get
worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB and then is very
good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I
do see a 6% decline from 16MB->32MB.

I suspect the 16-32MB cliff is due to L3-related effects - the L3 cache here
is 13.8MB per socket (of which I have 2). It's not entirely clear what that
effect is - all the additional cycles are spent in the kernel, not in
userspace. I strongly suspect it's related to SMAP [1], but I don't really
understand the details. All I know is that disabling SMAP removes this cliff
on several Intel and AMD systems, both client and server CPUs.

For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x
faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same up to
64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB,
3.2x faster at 4MB and stays the same up to 64MB.

For DIO/unbuffered IO:

As io_method=sync, obviously, doesn't do DIO/unbuffered IO in a reasonable
way, it doesn't make sense to compare it. So I'm comparing to buffered IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the
throughput of buffered IO), slightly worse than buffered at 256kB. Best perf
is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x
the throughput of buffered IO) and starts to be decent after that. Best perf
is reached at 4MB and stays very consistent after that.

The peak perf of buffered but uncached IO and DIO is rather close, as I'm
testing this on a PCIe3 drive.
The difference in CPU cycles is massive though:

worker buffered:

       9,850.27 msec cpu-clock            #    3.001 CPUs utilized
        305,050      context-switches     #   30.969 K/sec
         51,049      cpu-migrations       #    5.182 K/sec
         11,530      page-faults          #    1.171 K/sec
 16,615,532,455      instructions         #    0.84  insn per cycle              (30.72%)
 19,876,584,840      cycles               #    2.018 GHz                         (30.75%)
  3,256,065,951      branches             #  330.556 M/sec                       (30.78%)
     26,046,144      branch-misses        #    0.80% of all branches             (30.81%)
  4,452,808,846      L1-dcache-loads      #  452.050 M/sec                       (30.83%)
    574,304,216      L1-dcache-load-misses #  12.90% of all L1-dcache accesses   (30.82%)
    169,117,254      LLC-loads            #   17.169 M/sec                       (30.82%)
     82,769,152      LLC-load-misses      #   48.94% of all LL-cache accesses    (30.82%)
    377,137,247      L1-icache-load-misses                                       (30.78%)
  4,475,873,620      dTLB-loads           #  454.391 M/sec                       (30.76%)
      5,496,266      dTLB-load-misses     #    0.12% of all dTLB cache accesses  (30.73%)
      9,765,507      iTLB-loads           #  991.395 K/sec                       (30.70%)
      7,525,173      iTLB-load-misses     #   77.06% of all iTLB cache accesses  (30.70%)

    3.282465335 seconds time elapsed

worker DIO:

       9,783.05 msec cpu-clock            #    3.000 CPUs utilized
        356,102      context-switches     #   36.400 K/sec
         32,575      cpu-migrations       #    3.330 K/sec
          1,245      page-faults          #  127.261 /sec
  8,076,414,780      instructions         #    1.00  insn per cycle              (30.73%)
  8,109,508,194      cycles               #    0.829 GHz                         (30.73%)
  1,585,426,781      branches             #  162.058 M/sec                       (30.74%)
     17,869,296      branch-misses        #    1.13% of all branches             (30.78%)
  2,199,974,033      L1-dcache-loads      #  224.876 M/sec                       (30.79%)
    167,855,899      L1-dcache-load-misses #   7.63% of all L1-dcache accesses   (30.79%)
     31,303,238      LLC-loads            #    3.200 M/sec                       (30.79%)
      2,126,825      LLC-load-misses      #    6.79% of all LL-cache accesses    (30.79%)
    322,505,615      L1-icache-load-misses                                       (30.79%)
  2,186,161,593      dTLB-loads           #  223.464 M/sec                       (30.79%)
      3,892,051      dTLB-load-misses     #    0.18% of all dTLB cache accesses  (30.79%)
     10,306,643      iTLB-loads           #    1.054 M/sec                       (30.77%)
      6,279,217      iTLB-load-misses     #   60.92% of all iTLB cache accesses  (30.74%)

    3.260901966 seconds time elapsed

io_uring buffered:

       9,924.48 msec cpu-clock            #    2.990 CPUs utilized
        340,821      context-switches     #   34.341 K/sec
         57,048      cpu-migrations       #    5.748 K/sec
          1,336      page-faults          #  134.617 /sec
 16,630,629,989      instructions         #    0.88  insn per cycle              (30.74%)
 18,985,579,559      cycles               #    1.913 GHz                         (30.64%)
  3,253,081,357      branches             #  327.784 M/sec                       (30.67%)
     24,599,858      branch-misses        #    0.76% of all branches             (30.68%)
  4,515,979,721      L1-dcache-loads      #  455.035 M/sec                       (30.69%)
    556,041,180      L1-dcache-load-misses #  12.31% of all L1-dcache accesses   (30.67%)
    160,198,962      LLC-loads            #   16.142 M/sec                       (30.65%)
     75,164,349      LLC-load-misses      #   46.92% of all LL-cache accesses    (30.65%)
    348,585,830      L1-icache-load-misses                                       (30.63%)
  4,473,414,356      dTLB-loads           #  450.746 M/sec                       (30.91%)
      1,193,495      dTLB-load-misses     #    0.03% of all dTLB cache accesses  (31.04%)
      5,507,512      iTLB-loads           #  554.942 K/sec                       (31.02%)
      2,973,177      iTLB-load-misses     #   53.98% of all iTLB cache accesses  (31.02%)

    3.319117422 seconds time elapsed

io_uring DIO:

       9,782.99 msec cpu-clock            #    3.000 CPUs utilized
         96,916      context-switches     #    9.907 K/sec
              8      cpu-migrations       #    0.818 /sec
          1,001      page-faults          #  102.320 /sec
  5,902,978,172      instructions         #    1.45  insn per cycle              (30.73%)
  4,059,940,112      cycles               #    0.415 GHz                         (30.73%)
  1,117,690,786      branches             #  114.248 M/sec                       (30.75%)
     10,994,087      branch-misses        #    0.98% of all branches             (30.77%)
  1,559,149,686      L1-dcache-loads      #  159.374 M/sec                       (30.78%)
     85,057,280      L1-dcache-load-misses #   5.46% of all L1-dcache accesses   (30.78%)
     11,393,236      LLC-loads            #    1.165 M/sec                       (30.78%)
      2,599,701      LLC-load-misses      #   22.82% of all LL-cache accesses    (30.79%)
    174,124,990      L1-icache-load-misses                                       (30.80%)
  1,545,148,685      dTLB-loads           #  157.942 M/sec                       (30.79%)
        156,524      dTLB-load-misses     #    0.01% of all dTLB cache accesses  (30.79%)
      3,325,307      iTLB-loads           #  339.907 K/sec                       (30.77%)
      2,288,730      iTLB-load-misses     #   68.83% of all iTLB cache accesses  (30.74%)

    3.260716339 seconds time elapsed

I'd say a 4.5x reduction in cycles is rather nice :)

> > I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be
> > increased, and couldn't come up with any benefits. It seems to hurt fairly
> > quickly.
>
> So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant by "hurt
fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make
synchronize_seqscans work even less well.

Greetings,

Andres Freund

[1] https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
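
For concreteness, here is a rough sketch of the sizing formula discussed at
the top of this mail. It is an illustration only, not the actual patch: the
helper name is made up, the ring is assumed to stay expressed in kB as in
today's GetAccessStrategy(), io_combine_limit is assumed to be measured in
blocks of BLCKSZ, and reading the proposal's "256" as kB is likewise an
assumption.

#include "postgres.h"
#include "storage/bufmgr.h"		/* io_combine_limit, effective_io_concurrency */

/*
 * Sketch only: size the BAS_BULKREAD ring so it can hold
 * effective_io_concurrency + 1 combined IOs, keeping the historical 256kB
 * as a floor.  The eventual patch may well look different.
 */
static int
bulkread_ring_size_kb(void)
{
	/* one combined IO, converted from blocks of BLCKSZ to kB */
	int		combined_io_kb = io_combine_limit * (BLCKSZ / 1024);

	/* room for effective_io_concurrency IOs in flight, plus one */
	int		wanted_kb = combined_io_kb * (effective_io_concurrency + 1);

	/* never shrink below the current 256kB ring */
	return Max(256, wanted_kb);
}

With, say, io_combine_limit = 128kB and effective_io_concurrency = 16, this
comes out to roughly 2MB, comfortably inside the 512kB-8MB range that looks
good in the measurements above, whereas the old fixed 256kB ring only has
room for two such combined IOs.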