Thread: BAS_BULKREAD vs read stream
Hi,

There are two issues with BAS_BULKREAD's interaction with the read stream, one affecting 17+ and one affecting 18+.

The 17+ issue:

While trying to analyze issues with BAS_BULKREAD's size vs AIO, I was getting really confusing results. After an embarrassingly long time I figured out that a good chunk of those were due to BAS_BULKREAD not actually limiting buffer usage - we'd end up filling shared buffers despite using the strategy.

There are two reasons for that:

- read_stream.c doesn't account for the buffer it is returning from read_stream_next_buffer(). If we subtract 1 from GetAccessStrategyPinLimit(), this avenue for escaping the strategy is closed and the amount of "leakage" is reduced very substantially.

- read_stream.c doesn't know about buffer pins held externally. For sequential scans, the buffer pin held by the slot passed to heap_getnextslot() will occasionally cause a buffer to still be pinned when the strategy wants to reuse it, preventing reuse. This causes a slower "leak", but it's still rather substantial.

Obviously there can be plans that pin buffers for longer, but that's relatively rare and nothing new. Whereas leaking buffers even with a plain seqscan is new in 17.

This issue exists in 17, but is harder to hit, because there we don't actually read ahead with sequential scans. If we just do a single read at a time, we don't pin more than io_combine_limit buffers at once. The default io_combine_limit is 16, whereas BAS_BULKREAD is 32 buffers with the default BLCKSZ=8192. As soon as one uses io_combine_limit=32 the issue reproduces on 17. With AIO this obviously becomes easier to hit.

I think medium term we need to overhaul how strategies work rather substantially - the whole concept isn't quite right [1]. But for now we need a solution that's doable for both 17 and 18. While not pretty, I think the best short term solution is to just subtract 2 from the strategy limit. That seems to work for all common uses of BAS_BULKREAD currently. I suspect that we should subtract one in read_stream.c (to account for the buffer returned by read_stream_next_buffer()) and one in GetAccessStrategyPinLimit() (to account for pins held externally).

The 18 issue:

Right now there's a noticeable performance drop when going from a seqscan that doesn't use BAS_BULKREAD to one that does. This is very pronounced when not hitting the OS page cache, but noticeable even otherwise.

The reason for that slowdown is that BAS_BULKREAD is so small that it only allows 1-2 IOs in flight (io_combine_limit = 16, i.e. 128kB, vs a BAS_BULKREAD ring of 256kB). Whereas with a smaller table that doesn't trigger BAS_BULKREAD, we can have 16 IOs in flight with the default settings, assuming a large enough shared_buffers (with a very small s_b we'll limit how many buffers we pin).

The obvious solution to that would be to increase BAS_BULKREAD substantially above 256kB.

For quite a while I was worried about increasing the size, because somewhere (I couldn't find it while writing this email, will add the reference once I've found it again) we have a comment explaining that a small size was chosen because it helps with CPU cache efficiency. Initially I could indeed see reasonably sized effects, but a good chunk of that turned out to be the issue above, where small sizes simply wouldn't actually use the ring and thus have completely different performance characteristics.
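To make the gap concrete, here is the rough arithmetic with the defaults mentioned above (io_combine_limit = 16, BLCKSZ = 8192, 16 concurrent IOs) - just back-of-the-envelope numbers, not measurements:

    one combined IO:     io_combine_limit * BLCKSZ = 16 * 8kB = 128kB
    BAS_BULKREAD ring:   256kB -> room for only ~2 such IOs in flight
    16 concurrent IOs:   16 * 128kB = 2MB of buffers undergoing IO

So to sustain the same IO concurrency under the strategy, the ring would need to be on the order of 2MB rather than 256kB.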
I *can* easily reproduce effects of doing very large reads when the data is in the page cache, which indeed seem to be cache effects - specifically some CPU oddities around crossing kernel/user memory, see [3]. But that's a separate issue from the ring size itself.

I also can reproduce some negative effects of using larger ring sizes for io_method=sync. A 10GB pg_prewarm(), which I extended with the ability to use a strategy of a certain size, slows down ~22% when going from 8MB to 16MB. Obviously io_method=sync does not benefit from a larger ring size, as it just executes IOs synchronously, which is not limited by the ring size.

The slowdown for io_method=worker with very small ring sizes is substantial, whether the buffer is in the page cache or not. It simply limits the asynchronicity too much. At 256kB io_method=worker is ~50% slower than io_method=sync, at 512kB it's 20% faster, and at 1MB io_method=worker is 2.5x faster. With io_uring there is a 7% regression at 128kB (lower than the current default), at 256kB io_uring is 5% faster, reaching 1.9x at 3MB.

I think we should consider increasing BAS_BULKREAD to something like

  Min(256, io_combine_limit * (effective_io_concurrency + 1))

As I said earlier, I think we need to redesign strategies, but this seems to address the regression when going from no-strategy to strategy, without causing any meaningful regressions.

I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and couldn't come up with any benefits. It seems to hurt fairly quickly.

Greetings,

Andres Freund

[1] The whole idea of reusing buffers after a set distance is close to nonsensical for BAS_BULKREAD - we should reuse them as soon as viable (i.e. not pinned anymore), for cache efficiency. The only reason deferring the reuse makes *some* sense is [2]. It doesn't even really make sense for the other BAS_'s. What we should control there is the amount of dirty buffers and how to flush WAL at a sensible rate. But with BAS_VACUUM that doesn't work at all, as it's used for both reads and writes. If vacuum does modify buffers, we can end up flushing WAL way too frequently, and if we only rarely modify buffers, the amount of buffers in the ring is too big.

[2] For synchronize_seqscans to work, SYNC_SCAN_REPORT_INTERVAL needs to be smaller than the BAS_BULKREAD size. The factor is 2x right now - way, way too small with IO rates achievable on even a cheap laptop. Cheap storage can do reads at gigabytes/second and one backend can process many hundreds of MB each second. Having only 128kB between SYNC_SCAN_REPORT_INTERVAL and the BAS_BULKREAD size makes it very likely that we have to re-read the buffer from the OS. That's bad with buffered IO, catastrophic with direct IO. On a local NVMe SSD, on a large table, with the normal BAS_BULKREAD, I often get effectively *no* buffer hits when running two concurrent seqscans, even if I disable query parallelism - sometimes in the 10s, sometimes in the low thousands, always < 0.1%. At 16MB I get 25% hits, at 128MB it's close to 50% (the max you could get). With parallelism I see very low hit rates even with 16MB, but with 256MB it starts to get better.

[3] https://postgr.es/m/yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq%403r54w5zqldwn
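To put the 2x factor in [2] into rough numbers - the ~500 MB/s per-backend rate below is just an illustrative round figure, not a measurement:

    slack between report interval and ring size:  256kB - 128kB = 128kB
    time to consume that slack at 500 MB/s:       128kB / 500 MB/s ~= 0.25 ms

In other words, a scan that trails the reported position by even a fraction of a millisecond will already find the leader's buffers recycled.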
On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> I think we should consider increasing BAS_BULKREAD to something like
> Min(256, io_combine_limit * (effective_io_concurrency + 1))

Do you mean Max? If so, this basically makes sense to me.

Overall, I think even though the ring is about reusing buffers, we have to think about how many IOs that reasonably is -- which this formula does.

You mentioned testing with 8MB, did you see some sort of cliff anywhere between 256kB and 8MB?

> I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> couldn't come up with any benefits. It seems to hurt fairly quickly.

So, how will you deal with it when the BAS_BULKREAD ring is bigger?

- Melanie
Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > I think we should consider increasing BAS_BULKREAD to something like
> > Min(256, io_combine_limit * (effective_io_concurrency + 1))
>
> Do you mean Max? If so, this basically makes sense to me.

Err, yes.

I was wondering whether we should add a Max(SYNC_SCAN_REPORT_INTERVAL, ...), but it's a private value, and the proposed formula doesn't really change anything for SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.

> Overall, I think even though the ring is about reusing buffers, we
> have to think about how many IOs that reasonably is -- which this
> formula does.

Right - the prior limit kinda somewhat made sense before we had IO combining, but after that *and* having AIO it is clearly obsolete.

> You mentioned testing with 8MB, did you see some sort of cliff anywhere
> between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64kB and 128kB, then gets worse between 128kB and 256kB (the current value), and then seems to gradually get worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB, and then is very good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I do see a 6% decline from 16MB->32MB.

I suspect the 16MB->32MB cliff is due to L3-related effects - the L3 is 13.8MB per socket (of which I have 2). It's not entirely clear what that effect is; all the additional cycles are spent in the kernel, not in userspace. I strongly suspect it's related to SMAP [1], but I don't really understand the details. All I know is that disabling SMAP removes this cliff on several Intel and AMD systems, both client and server CPUs.

For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same up to 64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB, 3.2x faster at 4MB, and stays the same up to 64MB.

For DIO/unbuffered IO:

As io_method=sync obviously doesn't do DIO/unbuffered IO in a reasonable way, it doesn't make sense to compare against it, so I'm comparing to buffered IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the throughput of buffered IO) and slightly worse than buffered at 256kB. Best perf is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x the throughput of buffered IO) and starts to be decent after that. Best perf is reached at 4MB and stays very consistent after that.

The peak perf of buffered-but-uncached IO and DIO is rather close, as I'm testing this on a PCIe3 drive.
The difference in CPU cycles is massive though:

worker buffered:

          9,850.27 msec cpu-clock              #    3.001 CPUs utilized
            305,050      context-switches       #   30.969 K/sec
             51,049      cpu-migrations         #    5.182 K/sec
             11,530      page-faults            #    1.171 K/sec
     16,615,532,455      instructions           #    0.84  insn per cycle              (30.72%)
     19,876,584,840      cycles                 #    2.018 GHz                         (30.75%)
      3,256,065,951      branches               #  330.556 M/sec                       (30.78%)
         26,046,144      branch-misses          #    0.80% of all branches             (30.81%)
      4,452,808,846      L1-dcache-loads        #  452.050 M/sec                       (30.83%)
        574,304,216      L1-dcache-load-misses  #   12.90% of all L1-dcache accesses   (30.82%)
        169,117,254      LLC-loads              #   17.169 M/sec                       (30.82%)
         82,769,152      LLC-load-misses        #   48.94% of all LL-cache accesses    (30.82%)
        377,137,247      L1-icache-load-misses                                         (30.78%)
      4,475,873,620      dTLB-loads             #  454.391 M/sec                       (30.76%)
          5,496,266      dTLB-load-misses       #    0.12% of all dTLB cache accesses  (30.73%)
          9,765,507      iTLB-loads             #  991.395 K/sec                       (30.70%)
          7,525,173      iTLB-load-misses       #   77.06% of all iTLB cache accesses  (30.70%)

        3.282465335 seconds time elapsed

worker DIO:

          9,783.05 msec cpu-clock              #    3.000 CPUs utilized
            356,102      context-switches       #   36.400 K/sec
             32,575      cpu-migrations         #    3.330 K/sec
              1,245      page-faults            #  127.261 /sec
      8,076,414,780      instructions           #    1.00  insn per cycle              (30.73%)
      8,109,508,194      cycles                 #    0.829 GHz                         (30.73%)
      1,585,426,781      branches               #  162.058 M/sec                       (30.74%)
         17,869,296      branch-misses          #    1.13% of all branches             (30.78%)
      2,199,974,033      L1-dcache-loads        #  224.876 M/sec                       (30.79%)
        167,855,899      L1-dcache-load-misses  #    7.63% of all L1-dcache accesses   (30.79%)
         31,303,238      LLC-loads              #    3.200 M/sec                       (30.79%)
          2,126,825      LLC-load-misses        #    6.79% of all LL-cache accesses    (30.79%)
        322,505,615      L1-icache-load-misses                                         (30.79%)
      2,186,161,593      dTLB-loads             #  223.464 M/sec                       (30.79%)
          3,892,051      dTLB-load-misses       #    0.18% of all dTLB cache accesses  (30.79%)
         10,306,643      iTLB-loads             #    1.054 M/sec                       (30.77%)
          6,279,217      iTLB-load-misses       #   60.92% of all iTLB cache accesses  (30.74%)

        3.260901966 seconds time elapsed

io_uring buffered:

          9,924.48 msec cpu-clock              #    2.990 CPUs utilized
            340,821      context-switches       #   34.341 K/sec
             57,048      cpu-migrations         #    5.748 K/sec
              1,336      page-faults            #  134.617 /sec
     16,630,629,989      instructions           #    0.88  insn per cycle              (30.74%)
     18,985,579,559      cycles                 #    1.913 GHz                         (30.64%)
      3,253,081,357      branches               #  327.784 M/sec                       (30.67%)
         24,599,858      branch-misses          #    0.76% of all branches             (30.68%)
      4,515,979,721      L1-dcache-loads        #  455.035 M/sec                       (30.69%)
        556,041,180      L1-dcache-load-misses  #   12.31% of all L1-dcache accesses   (30.67%)
        160,198,962      LLC-loads              #   16.142 M/sec                       (30.65%)
         75,164,349      LLC-load-misses        #   46.92% of all LL-cache accesses    (30.65%)
        348,585,830      L1-icache-load-misses                                         (30.63%)
      4,473,414,356      dTLB-loads             #  450.746 M/sec                       (30.91%)
          1,193,495      dTLB-load-misses       #    0.03% of all dTLB cache accesses  (31.04%)
          5,507,512      iTLB-loads             #  554.942 K/sec                       (31.02%)
          2,973,177      iTLB-load-misses       #   53.98% of all iTLB cache accesses  (31.02%)

        3.319117422 seconds time elapsed

io_uring DIO:

          9,782.99 msec cpu-clock              #    3.000 CPUs utilized
             96,916      context-switches       #    9.907 K/sec
                  8      cpu-migrations         #    0.818 /sec
              1,001      page-faults            #  102.320 /sec
      5,902,978,172      instructions           #    1.45  insn per cycle              (30.73%)
      4,059,940,112      cycles                 #    0.415 GHz                         (30.73%)
      1,117,690,786      branches               #  114.248 M/sec                       (30.75%)
         10,994,087      branch-misses          #    0.98% of all branches             (30.77%)
      1,559,149,686      L1-dcache-loads        #  159.374 M/sec                       (30.78%)
         85,057,280      L1-dcache-load-misses  #    5.46% of all L1-dcache accesses   (30.78%)
         11,393,236      LLC-loads              #    1.165 M/sec                       (30.78%)
          2,599,701      LLC-load-misses        #   22.82% of all LL-cache accesses    (30.79%)
        174,124,990      L1-icache-load-misses                                         (30.80%)
      1,545,148,685      dTLB-loads             #  157.942 M/sec                       (30.79%)
            156,524      dTLB-load-misses       #    0.01% of all dTLB cache accesses  (30.79%)
          3,325,307      iTLB-loads             #  339.907 K/sec                       (30.77%)
          2,288,730      iTLB-load-misses       #   68.83% of all iTLB cache accesses  (30.74%)

        3.260716339 seconds time elapsed

I'd say a 4.5x reduction in cycles is rather nice :)

> > I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
> > couldn't come up with any benefits. It seems to hurt fairly quickly.
>
> So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant with "hurt fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make synchronize_seqscans work even less well.

Greetings,

Andres Freund

[1] https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
Hi,

On 2025-04-07 16:28:20 -0400, Andres Freund wrote:
> On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> > On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > I think we should consider increasing BAS_BULKREAD to something like
> > > Min(256, io_combine_limit * (effective_io_concurrency + 1))
> >
> > Do you mean Max? If so, this basically makes sense to me.
>
> Err, yes.

In the attached I implemented the above idea, with some small additional refinements:

- To allow sync seqscans to work at all, we should only *add* to the 256kB that we currently have - otherwise all buffers in a ring would be undergoing IO, never allowing two synchronizing scans to actually use the buffers that the other scan has already read in. This also kind of obsoletes the + 1 in the formula above, although that is arguable, particularly for effective_io_concurrency=0.

- If the backend has a PinLimit() that won't allow io_combine_limit * effective_io_concurrency buffers to undergo IO, it doesn't make sense to make the ring bigger. At best it would waste space for the ring, at worst it would make "ring escapes" inevitable - the victim buffer search would always replace buffers that we have in the ring.

- The multiplication by (BLCKSZ / 1024) that I omitted above is actually included :)

I unfortunately think we do need *something* to address $subject for 18 - the performance regression with increasing relation sizes is otherwise just too big. It's trivial to find queries getting slower by more than 4x on local, low-latency NVMe storage, and on network storage the regression will often be bigger.

If we don't do something for 18, the only consolation would be that the performance when using the 256kB BAS_BULKREAD is rather close to the performance one gets in 17, with or without a strategy. But I don't think that would make it less surprising that once your table grows sufficiently to use a strategy, your IO throughput craters.

I have a local prototype for the 17/18 "strategy escape" issue and will work on polishing that soon, unless you have something for that, Thomas?

Greetings,

Andres Freund
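For illustration, here is a minimal sketch of the sizing logic described in the list above. It is not the attached patch, and the exact names (in particular GetPinLimit()) are assumptions based on the description, so treat the details as approximate:

```c
/*
 * Illustrative sketch only - not the attached patch.  Assumes the usual
 * PostgreSQL GUCs io_combine_limit and effective_io_concurrency (both in
 * blocks) and a GetPinLimit()-style function returning how many buffers
 * this backend may pin.
 */
static int
bulkread_ring_size_kb(void)
{
	/* keep the historical 256kB as the minimum, for synchronized seqscans */
	int			ring_size_kb = 256;
	int			io_blocks = io_combine_limit * effective_io_concurrency;

	/*
	 * Only grow the ring if this backend may actually pin that many
	 * buffers; otherwise a larger ring would just waste space and make
	 * "ring escapes" more likely.
	 */
	if (io_blocks <= (int) GetPinLimit())
		ring_size_kb += io_blocks * (BLCKSZ / 1024);

	return ring_size_kb;
}
```

With the defaults discussed upthread (io_combine_limit = 16, effective_io_concurrency = 16, BLCKSZ = 8192) this comes out to 256kB + 2MB, i.e. roughly the ~2MB needed to keep 16 combined IOs in flight.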
On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> In the attached I implemented the above idea, with some small additional
> refinements:

LGTM.

How I wish EXPLAIN would show some clues about strategies...
Hi,

On 2025-04-08 18:11:04 +1200, Thomas Munro wrote:
> On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> > In the attached I implemented the above idea, with some small additional
> > refinements:
>
> LGTM.

Thanks for checking.

> How I wish EXPLAIN would show some clues about strategies...

Indeed. There will be some interesting piercing of layers to make that work...

Greetings,

Andres Freund
On Sun, Apr 6, 2025 at 10:15 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

[..]

> The obvious solution to that would be to increase BAS_BULKREAD substantially
> above 256kB.
>
> For quite a while I was worried about increasing the size, because somewhere (I
> couldn't find it while writing this email, will add the reference once I've
> found it again) we have a comment explaining that a small size was chosen
> because it helps with CPU cache efficiency.

Hi,

FWIW, I was trying to understand the scope of this change, and GetAccessStrategy() actually points to src/backend/storage/buffer/README, which explains the old rationale and value (pre-dating this change). It says:

```
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient.
```

-J.