BAS_BULKREAD vs read stream - Mailing list pgsql-hackers
From: Andres Freund
Subject: BAS_BULKREAD vs read stream
Msg-id: lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
List: pgsql-hackers
Hi,

There are two issues with BAS_BULKREAD interactions with read stream. One for 17+ and one for 18+.

The 17+ issue:

While trying to analyze issues with BAS_BULKREAD's size vs AIO I was getting really confusing results. After an embarrassingly long time I figured out that a good chunk of those were due to BAS_BULKREAD not actually limiting buffer usage - we'd end up filling shared buffers despite using the strategy.

There are two reasons for that:

- read_stream.c doesn't account for the buffer it is returning from read_stream_next_buffer(). If we subtract 1 from GetAccessStrategyPinLimit() this avenue for escaping the strategy is closed. The amount of "leakage" reduces very substantially.

- read_stream.c doesn't know about buffer pins held externally. For sequential scans the buffer pin held by the slot passed to heap_getnextslot() will occasionally cause a buffer to be still pinned when the strategy wants to reuse the buffer, preventing reuse. This causes a slower "leak", but it's still rather substantial.

Obviously there can be plans that pin buffers for longer, but that's relatively rare and nothing new, whereas leaking buffers even with a plain seqscan is new in 17.

This issue exists in 17, but is harder to hit, because we don't actually read ahead with sequential scans. If we just do a single read at a time, we don't pin more than io_combine_limit buffers. The default io_combine_limit is 16, whereas BAS_BULKREAD is 32 buffers with the default BLCKSZ=8192. As soon as one uses io_combine_limit=32 the issue reproduces on 17. With AIO this obviously becomes easier to hit.

I think medium term we need to overhaul how strategies work rather substantially; the whole concept isn't quite right [1]. But for now we need a solution that's doable for both 17 and 18. While not pretty, I think the best short term solution is to just subtract 2 from the strategy limit. That seems to work for all common uses of BAS_BULKREAD currently.
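To make the arithmetic concrete, here is a simplified model of why the leak reproduces once io_combine_limit reaches the ring size. This is back-of-the-envelope Python, not the actual read_stream.c accounting; the helper name and the "+1 +1" breakdown are my own illustration of the two escape avenues described above.

```python
# Model of the 17+ leak, under the defaults named in the mail:
# BLCKSZ = 8192, BAS_BULKREAD ring = 256kB = 32 buffers.
BLCKSZ = 8192
RING_SIZE_KB = 256
ring_buffers = RING_SIZE_KB * 1024 // BLCKSZ   # 32

def strategy_escapes(io_combine_limit, ring=ring_buffers):
    """Pins a seqscan can hold at once in this simplified model:
    one combined read's worth of buffers, plus the buffer handed back
    by read_stream_next_buffer(), plus one externally held pin
    (e.g. the slot passed to heap_getnextslot()). If that exceeds the
    ring, the strategy cannot turn buffers over and "leaks"."""
    pinned = io_combine_limit + 1 + 1
    return pinned > ring

# Default io_combine_limit=16: 18 pins vs a 32-buffer ring -> fits.
assert not strategy_escapes(16)
# io_combine_limit=32: 34 pins vs 32 buffers -> reproduces on 17.
assert strategy_escapes(32)
```

Subtracting 2 from the limit (one per escape avenue) is exactly what makes the worst case fit again in this model.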
I suspect that we should subtract one in read_stream.c (to account for the buffer returned by read_stream_next_buffer()) and one in GetAccessStrategyPinLimit() (to account for pins held externally).

The 18 issue:

Right now there's a noticeable performance drop when going from a seqscan that doesn't use BAS_BULKREAD to one that does. This is very pronounced when not hitting the OS page cache, but noticeable even otherwise.

The reason for that slowdown is that BAS_BULKREAD is so small that it allows just 1-2 IOs in flight (io_combine_limit = 16 / 128kB vs BAS_BULKREAD of 256kB). Whereas, with a smaller table that doesn't trigger BAS_BULKREAD, we can have 16 concurrent IOs with the default settings, assuming a large enough shared_buffers (on very small s_b we'll limit how many buffers we pin).

The obvious solution to that would be to increase BAS_BULKREAD substantially above 256kB.

For quite a while I was worried about increasing the size, because somewhere (I couldn't find it while writing this email, will add the reference once I re-find it) we have a comment explaining that a small size was chosen because it helps with CPU cache efficiency. Initially I could indeed see reasonably sized effects, but a good chunk of that turned out to be the issue above, where small sizes simply wouldn't actually use the ring and thus have completely different performance characteristics.

I *can* easily reproduce effects of doing very large reads when the data is in the page cache, which indeed seems to be related to cache effects - apparently due to some CPU oddities around crossing kernel/user memory, see [3]. But that's a separate issue from the ring size itself.

I also can reproduce some negative effects of using larger ring sizes for io_method=sync. A 10GB pg_prewarm(), which I extended with the ability to use a strategy of a certain size, slows down ~22% when going from 8MB to 16MB.
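To put numbers on the "only 1-2 IOs in flight" point: the ring caps concurrency at roughly ring size divided by the size of one combined read. A trivial sketch (helper name is mine, not a PostgreSQL function):

```python
# How many full combined reads fit in a strategy ring of a given size,
# assuming the default io_combine_limit of 16 blocks (128kB at BLCKSZ=8192).
BLCKSZ = 8192
IO_COMBINE_LIMIT_BLOCKS = 16
COMBINED_READ_KB = IO_COMBINE_LIMIT_BLOCKS * BLCKSZ // 1024   # 128kB

def max_ios_in_flight(ring_kb):
    return ring_kb // COMBINED_READ_KB

# Today's BAS_BULKREAD ring (256kB) caps concurrency at 2 combined reads,
# versus the ~16 a small non-strategy seqscan can keep in flight.
assert max_ios_in_flight(256) == 2
assert max_ios_in_flight(1024) == 8
```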
Obviously io_method=sync does not benefit from a larger ring size, as it just executes IOs synchronously, and synchronous IO is not limited by the ring size.

The slowdown for io_method=worker for very small ring sizes is substantial, whether the buffer is in the page cache or not. It simply limits the asynchronicity too much. At 256kB io_method=worker is ~50% slower than io_method=sync, at 512kB it's 20% faster, at 1MB io_method=worker is 2.5x faster.

With io_uring there is a 7% regression at 128kB (lower than the current default), at 256kB io_uring is 5% faster, reaching 1.9x at 3MB.

I think we should consider increasing BAS_BULKREAD to something like

  Min(256, io_combine_limit * (effective_io_concurrency + 1))

As I said earlier, I think we need to redesign strategies, but this seems to address the regression when going from no-strategy to strategy, without causing any meaningful regressions.

I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and couldn't come up with any benefits. It seems to hurt fairly quickly.

Greetings,

Andres Freund

[1] The whole idea of reusing buffers after a set distance is close to nonsensical for BAS_BULKREAD - we should reuse them as soon as viable (i.e. not pinned anymore), for cache efficiency. The only reason deferring the reuse makes *some* sense is [2]. It doesn't even really make sense for the other BAS_'s. What we should control there is the amount of dirty buffers and how to flush WAL at a sensible rate. But with BAS_VACUUM that doesn't work at all, as it's used for both reads and writes. If vacuum doesn't modify buffers, we can end up flushing WAL way too frequently, and if we only rarely modify, the amount of buffers in the ring is too big.

[2] For synchronize_seqscans to work, the BAS_BULKREAD size needs to be smaller than SYNC_SCAN_REPORT_INTERVAL. The factor is 2x right now - way way too small with IO rates achievable on even a cheap laptop.
Cheap storage can do reads at gigabytes/second and one backend can process many hundreds of MB each second. Having only 128kB between SYNC_SCAN_REPORT_INTERVAL and the BAS_BULKREAD size makes it very likely that we have to re-read the buffer from the OS. That's bad with buffered IO, catastrophic with direct IO.

On a local NVMe SSD, on a large table with the normal BAS_BULKREAD, I often get effectively *no* buffer hits when running two concurrent seqscans, even if I disable query parallelism. Sometimes in the 10s, sometimes in the low thousands, always < 0.1%. At 16MB I get 25% hits, at 128MB it's close to 50% (the max you could get). With parallelism I see very low hit rates even with 16MB, but with 256MB it starts to get better.

[3] https://postgr.es/m/yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq%403r54w5zqldwn
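As an aside, here is how the proposed sizing formula from the email evaluates, assuming (my reading, not stated explicitly in the thread) that both terms are measured in blocks, so the cap of 256 blocks is 2MB at the default BLCKSZ:

```python
# Evaluate Min(256, io_combine_limit * (effective_io_concurrency + 1)),
# interpreting both terms as block counts (an assumption on my part).
BLCKSZ = 8192

def proposed_ring_blocks(io_combine_limit, effective_io_concurrency):
    return min(256, io_combine_limit * (effective_io_concurrency + 1))

# Defaults (io_combine_limit=16, effective_io_concurrency=16):
# 16 * 17 = 272, capped at 256 blocks = 2MB, versus today's 32 blocks (256kB).
blocks = proposed_ring_blocks(16, 16)
assert blocks == 256
assert blocks * BLCKSZ == 2 * 1024 * 1024
# Low effective_io_concurrency keeps the ring small: 16 * (1 + 1) = 32 blocks,
# i.e. the current size.
assert proposed_ring_blocks(16, 1) == 32
```

Under that reading the formula scales the ring with the configured IO concurrency while staying well below the 8-16MB sizes where the io_method=sync regression shows up.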