Thread: Re: Some read stream improvements
Hi,

On 2025-02-17 17:55:09 +1300, Thomas Munro wrote:
> 0004-Respect-pin-limits-accurately-in-read_stream.c.patch
>
> The current coding only computes the remaining "fair share" of the
> buffer pool for this backend at stream initialisation.  It's hard, but
> not impossible, to get one backend to pin more than 1/max_connections
> of the buffer pool (the intended cap), when using multiple streams at
> the same time in one backend.  This patch moves the time of check to
> the time of use, so it respects the limit strictly.  I avoid adding
> any changes to the fast path for all-cached streams, which only pin
> one buffer at a time.

I was working on expanding tests for AIO and as part of that wrote a test
for temp tables -- our coverage is fairly awful, there were many times
during AIO development where I knew I had trivially reachable temp table
specific bugs but all tests passed.

The test for that does trigger the problem described above and is fixed by
the patches in this thread (which I included in the other thread):
https://postgr.es/m/knr4aazlaa4nj3xnpe4tu6plwayovzxhmteatcpry2j6a6kc4v%40aonkl53s2ecs

Just linked instead of attached to not trigger cfbot.

Greetings,

Andres Freund
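The "fair share" idea quoted above boils down to a few lines.  The
following standalone sketch uses made-up values and helper names rather
than the actual bufmgr.c code; the point is only that the remaining
allowance is computed at time of use instead of being cached when the
stream is created:

    #include <stdio.h>

    /* Hypothetical stand-in values; in PostgreSQL these come from GUCs. */
    static const int NBuffers = 2048;       /* shared_buffers, in 8kB pages */
    static const int MaxConnections = 100;
    static int       MyPinnedBuffers = 0;   /* pins this backend already holds */

    /* The intended cap: one backend's fair share of the buffer pool. */
    static int
    fair_share_pin_limit(void)
    {
        return NBuffers / MaxConnections;
    }

    /*
     * Time-of-use check: how many more pins may this backend take right now?
     * Re-checking here, rather than once at stream initialisation, is what
     * keeps several concurrent streams in one backend under the shared cap.
     */
    static int
    additional_pin_limit(void)
    {
        int     remaining = fair_share_pin_limit() - MyPinnedBuffers;

        return remaining > 0 ? remaining : 0;
    }

    int
    main(void)
    {
        MyPinnedBuffers = 15;
        printf("fair share = %d, may still pin = %d\n",
               fair_share_pin_limit(), additional_pin_limit());
        return 0;
    }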
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> I was working on expanding tests for AIO and as part of that wrote a test for
> temp tables -- our coverage is fairly awful, there were many times during AIO
> development where I knew I had trivially reachable temp table specific bugs
> but all tests passed.
>
> The test for that does trigger the problem described above and is fixed by the
> patches in this thread (which I included in the other thread):

Thanks.  Alright, I'm assuming that you don't have any objections to the
way I restyled that API, so I'm going to go ahead and push some of these
shortly, and then follow up with a few newer patches that simplify and
improve the look-ahead and advice control.  More very soon.
On Thu, Feb 27, 2025 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I was working on expanding tests for AIO and as part of that wrote a test for
> > temp tables -- our coverage is fairly awful, there were many times during AIO
> > development where I knew I had trivially reachable temp table specific bugs
> > but all tests passed.
> >
> > The test for that does trigger the problem described above and is fixed by the
> > patches in this thread (which I included in the other thread):
>
> Thanks.  Alright, I'm assuming that you don't have any objections to
> the way I restyled that API, so I'm going to go ahead and push some of
> these shortly, and then follow up with a few newer patches that
> simplify and improve the look-ahead and advice control.  More very
> soon.

Ugh, I realised in another round of self-review that that version could
exceed the soft limit by a small amount if the registered callback pins
more buffers underneath it, so not pushed yet.  I think I see how to fix
that (namely the alternative design that a comment already contemplated),
more soon...
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I was working on expanding tests for AIO and as part of that wrote a test for
> > temp tables -- our coverage is fairly awful, there were many times during AIO
> > development where I knew I had trivially reachable temp table specific bugs
> > but all tests passed.
> >
> > The test for that does trigger the problem described above and is fixed by the
> > patches in this thread (which I included in the other thread):
>
> Thanks.  Alright, I'm assuming that you don't have any objections to
> the way I restyled that API, so I'm going to go ahead and push some of
> these shortly, and then follow up with a few newer patches that
> simplify and improve the look-ahead and advice control.  More very
> soon.

Indeed, no objections, rather the opposite.  Thanks!

Greetings,

Andres Freund
On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > development where I knew I had trivially reachable temp table specific bugs
> > > but all tests passed.
> > >
> > > The test for that does trigger the problem described above and is fixed by the
> > > patches in this thread (which I included in the other thread):

Here is a subset of those patches again:

1.  Per-backend buffer limit, take III.  Now the check is in
read_stream_start_pending_read() so TOC == TOU.

Annoyingly, test cases like the one below still fail, despite following
the rules.  The other streams eat all the buffers and then one gets an
allowance of zero, but uses its right to take one pin anyway to make
progress, and there isn't one.  I wonder if we should use temp_buffers -
100?  Then leave the minimum GUC value at 100 still, so you have an easy
way to test with 0, 1, ... additional buffers?

2.  It shouldn't give up issuing random advice immediately after a jump,
or it could stall on (say) the second 128kB of a 256kB sequential chunk
(ie the strace you showed on the BHS thread).  It only makes sense to
assume kernel readahead takes over once you've actually *read*
sequentially.  In practice this makes it a lot more aggressive about
advice (like the BHS code in master): it only gives up if the whole
look-ahead window is sequential.

3.  Change the distance algorithm to care only about hits and misses, not
sequential heuristics.  It made at least some sense before, but it doesn't
make sense for AIO, and even in synchronous mode it means that you hit
random jumps with insufficient look-ahead, so I don't think we should keep
it.

I also realised that the sequential heuristics are confused by that hidden
trailing block thing, so in contrived pattern testing with
hit-miss-hit-miss... would be considered sequential, and even if you fix
that (the forwarding patches above fix that), an exact hit-miss-hit-miss
pattern also gets stuck between distances 1 and 2 (double, decrement,
double, ...); might be worth waiting a bit longer before decrementing,
IDK.

I'll rebase the others and post soon.

set io_combine_limit = 32;
set temp_buffers = 100;
create temp table t1 as select generate_series(1, 10000);
create temp table t2 as select generate_series(1, 10000);
create temp table t3 as select generate_series(1, 10000);
create temp table t4 as select generate_series(1, 10000);
create temp table t5 as select generate_series(1, 10000);
do
$$
declare
  c1 cursor for select * from t1;
  c2 cursor for select * from t2;
  c3 cursor for select * from t3;
  c4 cursor for select * from t4;
  c5 cursor for select * from t5;
  x record;
begin
  open c1;
  open c2;
  open c3;
  open c4;
  open c5;
  loop
    fetch next from c1 into x;
    exit when not found;
    fetch next from c2 into x;
    exit when not found;
    fetch next from c3 into x;
    exit when not found;
    fetch next from c4 into x;
    exit when not found;
    fetch next from c5 into x;
    exit when not found;
  end loop;
end;
$$;
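A toy model of why the script above runs out of buffers (this is not
read_stream.c's actual ramp-up logic; the constants and the round-robin
loop just mirror the settings and the DO block): five streams share
temp_buffers = 100 local pins and each tries to build up to
io_combine_limit = 32, so the pool is exhausted long before any stream
gets there, and the stream whose allowance reaches zero has no unpinned
buffer left to take its guaranteed single pin from.

    #include <stdio.h>

    #define TEMP_BUFFERS     100   /* local buffer pool shared by all streams */
    #define IO_COMBINE_LIMIT  32   /* each stream wants to ramp up to this */
    #define NSTREAMS           5   /* one per cursor in the script above */

    static int pinned_total;              /* analogue of NLocalPinnedBuffers */
    static int pinned_by_stream[NSTREAMS];

    /* Time-of-use allowance: local buffers still unpinned. */
    static int
    additional_local_pin_limit(void)
    {
        return TEMP_BUFFERS - pinned_total;
    }

    int
    main(void)
    {
        /* Round-robin, one fetch per cursor per turn, as in the DO block. */
        for (int turn = 0; turn < IO_COMBINE_LIMIT; turn++)
        {
            for (int s = 0; s < NSTREAMS; s++)
            {
                if (additional_local_pin_limit() == 0)
                {
                    /*
                     * This stream's allowance is zero.  It may still take
                     * one pin to make progress, but no unpinned buffer
                     * exists, so this is where the real code must error.
                     */
                    printf("stream %d starved: holds %d pins, %d pinned in total\n",
                           s, pinned_by_stream[s], pinned_total);
                    return 1;
                }
                pinned_by_stream[s]++;
                pinned_total++;
            }
        }
        return 0;
    }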
Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > > development where I knew I had trivially reachable temp table specific bugs
> > > > but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is fixed by the
> > > > patches in this thread (which I included in the other thread):
>
> Here is a subset of those patches again:
>
> 1.  Per-backend buffer limit, take III.  Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
>
> Annoyingly, test cases like the one below still fail, despite
> following the rules.  The other streams eat all the buffers and then
> one gets an allowance of zero, but uses its right to take one pin
> anyway to make progress, and there isn't one.

I think that may be ok.  If there are no unpinned buffers, it seems to be
expected that starting a new stream will fail.  That's not the same as -
as we did previously - failing in a read stream that did start
successfully, because we issue large IOs even though there are only a
small number of unpinned buffers.


> I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?


> 2.  It shouldn't give up issuing random advice immediately after a
> jump, or it could stall on (say) the second 128kB of a 256kB
> sequential chunk (ie the strace you showed on the BHS thread).  It
> only makes sense to assume kernel readahead takes over once you've
> actually *read* sequentially.  In practice this makes it a lot more
> aggressive about advice (like the BHS code in master): it only gives
> up if the whole look-ahead window is sequential.

> 3.  Change the distance algorithm to care only about hits and misses,
> not sequential heuristics.  It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
>
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing with
> hit-miss-hit-miss... would be considered sequential, and even if you
> fix that (the forwarding patches above fix that), an exact
> hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
> (double, decrement, double, ...); might be worth waiting a bit longer
> before decrementing, IDK.
>
> I'll rebase the others and post soon.

> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> +	Assert(NLocalPinnedBuffers <= num_temp_buffers);
> +	return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does - it can
return 0.  Which makes some sense, pinning an additional buffer will
always fail.  Perhaps worth calling out though?


> static void
> read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> {
> 	while (stream->ios_in_progress < stream->max_ios &&
> -		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> +		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> +			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0"
really mean?  And when would it be true when the pre-existing condition
wouldn't already be true?


> 	{
> 		BlockNumber blocknum;
> 		int16		buffer_index;
> 		void	   *per_buffer_data;
>
> -		if (stream->pending_read_nblocks == io_combine_limit)
> +		/* If have a pending read that can't be extended, start it now. */
> +		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> +			   stream->max_pinned_buffers);
> +		if (stream->pending_read_nblocks == io_combine_limit ||
> +			(stream->pinned_buffers == 0 &&
> +			 stream->pending_read_nblocks == stream->max_pinned_buffers))
> 		{
> 			read_stream_start_pending_read(stream, suppress_advice);
> 			suppress_advice = false;
> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> 	/* We have to start the pending read before we can build another. */
> 	while (stream->pending_read_nblocks > 0)
> 	{
> -		read_stream_start_pending_read(stream, suppress_advice);
> -		suppress_advice = false;
> -		if (stream->ios_in_progress == stream->max_ios)
> +		if (!read_stream_start_pending_read(stream, suppress_advice) ||
> +			stream->ios_in_progress == stream->max_ios)
> 		{
> -			/* And we've hit the limit.  Rewind, and stop here. */
> +			/* And we've hit a buffer or I/O limit.  Rewind and wait. */
> 			read_stream_unget_block(stream, blocknum);
> 			return;
> 		}
> +
> +		suppress_advice = false;
> 	}

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()?  IIUC
that'll defer one block for later - but what about the other buffers in a
multi-block read?


> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> 	else
> 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>
> -	/*
> -	 * If advice hasn't been suppressed, this system supports it, and this
> -	 * isn't a strictly sequential pattern, then we'll issue advice.
> -	 */
> -	if (!suppress_advice &&
> -		stream->advice_enabled &&
> -		stream->pending_read_blocknum != stream->seq_blocknum)
> +	/* Do we need to issue read-ahead advice? */
> +	flags = 0;
> +	if (stream->advice_enabled)
> +	{
> 		flags = READ_BUFFERS_ISSUE_ADVICE;
> -	else
> -		flags = 0;
> +
> +		if (stream->pending_read_blocknum == stream->seq_blocknum)
> +		{
> +			/*
> +			 * Suppress advice if our WaitReadBuffers() calls have caught up
> +			 * with the first advice we issued for this sequential run.
> +			 */
> +			if (stream->seq_start == InvalidBlockNumber)
> +				suppress_advice = true;
> +		}
> +		else
> +		{
> +			/* Random jump, so start a new sequential run. */
> +			stream->seq_start = stream->pending_read_blocknum;
> +		}
> +
> +		if (suppress_advice)
> +			flags = 0;
> +	}

Seems a bit confusing to first set
> 		flags = READ_BUFFERS_ISSUE_ADVICE
to then later unset it again.  Maybe just set it in if (!suppress_advice)?


> 	 * Skip the initial ramp-up phase if the caller says we're going to be
> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> 			distance = stream->distance * 2;
> 			distance = Min(distance, stream->max_pinned_buffers);
> 			stream->distance = distance;
> +
> +			/*
> +			 * If we've caught up with the first advice issued for the current
> +			 * sequential run, cancel further advice until the next random
> +			 * jump.  The kernel should be able to see the pattern now that
> +			 * we're issuing sequential preadv() calls.
> +			 */
> +			if (stream->ios[io_index].op.blocknum == stream->seq_start)
> +				stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of sequentialness, it
denotes up to where the caller needs to process before we disable
sequential access.  Maybe add a comment to it and rename it to something
like ->seq_until_processed?


Other than this the approach seems to make sense!

Greetings,

Andres Freund
On Wed, Mar 12, 2025 at 8:29 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> > On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> > value at 100 still, so you have an easy way to test with 0, 1,
> > ... additional buffers?
>
> I think that just makes it harder to test the exhaustion scenario without
> really fixing anything?

Not sure I agree yet but I'll come back to this in a bit (I think there
might be something worth thinking about some more here but it's not in the
way of committing these patches).

> > +/* see GetAdditionalPinLimit() */
> > +uint32
> > +GetAdditionalLocalPinLimit(void)
> > +{
> > +	Assert(NLocalPinnedBuffers <= num_temp_buffers);
> > +	return num_temp_buffers - NLocalPinnedBuffers;
> > +}
>
> This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
> 0.  Which makes some sense, pinning an additional buffer will always
> fail.  Perhaps worth calling out though?

No, GetAdditionalPinLimit() works that way too.  It's only
LimitAdditional[Local]PinLimit() that has the special "you can always have
one" logic that I needed to escape from.  But yes I should highlight that
in a comment: done above GetAdditionalPinLimit().
GetAdditionalLocalPinLimit() just tells you to see the shared version.

> > static void
> > read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> > {
> > 	while (stream->ios_in_progress < stream->max_ios &&
> > -		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> > +		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> > +			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
>
> What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
> mean?  And when would it be true when the pre-existing condition wouldn't
> already be true?

Well the reason is basically that the distance can get chomped lower than
pending_read_nblocks if you recently hit the pin limit, and we have to be
able to start your I/O anyway (at least one block of it) to make progress.
But I realised that was a stupid way to handle that, and actually I had
screwed up further down, and the right way is just:

@@ -382,15 +435,25 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * io_combine_limit size once more buffers have been consumed.  However,
 	 * if we've already reached io_combine_limit, or we've reached the
 	 * distance limit and there isn't anything pinned yet, or the callback has
-	 * signaled end-of-stream, we start the read immediately.
+	 * signaled end-of-stream, we start the read immediately.  Note that the
+	 * pending read could even exceed the distance goal, if the latter was
+	 * reduced on buffer limit exhaustion.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
 		(stream->pending_read_nblocks == stream->io_combine_limit ||
-		 (stream->pending_read_nblocks == stream->distance &&
+		 (stream->pending_read_nblocks >= stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
 		stream->ios_in_progress < stream->max_ios)
 		read_stream_start_pending_read(stream, suppress_advice);

> > 	{
> > 		BlockNumber blocknum;
> > 		int16		buffer_index;
> > 		void	   *per_buffer_data;
> >
> > -		if (stream->pending_read_nblocks == io_combine_limit)
> > +		/* If have a pending read that can't be extended, start it now. */
> > +		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> > +			   stream->max_pinned_buffers);
> > +		if (stream->pending_read_nblocks == io_combine_limit ||
> > +			(stream->pinned_buffers == 0 &&
> > +			 stream->pending_read_nblocks == stream->max_pinned_buffers))
> > 		{
> > 			read_stream_start_pending_read(stream, suppress_advice);
> > 			suppress_advice = false;
> > @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> > 	/* We have to start the pending read before we can build another. */
> > 	while (stream->pending_read_nblocks > 0)
> > 	{
> > -		read_stream_start_pending_read(stream, suppress_advice);
> > -		suppress_advice = false;
> > -		if (stream->ios_in_progress == stream->max_ios)
> > +		if (!read_stream_start_pending_read(stream, suppress_advice) ||
> > +			stream->ios_in_progress == stream->max_ios)
> > 		{
> > -			/* And we've hit the limit.  Rewind, and stop here. */
> > +			/* And we've hit a buffer or I/O limit.  Rewind and wait. */
> > 			read_stream_unget_block(stream, blocknum);
> > 			return;
> > 		}
> > +
> > +		suppress_advice = false;
> > 	}
>
> If read_stream_start_pending_read() returns false because we hit the pin
> limit, does it really help to call read_stream_unget_block()?  IIUC that'll
> defer one block for later - but what about the other buffers in a multi-block
> read?

They are already represented by (pending_read_blocknum,
pending_read_nblocks).  We were unable to append this block number to the
pending read because it's already full-sized or the newly acquired block
number isn't consecutive, but we were also unable to start the pending
read to get it out of the way.  That was a pre-existing edge case that you
could already hit if it turned out to be a short read: ie the remaining
part of the pending read is still in your way, and now you've reached
stream->max_ios so you can't start it.  So we had to put it aside for
later.

This change piggy-backs on that approach for buffer starvation:
read_stream_start_buffers() can now decline to start even a partial read.
In fact it usually declines, unless it is forced to accept a short read
because stream->pinned_buffers == 0 (ie, we have to do whatever we can to
make progress).  It's OK that pending_read_nblocks exceeds what we can
start right now, we still remember it, and we can always start it one
block at a time if it comes to it.  Make sense?

> > @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> > 	else
> > 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
> >
> > -	/*
> > -	 * If advice hasn't been suppressed, this system supports it, and this
> > -	 * isn't a strictly sequential pattern, then we'll issue advice.
> > -	 */
> > -	if (!suppress_advice &&
> > -		stream->advice_enabled &&
> > -		stream->pending_read_blocknum != stream->seq_blocknum)
> > +	/* Do we need to issue read-ahead advice? */
> > +	flags = 0;
> > +	if (stream->advice_enabled)
> > +	{
> > 		flags = READ_BUFFERS_ISSUE_ADVICE;
> > -	else
> > -		flags = 0;
> > +
> > +		if (stream->pending_read_blocknum == stream->seq_blocknum)
> > +		{
> > +			/*
> > +			 * Suppress advice if our WaitReadBuffers() calls have caught up
> > +			 * with the first advice we issued for this sequential run.
> > +			 */
> > +			if (stream->seq_start == InvalidBlockNumber)
> > +				suppress_advice = true;
> > +		}
> > +		else
> > +		{
> > +			/* Random jump, so start a new sequential run. */
> > +			stream->seq_start = stream->pending_read_blocknum;
> > +		}
> > +
> > +		if (suppress_advice)
> > +			flags = 0;
> > +	}
>
> Seems a bit confusing to first set
> > 		flags = READ_BUFFERS_ISSUE_ADVICE
> to then later unset it again.  Maybe just set it in if (!suppress_advice)?

Yeah that was a bit too tangly.  I found a better expression of that
logic, which also removed that annoying suppress_advice function argument.
I hope this is much clearer.

> > 	 * Skip the initial ramp-up phase if the caller says we're going to be
> > @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> > 			distance = stream->distance * 2;
> > 			distance = Min(distance, stream->max_pinned_buffers);
> > 			stream->distance = distance;
> > +
> > +			/*
> > +			 * If we've caught up with the first advice issued for the current
> > +			 * sequential run, cancel further advice until the next random
> > +			 * jump.  The kernel should be able to see the pattern now that
> > +			 * we're issuing sequential preadv() calls.
> > +			 */
> > +			if (stream->ios[io_index].op.blocknum == stream->seq_start)
> > +				stream->seq_start = InvalidBlockNumber;
>
> So stream->seq_start doesn't really denote the start of sequentialness, it
> denotes up to where the caller needs to process before we disable sequential
> access.  Maybe add a comment to it and rename it to something like
> ->seq_until_processed?

WFM.

> Other than this the approach seems to make sense!

Cool, so I'm planning to start pushing the earlier ones tomorrow.  Here
also are the buffer forwarding ones, rebased on top.

Here's some strace art showing the old and new advice for patch 0003.  I
traced ANALYZE with io_combine_limit=8 and used different
default_statistics_targets to find interesting test cases.  The "bracket"
on the right shows a sequential range of blocks.

Master only calls fadvise once per sequential chunk:

fadvise ●──────────────────────╮││            3  0.000006
││╰─► pread     1    676..676                 2  0.000007
fadvise ●─────────────────────╮││             3  0.000006
││╰──► pread    1    678..678                 2  0.000007
fadvise ●────────────────────╮││              3  0.000007
││╰───► pread   3    680..682                 2  0.000031
│╰────► pread   6    684..689                 1  0.000015
╰─────► pread   8    691..698   ─╮            0  0.000018
        pread   8    699..706    │            0  0.000016
fadvise ●────────────────────────╮ │          1  0.000007
│       pread   8    707..714    │            1  0.000019
│       pread   7    715..721   ─╯            1  0.000017
╰─►     pread   8    723..730   ─╮            0  0.000016
        pread   8    731..738    │            0  0.000019
fadvise ●────────────────────────╮ │          1  0.000007
│       pread   8    739..746    │            1  0.000018
│       pread   5    747..751   ─╯            1  0.000013

The patch can call three times when that's the configured I/O concurrency
level, because that controls when the pread() calls catch up with the
first block:

fadvise ●────────────────────╮││              3  0.000007
││╰───► pread   2    255..256                 2  0.000014
fadvise ●───────────────────╮││               3  0.000007
││╰────► pread  8    258..265   ─╮            2  0.000035
│╰─────► preadv 8    266..273    │            1  0.000021
╰──────► pread  8    274..281    │            0  0.000017
fadvise ●────────────────────────╮ │          1  0.000007
│       pread   8    282..289    │            1  0.000017
fadvise ●───────────────────────╮│ │          2  0.000007
││      pread   6    290..295   ─╯            2  0.000015
fadvise ●──────────────────────╮││            3  0.000007
││╰─►   pread   8    297..304   ─╮            2  0.000015
fadvise ●─────────────────────╮││  │          3  0.000007
││╰──►  pread   8    305..312    │            2  0.000017

Purely sequential streams still get none:

pread   1    0..0           ─╮                0  0.000016
pread   2    1..2            │                0  0.000014
pread   4    3..6            │                0  0.000021
pread   8    7..14           │                0  0.000034
... blah blah blah ...
pread   8    4455..4462      │                0  0.000029
pread   8    4463..4470      │                0  0.000026
pread   8    4471..4478      │                0  0.000020
pread   1    4479..4479     ─╯                0  0.000010
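The advice heuristic in those two hunks can be restated as a small state
machine.  The sketch below uses hypothetical names and keeps only the
advice decision: a random jump starts a new sequential run and keeps
advice on; once a completed read catches up with the block that started
the run, advice is suppressed until the next jump.

    #include <stdbool.h>
    #include <stdio.h>

    #define INVALID_BLOCK ((unsigned) -1)

    /* Per-stream state, loosely mirroring seq_blocknum and seq_until_processed. */
    static unsigned next_seq_blocknum = 0;            /* block after the last read started */
    static unsigned seq_until_processed = INVALID_BLOCK;

    /* Decide whether to issue advice for a read of nblocks starting at blocknum. */
    static bool
    start_read(unsigned blocknum, unsigned nblocks)
    {
        bool    advice;

        if (blocknum != next_seq_blocknum)
        {
            /* Random jump: remember where this sequential run began, keep advising. */
            seq_until_processed = blocknum;
            advice = true;
        }
        else
        {
            /* Sequential: advise only until reads catch up with the run's start. */
            advice = (seq_until_processed != INVALID_BLOCK);
        }
        next_seq_blocknum = blocknum + nblocks;
        return advice;
    }

    /* Called when the read that started at blocknum has been waited for. */
    static void
    read_completed(unsigned blocknum)
    {
        if (blocknum == seq_until_processed)
            seq_until_processed = INVALID_BLOCK;    /* kernel readahead can take over */
    }

    int
    main(void)
    {
        printf("%d\n", start_read(100, 8)); /* jump: advice */
        printf("%d\n", start_read(108, 8)); /* sequential, run start not read yet: advice */
        read_completed(100);                /* the first read of the run completes */
        printf("%d\n", start_read(116, 8)); /* sequential and caught up: no advice */
        printf("%d\n", start_read(300, 8)); /* another jump: advice again */
        return 0;
    }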
Attachments:
- v3-0001-Improve-buffer-manager-API-for-backend-pin-limits.patch
- v3-0002-Respect-pin-limits-accurately-in-read_stream.c.patch
- v3-0003-Improve-read-stream-advice-for-large-random-chunk.patch
- v3-0004-Look-ahead-more-when-sequential-in-read_stream.c.patch
- v3-0005-Support-buffer-forwarding-in-read_stream.c.patch
- v3-0006-Support-buffer-forwarding-in-StartReadBuffers.patch
I have pushed the new pin limit patches, after some more testing and copy
editing.  I dropped an unnecessary hunk (in read_stream_reset(), a change
I'd like to make but it didn't belong in this commit) and dropped the word
"Soft" from GetSoftPinLimit() as it wasn't helping comprehension and isn't
even true for the local buffer version (which I still think might be an
issue, will come back to that, but again it seemed independent).

For the record, here's a way to see 92fc6856^ or v17 misbehave and pin too
many buffers without involving any other proposed patches, only v17
streaming seq scan: with shared_buffers=16MB, max_connections=4, which
gives MaxProportionalBuffers == 44, the attached shows three cursors each
pinning io_combine_limit = 32 buffers for a total of 96 at peak.  That's
because each cursor starts a stream and sees (the only time it would
check) that it is allowed 44, and then we fetch in round-robin order until
they all ramp up to full I/O size.

In v17 we can't really do much worse than that, and it requires pretty
contrived settings and workload for it to be a real issue AFAIK, but
obviously and hopefully we soon will, eg with BHS and more.
Hi,

On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
> I have pushed the new pin limit patches, after some more testing and
> copy editing.  I dropped an unnecessary hunk (in read_stream_reset(), a
> change I'd like to make but it didn't belong in this commit) and
> dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
> comprehension and isn't even true for the local buffer version (which
> I still think might be an issue, will come back to that, but again it
> seemed independent).

Something odd: I was looking to push a test that increases localbuf.c
coverage, which passed with a previous version of these changes.  However,
it failed, and it did so on FreeBSD alone.  The same test doesn't fail
when tested together with the rest of the AIO work.

https://cirrus-ci.com/build/5951090857869312
https://cirrus-ci.com/task/6177637229395968

I do not understand what could be OS dependent here.  I tried to build
with exactly the same options as CI on freebsd, but couldn't repro the
issue.  It's perhaps worth noting that this failed even before my recent
localbuf: commits.

I ran CI with a patch that adds a bunch of debug information and just runs
the relevant test:
https://cirrus-ci.com/task/6516935619248128

2025-03-17 16:19:14.270 UTC client backend[5526] pg_regress/temp LOG:  statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  NlocalPinnedBuffers=98++
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  NlocalPinnedBuffers=99--
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  pgsr create 0xde34f1f57f8: test_temp, max_pinned: 100, NLocPin: 98
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  pgsr 0xde34f1f57f8: buffer_limit: 2, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 98

comparing that with a local run:

2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp LOG:  statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr create 0x56029555cad8: test_temp, max_pinned: 100, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1

So one thing is that the pin count differs by 1 at the start of the scan.
No idea why.

But knowing that I was able to repro the issue locally too, by just
ensuring the pin count is one higher by the start of the query.

I still don't know what drives the difference between freebsd and the
rest, but IIUC the reason this fails is just that we are holding too many
buffers pinned, due to some buffers being pinned outside of
read_stream.c.

Greetings,

Andres Freund
On Tue, Mar 18, 2025 at 5:56 AM Andres Freund <andres@anarazel.de> wrote:
> So one thing is that the pin count differs by 1 at the start of the scan.
> No idea why.
>
> I still don't know what drives the difference between freebsd and the
> rest, but IIUC the reason this fails is just that we are holding too many
> buffers pinned, due to some buffers being pinned outside of read_stream.c.

I couldn't reproduce this on my local FreeBSD box, but I think I see one
part of the problem: the cursor a few lines up has a stream with a higher
distance holding a couple of pins.  Not sure how the local buffer pool got
into a state that caused that if it isn't doing the same on other
machines, but anyway, if I read the test right you intend to pin strictly
one page per cursor, so I tried saying so explicitly:

-    query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid >= '(%s, 1)'::tid $q$, cursorname, i);
+    query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid between '(%s, 1)'::tid and '(%s, 9999)'::tid $q$, cursorname, i, i);

That passed on the FreeBSD CI task.
Hi,

On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
> I have pushed the new pin limit patches, after some more testing and
> copy editing.  I dropped an unnecessary hunk (in read_stream_reset(), a
> change I'd like to make but it didn't belong in this commit) and
> dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
> comprehension and isn't even true for the local buffer version (which
> I still think might be an issue, will come back to that, but again it
> seemed independent).

I found an easy-to-fix issue in read_stream.c.  If the limit returned by
GetAdditionalPinLimit() is very large, the buffer_limit variable in
read_stream_start_pending_read() can overflow.  While the code is careful
to limit buffer_limit to PG_INT16_MAX, we subsequently add the number of
forwarded buffers:

	if (stream->temporary)
		buffer_limit = Min(GetAdditionalLocalPinLimit(), PG_INT16_MAX);
	else
		buffer_limit = Min(GetAdditionalPinLimit(), PG_INT16_MAX);

	Assert(stream->forwarded_buffers <= stream->pending_read_nblocks);
	buffer_limit += stream->forwarded_buffers;

I think there's a second, theoretical, overflow issue shortly after:

	/* Shrink distance: no more look-ahead until buffers are released. */
	new_distance = stream->pinned_buffers + buffer_limit;
	if (stream->distance > new_distance)
		stream->distance = new_distance;

While that was hit in the case I was looking at, it was only hit because
buffer_limit had already wrapped around into the negative.  I don't think
nblocks can be big enough to really hit this.

I don't actually see any reason for buffer_limit to be a 16bit quantity?
It's just to clamp things down, right?

Greetings,

Andres Freund
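A self-contained illustration of the first wraparound (the values are made
up and only the variable names mirror the snippet above): clamping to
PG_INT16_MAX and then adding the forwarded buffers pushes a 16-bit
buffer_limit past its maximum, and the result typically comes back
negative.

    #include <stdint.h>
    #include <stdio.h>

    #define PG_INT16_MAX INT16_MAX
    #define Min(x, y) ((x) < (y) ? (x) : (y))

    int
    main(void)
    {
        /* Pretend GetAdditionalPinLimit() returned something huge. */
        uint32_t    additional_pin_limit = 1000000;
        int16_t     forwarded_buffers = 4;

        /* Clamp first, as in the snippet above... */
        int16_t     buffer_limit = Min(additional_pin_limit, PG_INT16_MAX);

        /* ...then add the forwarded buffers: 32767 + 4 no longer fits in int16. */
        buffer_limit += forwarded_buffers;

        /* Typically prints -32765: the limit has gone negative. */
        printf("buffer_limit = %d\n", buffer_limit);
        return 0;
    }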
On Thu, Apr 3, 2025 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
> I don't actually see any reason for buffer_limit to be a 16bit quantity?
> It's just to clamp things down, right?

Ugh.  It might be worth just flipping this whole thing over to ints, let
me look into that...