Re: Some read stream improvements - Mailing list pgsql-hackers

From:           Thomas Munro
Subject:        Re: Some read stream improvements
Date:
Msg-id:         CA+hUKGKH2mwj=mpCobC4ZyBdR4ToW2hpQcYFDivj40fxf4yQow@mail.gmail.com
In response to: Re: Some read stream improvements (Andres Freund <andres@anarazel.de>)
Responses:      Re: Some read stream improvements
List:           pgsql-hackers
On Wed, Mar 12, 2025 at 8:29 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> > On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> > value at 100 still, so you have an easy way to test with 0, 1,
> > ... additional buffers?
>
> I think that just makes it harder to test the exhaustion scenario without
> really fixing anything?

Not sure I agree yet but I'll come back to this in a bit (I think there
might be something worth thinking about some more here but it's not in
the way of committing these patches).

> > +/* see GetAdditionalPinLimit() */
> > +uint32
> > +GetAdditionalLocalPinLimit(void)
> > +{
> > +    Assert(NLocalPinnedBuffers <= num_temp_buffers);
> > +    return num_temp_buffers - NLocalPinnedBuffers;
> > +}
>
> This doesn't behave quite the way GetAdditionalPinLimit() does - it can
> return 0.  Which makes some sense, pinning an additional buffer will
> always fail.  Perhaps worth calling out though?

No, GetAdditionalPinLimit() works that way too.  It's only
LimitAdditional[Local]PinLimit() that has the special "you can always
have one" logic that I needed to escape from.  But yes I should
highlight that in a comment: done above GetAdditionalPinLimit().
GetAdditionalLocalPinLimit() just tells you to see the shared version.
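In case it helps to see the intended usage pattern, here's a rough
sketch of how I imagine a caller combining these (illustration only,
not committed code; planned_nblocks and pinned_already are made-up
variables standing in for the caller's own state):

    uint32      limit = GetAdditionalPinLimit();
    uint32      nblocks;

    if (limit > 0)
        nblocks = Min(planned_nblocks, limit);  /* clamp the read size */
    else if (pinned_already == 0)
        nblocks = 1;    /* the "you can always have one" escape hatch */
    else
        nblocks = 0;    /* wait until some pins are released */

The point being that a zero return is expected, and the caller decides
when to fall back to the guaranteed single pin, rather than having the
LimitAdditional[Local]PinLimit() behaviour decide for it.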
> > static void
> > read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> > {
> >     while (stream->ios_in_progress < stream->max_ios &&
> > -           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> > +           ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> > +            stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
>
> What does the new "stream->pinned_buffers == 0 && stream->distance > 0"
> really mean?  And when would it be true when the pre-existing condition
> wouldn't already be true?

Well the reason is basically that the distance can get chomped lower
than pending_read_nblocks if you recently hit the pin limit, and we
have to be able to start your I/O anyway (at least one block of it) to
make progress.  But I realised that was a stupid way to handle that,
and actually I had screwed up further down, and the right way is just:

@@ -382,15 +435,25 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
      * io_combine_limit size once more buffers have been consumed.  However,
      * if we've already reached io_combine_limit, or we've reached the
      * distance limit and there isn't anything pinned yet, or the callback has
-     * signaled end-of-stream, we start the read immediately.
+     * signaled end-of-stream, we start the read immediately.  Note that the
+     * pending read could even exceed the distance goal, if the latter was
+     * reduced on buffer limit exhaustion.
      */
     if (stream->pending_read_nblocks > 0 &&
         (stream->pending_read_nblocks == stream->io_combine_limit ||
-         (stream->pending_read_nblocks == stream->distance &&
+         (stream->pending_read_nblocks >= stream->distance &&
           stream->pinned_buffers == 0) ||
          stream->distance == 0) &&
         stream->ios_in_progress < stream->max_ios)
         read_stream_start_pending_read(stream, suppress_advice);

> > {
> >     BlockNumber blocknum;
> >     int16       buffer_index;
> >     void       *per_buffer_data;
> >
> > -   if (stream->pending_read_nblocks == io_combine_limit)
> > +   /* If we have a pending read that can't be extended, start it now. */
> > +   Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> > +          stream->max_pinned_buffers);
> > +   if (stream->pending_read_nblocks == io_combine_limit ||
> > +       (stream->pinned_buffers == 0 &&
> > +        stream->pending_read_nblocks == stream->max_pinned_buffers))
> >     {
> >         read_stream_start_pending_read(stream, suppress_advice);
> >         suppress_advice = false;
> > @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> >     /* We have to start the pending read before we can build another. */
> >     while (stream->pending_read_nblocks > 0)
> >     {
> > -       read_stream_start_pending_read(stream, suppress_advice);
> > -       suppress_advice = false;
> > -       if (stream->ios_in_progress == stream->max_ios)
> > +       if (!read_stream_start_pending_read(stream, suppress_advice) ||
> > +           stream->ios_in_progress == stream->max_ios)
> >         {
> > -           /* And we've hit the limit.  Rewind, and stop here. */
> > +           /* And we've hit a buffer or I/O limit.  Rewind and wait. */
> >             read_stream_unget_block(stream, blocknum);
> >             return;
> >         }
> > +
> > +       suppress_advice = false;
> >     }
>
> If read_stream_start_pending_read() returns false because we hit the pin
> limit, does it really help to call read_stream_unget_block()?  IIUC
> that'll defer one block for later - but what about the other buffers in
> a multi-block read?

They are already represented by (pending_read_blocknum,
pending_read_nblocks).  We were unable to append this block number to
the pending read because it's already full-sized or the newly acquired
block number isn't consecutive, but we were also unable to start the
pending read to get it out of the way.  That was a pre-existing edge
case that you could already hit if it turned out to be a short read:
ie the remaining part of the pending read is still in your way, and
now you've reached stream->max_ios so you can't start it.  So we had
to put it aside for later.  This change piggy-backs on that approach
for buffer starvation: read_stream_start_pending_read() can now
decline to start even a partial read.  In fact it usually declines,
unless it is forced to accept a short read because
stream->pinned_buffers == 0 (ie we have to do whatever we can to make
progress).  It's OK that pending_read_nblocks exceeds what we can
start right now; we still remember it, and we can always start it one
block at a time if it comes to it.  Make sense?
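To make that concrete with some invented numbers: suppose
io_combine_limit = 8, the pending read covers blocks 100..107 (already
full-sized), and the callback just handed us block 108.  We can't
extend the pending read, and if read_stream_start_pending_read()
declines to start it, we can't consume 108 either, so:

    read_stream_unget_block(stream, blocknum); /* 108 goes back for later */
    return;

Blocks 100..107 stay represented by (pending_read_blocknum,
pending_read_nblocks), so nothing is lost; once pins or I/O slots free
up, the pending read gets started, one block at a time if that's all
we can manage.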
> > @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> >     else
> >         Assert(stream->next_buffer_index == stream->oldest_buffer_index);
> >
> > -   /*
> > -    * If advice hasn't been suppressed, this system supports it, and this
> > -    * isn't a strictly sequential pattern, then we'll issue advice.
> > -    */
> > -   if (!suppress_advice &&
> > -       stream->advice_enabled &&
> > -       stream->pending_read_blocknum != stream->seq_blocknum)
> > +   /* Do we need to issue read-ahead advice? */
> > +   flags = 0;
> > +   if (stream->advice_enabled)
> > +   {
> >         flags = READ_BUFFERS_ISSUE_ADVICE;
> > -   else
> > -       flags = 0;
> > +
> > +       if (stream->pending_read_blocknum == stream->seq_blocknum)
> > +       {
> > +           /*
> > +            * Suppress advice if our WaitReadBuffers() calls have caught up
> > +            * with the first advice we issued for this sequential run.
> > +            */
> > +           if (stream->seq_start == InvalidBlockNumber)
> > +               suppress_advice = true;
> > +       }
> > +       else
> > +       {
> > +           /* Random jump, so start a new sequential run. */
> > +           stream->seq_start = stream->pending_read_blocknum;
> > +       }
> > +
> > +       if (suppress_advice)
> > +           flags = 0;
> > +   }
>
> Seems a bit confusing to first set
>   flags = READ_BUFFERS_ISSUE_ADVICE
> to then later unset it again.  Maybe just set it in if (!suppress_advice)?

Yeah, that was a bit too tangly.  I found a better expression of that
logic, which also removed that annoying suppress_advice function
argument.  I hope this is much clearer.

> >     * Skip the initial ramp-up phase if the caller says we're going to be
> > @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> >         distance = stream->distance * 2;
> >         distance = Min(distance, stream->max_pinned_buffers);
> >         stream->distance = distance;
> > +
> > +       /*
> > +        * If we've caught up with the first advice issued for the current
> > +        * sequential run, cancel further advice until the next random
> > +        * jump.  The kernel should be able to see the pattern now that
> > +        * we're issuing sequential preadv() calls.
> > +        */
> > +       if (stream->ios[io_index].op.blocknum == stream->seq_start)
> > +           stream->seq_start = InvalidBlockNumber;
>
> So stream->seq_start doesn't really denote the start of sequentialness,
> it denotes up to where the caller needs to process before we disable
> sequential access.  Maybe add a comment to it and rename it to something
> like ->seq_until_processed?

WFM.

> Other than this the approach seems to make sense!

Cool, so I'm planning to start pushing the earlier ones tomorrow.
Here also are the buffer forwarding ones, rebased on top.

Here's some strace art showing the old and new advice for patch 0003.
I traced ANALYZE with io_combine_limit=8 and used different
default_statistics_targets to find interesting test cases.  The
"bracket" on the right shows a sequential range of blocks.

Master only calls fadvise once per sequential chunk:

fadvise ●──────────────────────╮││ 3 0.000006
││╰─► pread 1 676..676 2 0.000007
fadvise ●─────────────────────╮││ 3 0.000006
││╰──► pread 1 678..678 2 0.000007
fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 3 680..682 2 0.000031
│╰────► pread 6 684..689 1 0.000015
╰─────► pread 8 691..698 ─╮ 0 0.000018
pread 8 699..706 │ 0 0.000016
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 707..714 │ 1 0.000019
│ pread 7 715..721 ─╯ 1 0.000017
╰─► pread 8 723..730 ─╮ 0 0.000016
pread 8 731..738 │ 0 0.000019
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 739..746 │ 1 0.000018
│ pread 5 747..751 ─╯ 1 0.000013

The patch can call it three times when that's the configured I/O
concurrency level, because that controls when the pread() calls catch
up with the first block:

fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 2 255..256 2 0.000014
fadvise ●───────────────────╮││ 3 0.000007
││╰────► pread 8 258..265 ─╮ 2 0.000035
│╰─────► preadv 8 266..273 │ 1 0.000021
╰──────► pread 8 274..281 │ 0 0.000017
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 282..289 │ 1 0.000017
fadvise ●───────────────────────╮│ │ 2 0.000007
││ pread 6 290..295 ─╯ 2 0.000015
fadvise ●──────────────────────╮││ 3 0.000007
││╰─► pread 8 297..304 ─╮ 2 0.000015
fadvise ●─────────────────────╮││ │ 3 0.000007
││╰──► pread 8 305..312 │ 2 0.000017

Purely sequential streams still get none:

pread 1 0..0 ─╮ 0 0.000016
pread 2 1..2 │ 0 0.000014
pread 4 3..6 │ 0 0.000021
pread 8 7..14 │ 0 0.000034
... blah blah blah ...
pread 8 4455..4462 │ 0 0.000029
pread 8 4463..4470 │ 0 0.000026
pread 8 4471..4478 │ 0 0.000020
pread 1 4479..4479 ─╯ 0 0.000010
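To spell out 0003's advice logic in one place, it boils down to this
little state machine (a paraphrase using the seq_until_processed name
agreed above, not the exact code in the attached patches):

    /* In read_stream_start_pending_read(), when choosing flags: */
    flags = 0;
    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum != stream->seq_blocknum)
        {
            /* Random jump: new run, remember its first block. */
            stream->seq_until_processed = stream->pending_read_blocknum;
            flags = READ_BUFFERS_ISSUE_ADVICE;
        }
        else if (stream->seq_until_processed != InvalidBlockNumber)
            flags = READ_BUFFERS_ISSUE_ADVICE;  /* still catching up */
    }

    /* In read_stream_next_buffer(), when consuming a completed I/O: */
    if (stream->ios[io_index].op.blocknum == stream->seq_until_processed)
        stream->seq_until_processed = InvalidBlockNumber;   /* caught up */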
Attachments:
- v3-0001-Improve-buffer-manager-API-for-backend-pin-limits.patch
- v3-0002-Respect-pin-limits-accurately-in-read_stream.c.patch
- v3-0003-Improve-read-stream-advice-for-large-random-chunk.patch
- v3-0004-Look-ahead-more-when-sequential-in-read_stream.c.patch
- v3-0005-Support-buffer-forwarding-in-read_stream.c.patch
- v3-0006-Support-buffer-forwarding-in-StartReadBuffers.patch