Re: Some read stream improvements - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Some read stream improvements
Msg-id: 4zrbkuxva3yecbtmsgxhdu6lsu3cld7ejtbunocy3z46p7jkp5@ew2k27u2fxsm
In response to: Re: Some read stream improvements (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Some read stream improvements
List: pgsql-hackers
Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote a
> > > > test for temp tables -- our coverage is fairly awful, there were many
> > > > times during AIO development where I knew I had trivially reachable
> > > > temp table specific bugs but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is
> > > > fixed by the patches in this thread (which I included in the other
> > > > thread):
>
> Here is a subset of those patches again:
>
> 1. Per-backend buffer limit, take III.  Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
>
> Annoyingly, test cases like the one below still fail, despite following
> the rules.  The other streams eat all the buffers and then one gets an
> allowance of zero, but uses its right to take one pin anyway to make
> progress, and there isn't one.

I think that may be ok. If there are no unpinned buffers, it seems to be
expected that starting a new stream will fail. That's not the same as - as
we did previously - failing in a read stream that did start successfully,
because we issue large IOs even though there are only a small number of
unpinned buffers.

> I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?

> 2. It shouldn't give up issuing random advice immediately after a jump,
> or it could stall on (say) the second 128kB of a 256kB sequential chunk
> (ie the strace you showed on the BHS thread).  It only makes sense to
> assume kernel readahead takes over once you've actually *read*
> sequentially.  In practice this makes it a lot more aggressive about
> advice (like the BHS code in master): it only gives up if the whole
> look-ahead window is sequential.
>
> 3. Change the distance algorithm to care only about hits and misses, not
> sequential heuristics.  It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
>
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing,
> hit-miss-hit-miss... would be considered sequential, and even if you fix
> that (the forwarding patches above fix that), an exact hit-miss-hit-miss
> pattern also gets stuck between distances 1 and 2 (double, decrement,
> double, ...).  Might be worth waiting a bit longer before decrementing,
> IDK.
>
> I'll rebase the others and post soon.

> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> +	Assert(NLocalPinnedBuffers <= num_temp_buffers);
> +	return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does - it can
return 0. Which makes some sense, as pinning an additional buffer will
always fail. Perhaps worth calling out though?
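FWIW, given the "take one pin anyway to make progress" rule above, the
caller pattern I'd expect looks roughly like this (an untested sketch;
"nblocks" here just stands in for the size of the pending read):

	/*
	 * Sketch only: clamp the pending read to the remaining per-backend
	 * allowance, but let a stream that currently holds no pins take one
	 * anyway, so it can always make progress.
	 */
	uint32		limit = GetAdditionalLocalPinLimit();

	if (limit == 0 && stream->pinned_buffers == 0)
		limit = 1;				/* the "right to take one pin anyway" */
	nblocks = Min(nblocks, limit);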
> static void
> read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> {
> 	while (stream->ios_in_progress < stream->max_ios &&
> -		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> +		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> +			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0"
really mean? And when would it be true when the pre-existing condition
wouldn't already be true?

> 	{
> 		BlockNumber blocknum;
> 		int16		buffer_index;
> 		void	   *per_buffer_data;
>
> -		if (stream->pending_read_nblocks == io_combine_limit)
> +		/* If have a pending read that can't be extended, start it now. */
> +		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> +			   stream->max_pinned_buffers);
> +		if (stream->pending_read_nblocks == io_combine_limit ||
> +			(stream->pinned_buffers == 0 &&
> +			 stream->pending_read_nblocks == stream->max_pinned_buffers))
> 		{
> 			read_stream_start_pending_read(stream, suppress_advice);
> 			suppress_advice = false;

> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> 		/* We have to start the pending read before we can build another. */
> 		while (stream->pending_read_nblocks > 0)
> 		{
> -			read_stream_start_pending_read(stream, suppress_advice);
> -			suppress_advice = false;
> -			if (stream->ios_in_progress == stream->max_ios)
> +			if (!read_stream_start_pending_read(stream, suppress_advice) ||
> +				stream->ios_in_progress == stream->max_ios)
> 			{
> -				/* And we've hit the limit.  Rewind, and stop here. */
> +				/* And we've hit a buffer or I/O limit.  Rewind and wait. */
> 				read_stream_unget_block(stream, blocknum);
> 				return;
> 			}
> +
> +			suppress_advice = false;
> 		}

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a
multi-block read?

> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> 	else
> 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>
> -	/*
> -	 * If advice hasn't been suppressed, this system supports it, and this
> -	 * isn't a strictly sequential pattern, then we'll issue advice.
> -	 */
> -	if (!suppress_advice &&
> -		stream->advice_enabled &&
> -		stream->pending_read_blocknum != stream->seq_blocknum)
> +	/* Do we need to issue read-ahead advice? */
> +	flags = 0;
> +	if (stream->advice_enabled)
> +	{
> 		flags = READ_BUFFERS_ISSUE_ADVICE;
> -	else
> -		flags = 0;
> +
> +		if (stream->pending_read_blocknum == stream->seq_blocknum)
> +		{
> +			/*
> +			 * Suppress advice if our WaitReadBuffers() calls have caught up
> +			 * with the first advice we issued for this sequential run.
> +			 */
> +			if (stream->seq_start == InvalidBlockNumber)
> +				suppress_advice = true;
> +		}
> +		else
> +		{
> +			/* Random jump, so start a new sequential run. */
> +			stream->seq_start = stream->pending_read_blocknum;
> +		}
> +
> +		if (suppress_advice)
> +			flags = 0;
> +	}

Seems a bit confusing to first set flags = READ_BUFFERS_ISSUE_ADVICE and
then unset it again later. Maybe just set it in if (!suppress_advice)?
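I.e. something like this (untested sketch, same logic as the patch, just
restructured so the flag is set exactly once):

	/* Do we need to issue read-ahead advice? */
	flags = 0;
	if (stream->advice_enabled)
	{
		if (stream->pending_read_blocknum == stream->seq_blocknum)
		{
			/*
			 * Sequential: suppress advice once WaitReadBuffers() has
			 * caught up with the first advice of this run.
			 */
			if (stream->seq_start == InvalidBlockNumber)
				suppress_advice = true;
		}
		else
		{
			/* Random jump, so start a new sequential run. */
			stream->seq_start = stream->pending_read_blocknum;
		}

		if (!suppress_advice)
			flags = READ_BUFFERS_ISSUE_ADVICE;
	}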
> 	 * Skip the initial ramp-up phase if the caller says we're going to be

> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> 			distance = stream->distance * 2;
> 			distance = Min(distance, stream->max_pinned_buffers);
> 			stream->distance = distance;
> +
> +			/*
> +			 * If we've caught up with the first advice issued for the
> +			 * current sequential run, cancel further advice until the next
> +			 * random jump.  The kernel should be able to see the pattern
> +			 * now that we're issuing sequential preadv() calls.
> +			 */
> +			if (stream->ios[io_index].op.blocknum == stream->seq_start)
> +				stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of sequentialness, it
denotes up to where the caller needs to process before we disable
sequential access. Maybe add a comment to it and rename it to something
like ->seq_until_processed?

Other than this the approach seems to make sense!

Greetings,

Andres Freund
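PS: If we do rename it, a struct comment along these lines (just a sketch,
using the name floated above) might capture the semantics:

	/*
	 * Block number of the first advice issued for the current sequential
	 * run.  Once the caller has consumed up to here we stop issuing
	 * advice; InvalidBlockNumber means we've caught up and advice stays
	 * suppressed until the next random jump.
	 */
	BlockNumber	seq_until_processed;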