Re: Some read stream improvements - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Some read stream improvements
Date
Msg-id 4zrbkuxva3yecbtmsgxhdu6lsu3cld7ejtbunocy3z46p7jkp5@ew2k27u2fxsm
In response to Re: Some read stream improvements  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > > development where I knew I had trivially reachable temp table specific bugs
> > > > but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is fixed by the
> > > > patches in this thread (which I included in the other thread):
> 
> Here is a subset of those patches again:
> 
> 1.  Per-backend buffer limit, take III.  Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
> 
> Annoyingly, test cases like the one below still fail, despite
> following the rules.  The other streams eat all the buffers and then
> one gets an allowance of zero, but uses its right to take one pin
> anyway to make progress, and there isn't one.

I think that may be ok.  If there are no unpinned buffers, it seems
expected that starting a new stream will fail.  That's not the same as
what we did previously: failing in a read stream that had already started
successfully, because we issued large IOs even though only a small number
of unpinned buffers remained.


> I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?


> 2.  It shouldn't give up issuing random advice immediately after a
> jump, or it could stall on (say) the second 128kB of a 256kB
> sequential chunk (ie the strace you showed on the BHS thread).  It
> only makes sense to assume kernel readahead takes over once you've
> actually *read* sequentially.  In practice this makes it a lot more
> aggressive about advice (like the BHS code in master): it only gives
> up if the whole look-ahead window is sequential.


> 3.  Change the distance algorithm to care only about hits and misses,
> not sequential heuristics.  It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
> 
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing,
> hit-miss-hit-miss... would be considered sequential, and even if you
> fix that (the forwarding patches above fix it), an exact
> hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
> (double, decrement, double, ...).  It might be worth waiting a bit
> longer before decrementing, IDK.
> 
> I'll rebase the others and post soon.



> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> +    Assert(NLocalPinnedBuffers <= num_temp_buffers);
> +    return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does - it can
return 0.  Which makes some sense, since pinning an additional buffer will
always fail in that case.  Perhaps worth calling out in a comment, though?


>  static void
>  read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>  {
>      while (stream->ios_in_progress < stream->max_ios &&
> -           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> +           ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> +            stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
mean? And when would it be true when the pre-existing condition wouldn't
already be true?


>      {
>          BlockNumber blocknum;
>          int16        buffer_index;
>          void       *per_buffer_data;
>  
> -        if (stream->pending_read_nblocks == io_combine_limit)
> +        /* If have a pending read that can't be extended, start it now. */
> +        Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> +               stream->max_pinned_buffers);
> +        if (stream->pending_read_nblocks == io_combine_limit ||
> +            (stream->pinned_buffers == 0 &&
> +             stream->pending_read_nblocks == stream->max_pinned_buffers))
>          {
>              read_stream_start_pending_read(stream, suppress_advice);
>              suppress_advice = false;
> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>          /* We have to start the pending read before we can build another. */
>          while (stream->pending_read_nblocks > 0)
>          {
> -            read_stream_start_pending_read(stream, suppress_advice);
> -            suppress_advice = false;
> -            if (stream->ios_in_progress == stream->max_ios)
> +            if (!read_stream_start_pending_read(stream, suppress_advice) ||
> +                stream->ios_in_progress == stream->max_ios)
>              {
> -                /* And we've hit the limit.  Rewind, and stop here. */
> +                /* And we've hit a buffer or I/O limit.  Rewind and wait. */
>                  read_stream_unget_block(stream, blocknum);
>                  return;
>              }
> +
> +            suppress_advice = false;
>          }

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?



> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
>      else
>          Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>  
> -    /*
> -     * If advice hasn't been suppressed, this system supports it, and this
> -     * isn't a strictly sequential pattern, then we'll issue advice.
> -     */
> -    if (!suppress_advice &&
> -        stream->advice_enabled &&
> -        stream->pending_read_blocknum != stream->seq_blocknum)
> +    /* Do we need to issue read-ahead advice? */
> +    flags = 0;
> +    if (stream->advice_enabled)
> +    {
>          flags = READ_BUFFERS_ISSUE_ADVICE;
> -    else
> -        flags = 0;
> +
> +        if (stream->pending_read_blocknum == stream->seq_blocknum)
> +        {
> +            /*
> +             * Suppress advice if our WaitReadBuffers() calls have caught up
> +             * with the first advice we issued for this sequential run.
> +             */
> +            if (stream->seq_start == InvalidBlockNumber)
> +                suppress_advice = true;
> +        }
> +        else
> +        {
> +            /* Random jump, so start a new sequential run. */
> +            stream->seq_start = stream->pending_read_blocknum;
> +        }
> +
> +        if (suppress_advice)
> +            flags = 0;
> +    }

Seems a bit confusing to first set
  flags = READ_BUFFERS_ISSUE_ADVICE
only to unset it again later.  Maybe just set it inside an
if (!suppress_advice) check?
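
Concretely, something along these lines (a sketch against simplified
stand-in types, not a tested patch against read_stream.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define InvalidBlockNumber        ((uint32_t) 0xFFFFFFFF)
#define READ_BUFFERS_ISSUE_ADVICE 0x01

/* Simplified stand-in for the relevant ReadStream fields. */
typedef struct
{
    bool     advice_enabled;
    uint32_t pending_read_blocknum;
    uint32_t seq_blocknum;
    uint32_t seq_start;
} ToyStream;

/*
 * Same logic as the quoted hunk, but READ_BUFFERS_ISSUE_ADVICE is only
 * set once we know advice isn't suppressed, instead of set-then-cleared.
 */
static int
compute_advice_flags(ToyStream *stream, bool suppress_advice)
{
    int flags = 0;

    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum == stream->seq_blocknum)
        {
            /* Caught up with the first advice of this sequential run? */
            if (stream->seq_start == InvalidBlockNumber)
                suppress_advice = true;
        }
        else
        {
            /* Random jump, so start a new sequential run. */
            stream->seq_start = stream->pending_read_blocknum;
        }

        if (!suppress_advice)
            flags = READ_BUFFERS_ISSUE_ADVICE;
    }

    return flags;
}
```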



>       * Skip the initial ramp-up phase if the caller says we're going to be
> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>              distance = stream->distance * 2;
>              distance = Min(distance, stream->max_pinned_buffers);
>              stream->distance = distance;
> +
> +            /*
> +             * If we've caught up with the first advice issued for the current
> +             * sequential run, cancel further advice until the next random
> +             * jump.  The kernel should be able to see the pattern now that
> +             * we're issuing sequential preadv() calls.
> +             */
> +            if (stream->ios[io_index].op.blocknum == stream->seq_start)
> +                stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of the sequential
run; it denotes the point up to which the caller needs to process before
we suppress advice.  Maybe add a comment to it and rename it to something
like ->seq_until_processed?

Other than this the approach seems to make sense!

Greetings,

Andres Freund


