Thread: Re: Some read stream improvements

Re: Some read stream improvements

From
Andres Freund
Date:
Hi,

On 2025-02-17 17:55:09 +1300, Thomas Munro wrote:
> 0004-Respect-pin-limits-accurately-in-read_stream.c.patch
> 
> The current coding only computes the remaining "fair share" of the
> buffer pool for this backend at stream initialisation.  It's hard, but
> not impossible, to get one backend to pin more than 1/max_connections
> of the buffer pool (the intended cap), when using multiple streams at
> the same time in one backend.  This patch moves the time of check to
> the time of use, so it respects the limit strictly.  I avoid adding
> any changes to the fast path for all-cached streams, which only pin
> one buffer at a time.

I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.

The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):

https://postgr.es/m/knr4aazlaa4nj3xnpe4tu6plwayovzxhmteatcpry2j6a6kc4v%40aonkl53s2ecs

Just linked instead of attached to not trigger cfbot.

Greetings,

Andres Freund



Re: Some read stream improvements

From
Thomas Munro
Date:
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> I was working on expanding tests for AIO and as part of that wrote a test for
> temp tables -- our coverage is fairly awful, there were many times during AIO
> development where I knew I had trivially reachable temp table specific bugs
> but all tests passed.
>
> The test for that does trigger the problem described above and is fixed by the
> patches in this thread (which I included in the other thread):

Thanks.  Alright, I'm assuming that you don't have any objections to
the way I restyled that API, so I'm going to go ahead and push some of
these shortly, and then follow up with a few newer patches that
simplify and improve the look-ahead and advice control.  More very
soon.



Re: Some read stream improvements

From
Thomas Munro
Date:
On Thu, Feb 27, 2025 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I was working on expanding tests for AIO and as part of that wrote a test for
> > temp tables -- our coverage is fairly awful, there were many times during AIO
> > development where I knew I had trivially reachable temp table specific bugs
> > but all tests passed.
> >
> > The test for that does trigger the problem described above and is fixed by the
> > patches in this thread (which I included in the other thread):
>
> Thanks.  Alright, I'm assuming that you don't have any objections to
> the way I restyled that API, so I'm going to go ahead and push some of
> these shortly, and then follow up with a few newer patches that
> simplify and improve the look-ahead and advice control.  More very
> soon.

Ugh, I realised in another round of self-review that that version
could exceed the soft limit by a small amount if the registered
callback pins more buffers underneath it, so not pushed yet.  I think
I see how to fix that (namely the alternative design that a comment
already contemplated), more soon...



Re: Some read stream improvements

From
Andres Freund
Date:
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I was working on expanding tests for AIO and as part of that wrote a test for
> > temp tables -- our coverage is fairly awful, there were many times during AIO
> > development where I knew I had trivially reachable temp table specific bugs
> > but all tests passed.
> >
> > The test for that does trigger the problem described above and is fixed by the
> > patches in this thread (which I included in the other thread):
> 
> Thanks.  Alright, I'm assuming that you don't have any objections to
> the way I restyled that API, so I'm going to go ahead and push some of
> these shortly, and then follow up with a few newer patches that
> simplify and improve the look-ahead and advice control.  More very
> soon.

Indeed, no objections, rather the opposite. Thanks!

Greetings,

Andres Freund



Re: Some read stream improvements

From
Thomas Munro
Date:
On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > development where I knew I had trivially reachable temp table specific bugs
> > > but all tests passed.
> > >
> > > The test for that does trigger the problem described above and is fixed by the
> > > patches in this thread (which I included in the other thread):

Here is a subset of those patches again:

1.  Per-backend buffer limit, take III.  Now the check is in
read_stream_start_pending_read(), so the time of check matches the time
of use (TOC == TOU).

Annoyingly, test cases like the one below still fail, despite
following the rules.  The other streams eat all the buffers and then
one gets an allowance of zero, but uses its right to take one pin
anyway to make progress, and there isn't one available.  I wonder if we
should use temp_buffers - 100 instead?  Then leave the minimum GUC
value at 100, so you still have an easy way to test with 0, 1, ...
additional buffers?

2.  It shouldn't give up issuing random advice immediately after a
jump, or it could stall on (say) the second 128kB of a 256kB
sequential chunk (ie the strace you showed on the BHS thread).  It
only makes sense to assume kernel readahead takes over once you've
actually *read* sequentially.  In practice this makes it a lot more
aggressive about advice (like the BHS code in master): it only gives
up if the whole look-ahead window is sequential.

3.  Change the distance algorithm to care only about hits and misses,
not sequential heuristics.  It made at least some sense before, but it
doesn't make sense for AIO, and even in synchronous mode it means that
you hit random jumps with insufficient look-ahead, so I don't think we
should keep it.

I also realised that the sequential heuristics are confused by that
hidden trailing block thing, so in contrived pattern testing a
hit-miss-hit-miss... pattern would be considered sequential.  Even if
you fix that (the forwarding patches above do), an exact
hit-miss-hit-miss pattern still gets stuck between distances 1 and 2
(double, decrement, double, ...).  It might be worth waiting a bit
longer before decrementing, IDK.
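
To make the oscillation concrete, here's a throwaway stand-alone model of
the plain "decrement on hit, double on miss" distance rule (just a sketch,
not the real read_stream.c code; the cap value is made up):

#include <stdio.h>

int
main(void)
{
    int         distance = 1;
    int         max_pinned_buffers = 256;   /* made-up cap */

    for (int i = 0; i < 8; i++)
    {
        int         hit = (i % 2 == 0);     /* strict hit-miss-hit-miss pattern */

        if (hit)
        {
            /* Cached block: wind the look-ahead window back down. */
            if (distance > 1)
                distance--;
        }
        else
        {
            /* Miss: double the window, capped by the pin limit. */
            distance *= 2;
            if (distance > max_pinned_buffers)
                distance = max_pinned_buffers;
        }
        printf("step %d: %s -> distance %d\n", i, hit ? "hit " : "miss", distance);
    }
    return 0;
}

It just prints 1, 2, 1, 2, ..., so the look-ahead window never gets a
chance to grow.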

I'll rebase the others and post soon.


set io_combine_limit = 32;
set temp_buffers = 100;

create temp table t1 as select generate_series(1, 10000);
create temp table t2 as select generate_series(1, 10000);
create temp table t3 as select generate_series(1, 10000);
create temp table t4 as select generate_series(1, 10000);
create temp table t5 as select generate_series(1, 10000);

do
$$
declare
  c1 cursor for select * from t1;
  c2 cursor for select * from t2;
  c3 cursor for select * from t3;
  c4 cursor for select * from t4;
  c5 cursor for select * from t5;
  x record;
begin
  open c1;
  open c2;
  open c3;
  open c4;
  open c5;
  loop
    fetch next from c1 into x;
    exit when not found;
    fetch next from c2 into x;
    exit when not found;
    fetch next from c3 into x;
    exit when not found;
    fetch next from c4 into x;
    exit when not found;
    fetch next from c5 into x;
    exit when not found;
  end loop;
end;
$$;

Attachment

Re: Some read stream improvements

From
Andres Freund
Date:
Hi,

On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I was working on expanding tests for AIO and as part of that wrote a test for
> > > > temp tables -- our coverage is fairly awful, there were many times during AIO
> > > > development where I knew I had trivially reachable temp table specific bugs
> > > > but all tests passed.
> > > >
> > > > The test for that does trigger the problem described above and is fixed by the
> > > > patches in this thread (which I included in the other thread):
> 
> Here is a subset of those patches again:
> 
> 1.  Per-backend buffer limit, take III.  Now the check is in
> read_stream_start_pending_read() so TOC == TOU.
> 
> Annoyingly, test cases like the one below still fail, despite
> following the rules.  The other streams eat all the buffers and then
> one gets an allowance of zero, but uses its right to take one pin
> anyway to make progress, and there isn't one.

I think that may be ok.  If there are no unpinned buffers, it seems to be
expected that starting a new stream will fail.  That's not the same as what
we did previously: failing inside a read stream that had started
successfully, because we issued large IOs even though there were only a
small number of unpinned buffers.


> I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> value at 100 still, so you have an easy way to test with 0, 1,
> ... additional buffers?

I think that just makes it harder to test the exhaustion scenario without
really fixing anything?


> 2.  It shouldn't give up issuing random advice immediately after a
> jump, or it could stall on (say) the second 128kB of a 256kB
> sequential chunk (ie the strace you showed on the BHS thread).  It
> only makes sense to assume kernel readahead takes over once you've
> actually *read* sequentially.  In practice this makes it a lot more
> aggressive about advice (like the BHS code in master): it only gives
> up if the whole look-ahead window is sequential.


> 3.  Change the distance algorithm to care only about hits and misses,
> not sequential heuristics.  It made at least some sense before, but it
> doesn't make sense for AIO, and even in synchronous mode it means that
> you hit random jumps with insufficient look-ahead, so I don't think we
> should keep it.
> 
> I also realised that the sequential heuristics are confused by that
> hidden trailing block thing, so in contrived pattern testing with
> hit-miss-hit-miss... would be considered sequential, and even if you
> fix that (the forwarding patches above fix that), an exact
> hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
> (double, decrement, double, ... might be worth waiting a bit longer
> before decrementing, IDK.
> 
> I'll rebase the others and post soon.



> +
> +/* see GetAdditionalPinLimit() */
> +uint32
> +GetAdditionalLocalPinLimit(void)
> +{
> +    Assert(NLocalPinnedBuffers <= num_temp_buffers);
> +    return num_temp_buffers - NLocalPinnedBuffers;
> +}

This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
0.  Which makes some sense - pinning an additional buffer will always
fail.  Perhaps worth calling out though?


>  static void
>  read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>  {
>      while (stream->ios_in_progress < stream->max_ios &&
> -           stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> +           ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> +            stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))

What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
mean? And when would it be true when the pre-existing condition wouldn't
already be true?


>      {
>          BlockNumber blocknum;
>          int16        buffer_index;
>          void       *per_buffer_data;
>  
> -        if (stream->pending_read_nblocks == io_combine_limit)
> +        /* If have a pending read that can't be extended, start it now. */
> +        Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> +               stream->max_pinned_buffers);
> +        if (stream->pending_read_nblocks == io_combine_limit ||
> +            (stream->pinned_buffers == 0 &&
> +             stream->pending_read_nblocks == stream->max_pinned_buffers))
>          {
>              read_stream_start_pending_read(stream, suppress_advice);
>              suppress_advice = false;
> @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
>          /* We have to start the pending read before we can build another. */
>          while (stream->pending_read_nblocks > 0)
>          {
> -            read_stream_start_pending_read(stream, suppress_advice);
> -            suppress_advice = false;
> -            if (stream->ios_in_progress == stream->max_ios)
> +            if (!read_stream_start_pending_read(stream, suppress_advice) ||
> +                stream->ios_in_progress == stream->max_ios)
>              {
> -                /* And we've hit the limit.  Rewind, and stop here. */
> +                /* And we've hit a buffer or I/O limit.  Rewind and wait. */
>                  read_stream_unget_block(stream, blocknum);
>                  return;
>              }
> +
> +            suppress_advice = false;
>          }

If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?



> @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
>      else
>          Assert(stream->next_buffer_index == stream->oldest_buffer_index);
>  
> -    /*
> -     * If advice hasn't been suppressed, this system supports it, and this
> -     * isn't a strictly sequential pattern, then we'll issue advice.
> -     */
> -    if (!suppress_advice &&
> -        stream->advice_enabled &&
> -        stream->pending_read_blocknum != stream->seq_blocknum)
> +    /* Do we need to issue read-ahead advice? */
> +    flags = 0;
> +    if (stream->advice_enabled)
> +    {
>          flags = READ_BUFFERS_ISSUE_ADVICE;
> -    else
> -        flags = 0;
> +
> +        if (stream->pending_read_blocknum == stream->seq_blocknum)
> +        {
> +            /*
> +             * Suppress advice if our WaitReadBuffers() calls have caught up
> +             * with the first advice we issued for this sequential run.
> +             */
> +            if (stream->seq_start == InvalidBlockNumber)
> +                suppress_advice = true;
> +        }
> +        else
> +        {
> +            /* Random jump, so start a new sequential run. */
> +            stream->seq_start = stream->pending_read_blocknum;
> +        }
> +
> +        if (suppress_advice)
> +            flags = 0;
> +    }

Seems a bit confusing to first set
  flags = READ_BUFFERS_ISSUE_ADVICE
to then later unset it again. Maybe just set it in if (!suppress_advice)?
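
Something like this, perhaps (just a sketch of that rearrangement, reusing
the identifiers from the hunk above, not tested):

    /* Do we need to issue read-ahead advice? */
    flags = 0;
    if (stream->advice_enabled)
    {
        if (stream->pending_read_blocknum != stream->seq_blocknum)
        {
            /* Random jump, so start a new sequential run. */
            stream->seq_start = stream->pending_read_blocknum;
        }
        else if (stream->seq_start == InvalidBlockNumber)
        {
            /* Caught up with the first advice issued for this run. */
            suppress_advice = true;
        }

        if (!suppress_advice)
            flags = READ_BUFFERS_ISSUE_ADVICE;
    }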



>       * Skip the initial ramp-up phase if the caller says we're going to be
> @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>              distance = stream->distance * 2;
>              distance = Min(distance, stream->max_pinned_buffers);
>              stream->distance = distance;
> +
> +            /*
> +             * If we've caught up with the first advice issued for the current
> +             * sequential run, cancel further advice until the next random
> +             * jump.  The kernel should be able to see the pattern now that
> +             * we're issuing sequential preadv() calls.
> +             */
> +            if (stream->ios[io_index].op.blocknum == stream->seq_start)
> +                stream->seq_start = InvalidBlockNumber;

So stream->seq_start doesn't really denote the start of sequentialness, it
denotes up to where the caller needs to process before we disable sequential
access. Maybe add a comment to it and rename it to something like
->seq_until_processed?

Other than this the approach seems to make sense!

Greetings,

Andres Freund



Re: Some read stream improvements

From
Thomas Munro
Date:
On Wed, Mar 12, 2025 at 8:29 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
> > On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > I wonder if we should use temp_buffers - 100?  Then leave the minimum GUC
> > value at 100 still, so you have an easy way to test with 0, 1,
> > ... additional buffers?
>
> I think that just makes it harder to test the exhaustion scenario without
> really fixing anything?

Not sure I agree yet but I'll come back to this in a bit (I think
there might be something worth thinking about some more here but it's
not in the way of committing these patches).

> > +/* see GetAdditionalPinLimit() */
> > +uint32
> > +GetAdditionalLocalPinLimit(void)
> > +{
> > +     Assert(NLocalPinnedBuffers <= num_temp_buffers);
> > +     return num_temp_buffers - NLocalPinnedBuffers;
> > +}
>
> This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
> 0. Which makes some sense, pinning an additional buffer will always
> fail. Perhaps worth calling out though?

No, GetAdditionalPinLimit() works that way too.  It's only
LimitAdditional[Local]Pins() that has the special "you can always
have one" logic that I needed to escape from.  But yes I should
highlight that in a comment: done above GetAdditionalPinLimit().
GetAdditionalLocalPinLimit() just tells you to see the shared version.
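
Roughly, the distinction is this (an illustrative sketch only -- the helper
name is made up, this is not the bufmgr.c code):

/*
 * GetAdditional[Local]PinLimit() may return 0 once this backend already
 * holds its fair share, and leaves it to the caller to decide what to do
 * about that.  The LimitAdditional[Local]Pins() style instead clamps a
 * request but always leaves at least one, along these lines:
 */
static uint32
clamp_but_always_allow_one(uint32 requested, uint32 remaining_allowance)
{
    uint32      limit = Max(remaining_allowance, 1);    /* can always have one */

    return Min(requested, limit);
}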

> >  static void
> >  read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> >  {
> >       while (stream->ios_in_progress < stream->max_ios &&
> > -                stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
> > +                ((stream->pinned_buffers == 0 && stream->distance > 0) ||
> > +                     stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
>
> What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
> mean? And when would it be true when the pre-existing condition wouldn't
> already be true?

Well the reason is basically that the distance can get chomped lower
than pending_read_nblocks if you recently hit the pin limit, and we
have to be able to start your I/O anyway (at least one block of it) to
make progress.  But I realised that was a stupid way to handle that,
and actually I had screwed up further down, and the right way is just:

@@ -382,15 +435,25 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
         * io_combine_limit size once more buffers have been consumed.  However,
         * if we've already reached io_combine_limit, or we've reached the
         * distance limit and there isn't anything pinned yet, or the callback has
-        * signaled end-of-stream, we start the read immediately.
+        * signaled end-of-stream, we start the read immediately.  Note that the
+        * pending read could even exceed the distance goal, if the latter was
+        * reduced on buffer limit exhaustion.
         */
        if (stream->pending_read_nblocks > 0 &&
                (stream->pending_read_nblocks == stream->io_combine_limit ||
-                (stream->pending_read_nblocks == stream->distance &&
+                (stream->pending_read_nblocks >= stream->distance &&
                  stream->pinned_buffers == 0) ||
                 stream->distance == 0) &&
                stream->ios_in_progress < stream->max_ios)
                read_stream_start_pending_read(stream, suppress_advice);

> >       {
> >               BlockNumber blocknum;
> >               int16           buffer_index;
> >               void       *per_buffer_data;
> >
> > -             if (stream->pending_read_nblocks == io_combine_limit)
> > +             /* If have a pending read that can't be extended, start it now. */
> > +             Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
> > +                        stream->max_pinned_buffers);
> > +             if (stream->pending_read_nblocks == io_combine_limit ||
> > +                     (stream->pinned_buffers == 0 &&
> > +                      stream->pending_read_nblocks == stream->max_pinned_buffers))
> >               {
> >                       read_stream_start_pending_read(stream, suppress_advice);
> >                       suppress_advice = false;
> > @@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
> >               /* We have to start the pending read before we can build another. */
> >               while (stream->pending_read_nblocks > 0)
> >               {
> > -                     read_stream_start_pending_read(stream, suppress_advice);
> > -                     suppress_advice = false;
> > -                     if (stream->ios_in_progress == stream->max_ios)
> > +                     if (!read_stream_start_pending_read(stream, suppress_advice) ||
> > +                             stream->ios_in_progress == stream->max_ios)
> >                       {
> > -                             /* And we've hit the limit.  Rewind, and stop here. */
> > +                             /* And we've hit a buffer or I/O limit.  Rewind and wait. */
> >                               read_stream_unget_block(stream, blocknum);
> >                               return;
> >                       }
> > +
> > +                     suppress_advice = false;
> >               }
>
> If read_stream_start_pending_read() returns false because we hit the pin
> limit, does it really help to call read_stream_unget_block()? IIUC that'll
> defer one block for later - but what about the other buffers in a multi-block
> read?

They are already represented by (pending_read_blocknum,
pending_read_nblocks).  We were unable to append this block number to
the pending read because it's already full-sized or the newly acquired
block number isn't consecutive, but we were also unable to start the
pending read to get it out of the way.  That was a pre-existing edge
case that you could already hit if it turned out to be a short read:
ie the remaining part of the pending read is still in your way, and
now you've reached stream->max_ios so you can't start it.  So we had
to put it aside for later.

This change piggy-backs on that approach for buffer starvation:
read_stream_start_pending_read() can now decline to start even a partial
read.  In fact it usually declines, unless it is forced to accept a
short read because stream->pinned_buffers == 0 (ie, we have to do
whatever we can to make progress).

It's OK that pending_read_nblocks exceeds what we can start right now,
we still remember it, and we can always start it one block at a time
if it comes to it.  Make sense?
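
Spelled out, the decision amounts to something like this (a stand-alone
sketch of the policy, not the actual read_stream.c function):

static bool
can_start_pending_read(int pending_read_nblocks, int buffer_limit,
                       int pinned_buffers, int *nblocks_to_start)
{
    if (pending_read_nblocks <= buffer_limit)
    {
        /* The whole pending read fits under the pin allowance. */
        *nblocks_to_start = pending_read_nblocks;
        return true;
    }

    if (pinned_buffers > 0)
    {
        /* Decline for now; the consumer will release buffers shortly. */
        return false;
    }

    /* Nothing pinned: forced to make progress, even one block at a time. */
    *nblocks_to_start = Max(buffer_limit, 1);
    return true;
}

The remainder of a clamped read just stays in (pending_read_blocknum,
pending_read_nblocks) for next time.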

> > @@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
> >       else
> >               Assert(stream->next_buffer_index == stream->oldest_buffer_index);
> >
> > -     /*
> > -      * If advice hasn't been suppressed, this system supports it, and this
> > -      * isn't a strictly sequential pattern, then we'll issue advice.
> > -      */
> > -     if (!suppress_advice &&
> > -             stream->advice_enabled &&
> > -             stream->pending_read_blocknum != stream->seq_blocknum)
> > +     /* Do we need to issue read-ahead advice? */
> > +     flags = 0;
> > +     if (stream->advice_enabled)
> > +     {
> >               flags = READ_BUFFERS_ISSUE_ADVICE;
> > -     else
> > -             flags = 0;
> > +
> > +             if (stream->pending_read_blocknum == stream->seq_blocknum)
> > +             {
> > +                     /*
> > +                      * Suppress advice if our WaitReadBuffers() calls have caught up
> > +                      * with the first advice we issued for this sequential run.
> > +                      */
> > +                     if (stream->seq_start == InvalidBlockNumber)
> > +                             suppress_advice = true;
> > +             }
> > +             else
> > +             {
> > +                     /* Random jump, so start a new sequential run. */
> > +                     stream->seq_start = stream->pending_read_blocknum;
> > +             }
> > +
> > +             if (suppress_advice)
> > +                     flags = 0;
> > +     }
>
> Seems a bit confusing to first set
>   flags = READ_BUFFERS_ISSUE_ADVICE
> to then later unset it again. Maybe just set it in if (!suppress_advice)?

Yeah that was a bit too tangly.  I found a better expression of that
logic, which also removed that annoying suppress_advice function
argument.  I hope this is much clearer.

> >        * Skip the initial ramp-up phase if the caller says we're going to be
> > @@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> >                       distance = stream->distance * 2;
> >                       distance = Min(distance, stream->max_pinned_buffers);
> >                       stream->distance = distance;
> > +
> > +                     /*
> > +                      * If we've caught up with the first advice issued for the current
> > +                      * sequential run, cancel further advice until the next random
> > +                      * jump.  The kernel should be able to see the pattern now that
> > +                      * we're issuing sequential preadv() calls.
> > +                      */
> > +                     if (stream->ios[io_index].op.blocknum == stream->seq_start)
> > +                             stream->seq_start = InvalidBlockNumber;
>
> So stream->seq_start doesn't really denote the start of sequentialness, it
> denotes up to where the caller needs to process before we disable sequential
> access. Maybe add a comment to it and rename it to something like
> ->seq_until_processed?

WFM.

> Other than this the approach seems to make sense!

Cool, so I'm planning to start pushing the earlier ones tomorrow.
Here also are the buffer forwarding ones, rebased on top.

Here's some strace art showing the old and new advice for patch 0003.
I traced ANALYZE with io_combine_limit=8 and used different
default_statistics_target settings to find interesting test cases.  The
"bracket" on the right shows a sequential range of blocks.  Master
only calls fadvise once per sequential chunk:

fadvise ●──────────────────────╮││                            3 0.000006
                               ││╰─► pread   1  676..676      2 0.000007
fadvise ●─────────────────────╮││                             3 0.000006
                              ││╰──► pread   1  678..678      2 0.000007
fadvise ●────────────────────╮││                              3 0.000007
                             ││╰───► pread   3  680..682      2 0.000031
                             │╰────► pread   6  684..689      1 0.000015
                             ╰─────► pread   8  691..698  ─╮  0 0.000018
                                     pread   8  699..706   │  0 0.000016
fadvise ●────────────────────────╮                         │  1 0.000007
                                 │   pread   8  707..714   │  1 0.000019
                                 │   pread   7  715..721  ─╯  1 0.000017
                                 ╰─► pread   8  723..730  ─╮  0 0.000016
                                     pread   8  731..738   │  0 0.000019
fadvise ●────────────────────────╮                         │  1 0.000007
                                 │   pread   8  739..746   │  1 0.000018
                                 │   pread   5  747..751  ─╯  1 0.000013

The patch can call fadvise three times when that's the configured I/O
concurrency level, because that controls when the pread() calls catch
up with the first block:

fadvise ●────────────────────╮││                              3 0.000007
                             ││╰───► pread   2  255..256      2 0.000014
fadvise ●───────────────────╮││                               3 0.000007
                            ││╰────► pread   8  258..265  ─╮  2 0.000035
                            │╰─────► preadv  8  266..273   │  1 0.000021
                            ╰──────► pread   8  274..281   │  0 0.000017
fadvise ●────────────────────────╮                         │  1 0.000007
                                 │   pread   8  282..289   │  1 0.000017
fadvise ●───────────────────────╮│                         │  2 0.000007
                                ││   pread   6  290..295  ─╯  2 0.000015
fadvise ●──────────────────────╮││                            3 0.000007
                               ││╰─► pread   8  297..304  ─╮  2 0.000015
fadvise ●─────────────────────╮││                          │  3 0.000007
                              ││╰──► pread   8  305..312   │  2 0.000017


Purely sequential streams still get none:

                                     pread   1    0..0    ─╮  0 0.000016
                                     pread   2    1..2     │  0 0.000014
                                     pread   4    3..6     │  0 0.000021
                                     pread   8    7..14    │  0 0.000034
                                          ... blah blah blah ...
                                     pread   8 4455..4462  │  0 0.000029
                                     pread   8 4463..4470  │  0 0.000026
                                     pread   8 4471..4478  │  0 0.000020
                                     pread   1 4479..4479 ─╯  0 0.000010

Attachment

Re: Some read stream improvements

From
Thomas Munro
Date:
I have pushed the new pin limit patches, after some more testing and
copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
change I'd like to make but it didn't belong in this commit) and
dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
comprehension and isn't even true for the local buffer version (which
I still think might be an issue, will come back to that, but again it
seemed independent).

For the record, here's a way to see 92fc6856^ or v17 misbehave and pin
too many buffers without involving any other proposed patches, only
v17 streaming seq scan: with shared_buffers=16MB, max_connections=4,
which gives MaxProportionalBuffers == 44, the attached shows three
cursors each pinning io_combine_limit = 32 buffers for a total of 96
at peak.  That's because each cursor starts a stream and sees (the
only time it would check) that it is allowed 44, and then we fetch in
round-robin order until they all ramp up to full I/O size.  In v17 we
can't really do much worse than that, and it requires pretty contrived
settings and a contrived workload for it to be a real issue AFAIK, but
obviously and hopefully we will soon have more stream users, eg BHS and
more.

Attachment

Re: Some read stream improvements

From
Andres Freund
Date:
Hi,

On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
> I have pushed the new pin limit patches, after some more testing and
> copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
> change I'd like to make but it didn't belong in this commit) and
> dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
> comprehension and isn't even true for the local buffer version (which
> I still think might be an issue, will come back to that, but again it
> seemed independent).

Something odd:

I was looking to push a test that increases localbuf.c coverage, which passed
with a previous version of these changes.  However, it failed, and it did so
on FreeBSD alone.  The same test doesn't fail when tested together with the
rest of the AIO work.

https://cirrus-ci.com/build/5951090857869312
https://cirrus-ci.com/task/6177637229395968

I do not understand what could be OS dependent here. I tried to build with
exactly the same options as CI on freebsd, but couldn't repro the issue.

It's perhaps worth noting that this failed even before my recent localbuf:
commits.


I ran CI with a patch that adds a bunch of debug information and just runs the
relevant test:
https://cirrus-ci.com/task/6516935619248128

2025-03-17 16:19:14.270 UTC client backend[5526] pg_regress/temp LOG:  statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  NlocalPinnedBuffers=98++
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  NlocalPinnedBuffers=99--
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  pgsr create 0xde34f1f57f8: test_temp, max_pinned: 100, NLocPin: 98
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG:  pgsr 0xde34f1f57f8: buffer_limit: 2, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 98

comparing that with a local run:

2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp LOG:  statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr create 0x56029555cad8: test_temp, max_pinned: 100, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG:  pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1

So one thing is that the pin count differs by 1 at the start of the scan. No
idea why.


But knowing that, I was able to repro the issue locally too, by just ensuring
the pin count is one higher at the start of the query.


I still don't know what drives the difference between freebsd and the rest,
but IIUC the reason this fails is just that we are holding too many buffers
pinned, due to some buffers being pinned outside of read_stream.c.


Greetings,

Andres Freund



Re: Some read stream improvements

From
Thomas Munro
Date:
On Tue, Mar 18, 2025 at 5:56 AM Andres Freund <andres@anarazel.de> wrote:
> So one thing is that the pin count differs by 1 at the start of the scan. No
> idea why.
>
> I still don't know what drives the difference between freebsd and the rest,
> but IIUC the reason this fails is just that we are holding too many buffers
> pinned, due to some buffers being pinned outside of read_stream.c.

I couldn't reproduce this on my local FreeBSD box, but I think I see
one part of the problem: the cursor a few lines up has a stream with a
higher distance holding a couple of pins.  Not sure how the local
buffer pool got into a state that caused that if it isn't doing the
same on other machines, but anyway, if I read the test right you
intend to pin strictly one page per cursor, so I tried saying so
explicitly:

-       query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid >= '( %s, 1)'::tid $q$, cursorname, i);
+       query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM test_temp WHERE ctid between '(%s, 1)'::tid and '(%s, 9999)'::tid $q$, cursorname, i, i);

That passed on the FreeBSD CI task.



Re: Some read stream improvements

From
Andres Freund
Date:
Hi,

On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
> I have pushed the new pin limit patches, after some more testing and
> copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
> change I'd like to make but it didn't belong in this commit) and
> dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
> comprehension and isn't even true for the local buffer version (which
> I still think might be an issue, will come back to that, but again it
> seemed independent).

I found an easy-to-fix issue in read_stream.c.  If the limit returned by
GetAdditionalPinLimit() is very large, the buffer_limit variable in
read_stream_start_pending_read() can overflow.

While the code is careful to limit buffer_limit to PG_INT16_MAX, we subsequently
add the number of forwarded buffers:

    if (stream->temporary)
        buffer_limit = Min(GetAdditionalLocalPinLimit(), PG_INT16_MAX);
    else
        buffer_limit = Min(GetAdditionalPinLimit(), PG_INT16_MAX);
    Assert(stream->forwarded_buffers <= stream->pending_read_nblocks);
    buffer_limit += stream->forwarded_buffers;
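
For example, with a large enough GetAdditionalPinLimit() result the
clamp-then-add wraps around -- a stand-alone illustration with made-up
numbers, not the actual stream state:

#include <stdio.h>
#include <stdint.h>

#define PG_INT16_MAX 0x7FFF
#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
    uint32_t    additional_pin_limit = 1000000; /* "very large" limit */
    int16_t     forwarded_buffers = 4;
    int16_t     buffer_limit;

    buffer_limit = Min(additional_pin_limit, PG_INT16_MAX);    /* 32767 */
    buffer_limit += forwarded_buffers;  /* 32771 doesn't fit in int16 */

    printf("buffer_limit = %d\n", buffer_limit);    /* typically -32765 */
    return 0;
}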



I think there's a second, theoretical, overflow issue shortly after:

        /* Shrink distance: no more look-ahead until buffers are released. */
        new_distance = stream->pinned_buffers + buffer_limit;
        if (stream->distance > new_distance)
            stream->distance = new_distance;

while that was hit in the case I was looking at, it was only hit because
buffer_limit had already wrapped around into the negative.  I don't think
nblocks can be big enough to really hit this.


I don't actually see any reason for buffer_limit to be a 16-bit quantity? It's
just to clamp things down, right?

Greetings,

Andres Freund



Re: Some read stream improvements

From
Thomas Munro
Date:
On Thu, Apr 3, 2025 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
> I don't actually see any reason for buffer_limit to be a 16bit quantity? It's
> just to clamp things down, right?

Ugh.  It might be worth just flipping this whole thing over to ints;
let me look into that...
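
Presumably something along these lines, i.e. just widening the clamped
value (a sketch of the direction based on the lines you quoted, not a
tested patch):

    int         buffer_limit;   /* plain int: wide enough for clamp + add */

    if (stream->temporary)
        buffer_limit = Min(GetAdditionalLocalPinLimit(), PG_INT16_MAX);
    else
        buffer_limit = Min(GetAdditionalPinLimit(), PG_INT16_MAX);
    Assert(stream->forwarded_buffers <= stream->pending_read_nblocks);
    buffer_limit += stream->forwarded_buffers;  /* can no longer wrap */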