Thread: Streaming relation data out of order
Hi,

This idea is due to Robert Haas, who complained that he feared that the streaming I/O API already worked like this. It doesn't, but it could! Here is a concept patch to try it out.

Normally, read_stream_next_buffer() spits out buffers in the order that the user's callback generated block numbers. This option says that any order would be OK.

I had been assuming that this sort of thing would come with real asynchronous I/O: if I/Os complete out of order and the caller explicitly said she doesn't care about block order, we can stream them as the completion events arrive. But even with synchronous I/O, we could stream already-in-cache blocks before ones that require I/O. Letting the caller chew on blocks that are already available maximises the time between fadvise() and preadv() for misses, which minimises the likelihood that the process will have to go to "D" sleep.

The patch is pretty trivial: if started with the READ_STREAM_OUT_OF_ORDER flag, "hit" buffers are allowed to jump in front of "miss" buffers in the queue. The attached coding may not be optimal, it's just a proof of concept.

ANALYZE benefits from this, for example, with certain kinds of partially cached initial states and fast disks (?). I'm not sure how generally useful it is though. I'm posting it because I wonder if it could be interesting for the streaming bitmap heapscan project, and I wonder how many other things don't care about the order.
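[Editor's note: the "hits jump in front of misses" behaviour can be illustrated with a toy model. This is a standalone sketch, not the actual read_stream.c queue logic; deliver() and its parameters are invented names for illustration only.]

```c
/*
 * Toy model of the hypothetical READ_STREAM_OUT_OF_ORDER behaviour
 * described above: among the blocks currently in the lookahead window,
 * cache hits may be delivered before an earlier block that needs I/O,
 * letting the consumer work while the misses' I/O is in flight.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Deliver 'n' blocks into 'out'.  is_hit[i] says whether blocks[i] was
 * found in cache.  In-order mode preserves callback order; out-of-order
 * mode delivers all hits first (already available), then the misses
 * (imagined to arrive as their I/O completes).  Returns count written.
 */
size_t
deliver(const int *blocks, const bool *is_hit, size_t n,
		bool out_of_order, int *out)
{
	size_t		k = 0;

	if (!out_of_order)
	{
		for (size_t i = 0; i < n; i++)
			out[k++] = blocks[i];
		return k;
	}

	/* hits jump the queue: they are available immediately */
	for (size_t i = 0; i < n; i++)
		if (is_hit[i])
			out[k++] = blocks[i];

	/* misses follow once their (simulated) I/O completes */
	for (size_t i = 0; i < n; i++)
		if (!is_hit[i])
			out[k++] = blocks[i];
	return k;
}
```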
On Tue, Apr 9, 2024 at 12:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This idea is due to Robert Haas, who complained that he feared that
> the streaming I/O API already worked like this. It doesn't, but it
> could! Here is a concept patch to try it out.

Oh, no, it's all my fault! My favorite part of this patch is the comment added to read_stream.h, which IMHO makes things a whole lot clearer. I have no opinion on the rest of it; I don't understand this code.

--
Robert Haas
EDB: http://www.enterprisedb.com
Here is a more fleshed out version of this concept patch, now that we have lots of streams wired up to try it with. Bitmap Heap Scan seemed to be a good candidate.

postgres=# create table t (a int unique, b int unique);
CREATE TABLE
postgres=# insert into t select generate_series(1, 100000), generate_series(1, 100000);
INSERT 0 100000
postgres=# select ctid from t where (a between 1 and 8000 or b between 1 and 8000) and ctid::text like '%,1)';
  ctid
--------
 (0,1)
 (1,1)
 (2,1)
 ... pages in order ...
 (33,1)
 (34,1)
 (35,1)
(36 rows)

postgres=# select pg_buffercache_evict(bufferid) from pg_buffercache where relfilenode = pg_relation_filenode('t'::regclass) and relblocknumber % 3 != 2;
... some pages being kicked out to create a mixed hit/miss scan ...

postgres=# select ctid from t where (a between 1 and 8000 or b between 1 and 8000) and ctid::text like '%,1)';
  ctid
--------
 (0,1)
 (1,1)
 (2,1)
 ... order during early distance ramp-up ...
 (12,1)
 (20,1)    <-- chaos reigns
 (23,1)
 (17,1)
 (26,1)
 (13,1)
 (29,1)
 ...
 (34,1)
(36 rows)

One weird issue, not just with reordering, is that read_stream.c's distance cools off too easily with some simple test patterns. Start at 1, double on miss, subtract one on hit, repeat, and you can be trapped alternating between 1 and 2 when you'd certainly benefit from IO concurrency and also reordering. It may need a longer memory. That seemed like too artificial a problem to worry about for v18, but it's why I used relblocknumber % 3 != 2 and not e.g. relblocknumber % 2 != 1 above.
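[Editor's note: the trap in the last paragraph can be reproduced with a toy simulation of the heuristic as described (double on miss, subtract one on hit, floor of 1). This is a standalone sketch, not read_stream.c's actual code; simulate_max_distance and max_distance are invented names.]

```c
/*
 * Toy simulation of the lookahead distance heuristic described above:
 * distance starts at 1, doubles on a miss (clamped to max_distance),
 * and drops by one on a hit (never below 1).  With a strictly
 * alternating hit/miss pattern it never escapes the 1..2 range.
 */
#include <assert.h>
#include <stdbool.h>

/* Run the heuristic over a hit/miss pattern; return the largest distance seen. */
int
simulate_max_distance(const bool *is_hit, int nblocks, int max_distance)
{
	int			distance = 1;
	int			max_seen = 1;

	for (int i = 0; i < nblocks; i++)
	{
		if (is_hit[i])
		{
			if (distance > 1)
				distance--;		/* cool off on a hit */
		}
		else
		{
			distance *= 2;		/* heat up on a miss */
			if (distance > max_distance)
				distance = max_distance;
		}
		if (distance > max_seen)
			max_seen = distance;
	}
	return max_seen;
}
```

With the relblocknumber % 2 != 1 eviction pattern (miss, hit, miss, hit, ...) the distance is trapped at 2, while the % 3 != 2 pattern (miss, miss, hit, ...) ramps up.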
Hi,

On 2025-04-09 19:17:00 +1200, Thomas Munro wrote:
> One weird issue, not just with reordering, is that read_stream.c's
> distance cools off too easily with some simple test patterns. Start
> at 1, double on miss, subtract one on hit, repeat, and you can be
> trapped alternating between 1 and 2 when you'd certainly benefit from
> IO concurrency and also reordering. It may need a longer memory.

Just having a hit in itself really is no reason to reduce the distance IMO. It's completely normal to have a few hits when doing readahead, but a few hits in a row doesn't mean the IO latency of the subsequent miss doesn't matter anymore. But if our entire lookahead consists only of hits, we actually are in a pattern with no misses for the moment, and can reduce the distance.

Another issue is that we, afaict, overweigh hits rather substantially due to IO combining. A single one-block hit (hits are never bigger) decreases the distance, but a read of size N only affects the distance once, not N times.

I.e. what I am suggesting is something like:

1) Increase distance by * 2 + read_size on a miss

   This would help us to increase with distance = 1, where * 2 only increases distance by 1, just as -1 obviously reduces it by 1.

   It'd also somewhat address the over-weighing of hits compared to misses.

2) Only reduce distance if we have no outstanding IOs

> Subject: [PATCH v2 1/4] Move read stream modulo arithmetic into functions.
>
> Several places have open-coded circular index arithmetic. Make some
> common functions for better readability and consistent assertion
> checking. This avoids adding yet more open-coding in later patches.
> ---
>  src/backend/storage/aio/read_stream.c | 93 +++++++++++++++++++++------
>  1 file changed, 74 insertions(+), 19 deletions(-)
>
> diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> index 0e7f5557f5c..0d3dd65bfed 100644
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -216,6 +216,67 @@ read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
>  	stream->buffered_blocknum = blocknum;
>  }
>
> +/*
> + * Increment buffer index, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer(ReadStream *stream, int16 *index)
> +{
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +
> +	*index += 1;
> +	if (*index == stream->queue_size)
> +		*index = 0;
> +}
> +
> +/*
> + * Increment buffer index by n, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> +{
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +	Assert(n <= stream->io_combine_limit);
> +
> +	*index += n;
> +	if (*index >= stream->queue_size)
> +		*index -= stream->queue_size;
> +
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +}

Not sure if it's possible to have a large enough queue_size, but this doesn't look like it's safe against *index + n > PG_INT16_MAX. The value would wrap into the negative.

I think all these variables really ought to be unsigned. None of them ever could be negative afaict. Afaict that would lead to the correct behaviour when wrapping around.

Greetings,

Andres Freund
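[Editor's note: rule 1 proposed above can be sketched as follows. This is a toy model of the suggested update rule, not committed code; grow_on_miss and its clamp are illustrative names.]

```c
/*
 * Sketch of rule 1 above: on a miss, grow the lookahead distance by
 * "* 2 + read_size" instead of a plain "* 2".  That lets a distance of 1
 * grow by more than the single step that 1 * 2 gives (symmetric with the
 * -1 on a hit), and makes a combined read of N blocks count N times.
 */
#include <assert.h>

int
grow_on_miss(int distance, int read_size, int max_distance)
{
	distance = distance * 2 + read_size;
	if (distance > max_distance)
		distance = max_distance;	/* clamp, as the real code must */
	return distance;
}
```

From distance = 1 with a single-block read this yields 3 rather than 2, escaping the 1-2 oscillation; larger combined reads push it up faster still.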
On Thu, Apr 10, 2025 at 7:35 AM Andres Freund <andres@anarazel.de> wrote:
> 1) Increase distance by * 2 + read_size on a miss
>
>    This would help us to increase with distance = 1, where * 2 only increases
>    distance by 1, just as -1 obviously reduces it by 1.
>
>    It'd also somewhat address the over-weighing of hits compared to misses.
>
> 2) Only reduce distance if we have no outstanding IOs

Interesting. That's similar in some ways to one of my own attempts: I had an extra counter called sustain, which had to be counted down to zero before distance started decrementing. I didn't want the distance itself to get higher than it already did by doubling once per (combined) IO, I just wanted to defer the decay. (Names from an earlier abandoned attempt to use synthesiser terminology: attack, decay, sustain, release; cute but useless.) I tried a few different recipes, but just adding read sizes to sustain works out pretty close to your pleasingly simple rule 2: sustain until we can't see any IOs in the lookahead window. But I also tried adding a bit more for longer sustain, sustain += read_size * 2 or whatever, so you don't give up your window so easily on a run of hits, and it also achieves escape velocity from the gravity of 1. IDK, I hadn't formed any strong opinions yet and need to test more stuff, hence the lack of concrete patches. Thanks, this was helpful, I'll play around with it!

> > +/*
> > + * Increment buffer index by n, wrapping around at queue size.
> > + */
> > +static inline void
> > +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> > +{
> > +	Assert(*index >= 0);
> > +	Assert(*index < stream->queue_size);
> > +	Assert(n <= stream->io_combine_limit);
> > +
> > +	*index += n;
> > +	if (*index >= stream->queue_size)
> > +		*index -= stream->queue_size;
> > +
> > +	Assert(*index >= 0);
> > +	Assert(*index < stream->queue_size);
> > +}
>
> Not sure if it's possible to have a large enough queue_size, but this
> doesn't look like it's safe against *index + n > PG_INT16_MAX. The value
> would wrap into the negative.

No, because the queue is sized such that indexes into this overflow space also fit under INT16_MAX:

	/*
	 * If starting a multi-block I/O near the end of the queue, we might
	 * temporarily need extra space for overflowing buffers before they are
	 * moved to regular circular position.  This is the maximum extra space we
	 * could need.
	 */
	queue_overflow = io_combine_limit - 1;

The patch clearly lacks a comment to explain that mystery. Will add. Thanks!

> I think all these variables really ought to be unsigned. None of them ever
> could be negative afaict.

Maybe... I tend to default to signed variables when there isn't a good reason to prefer unsigned. But I have also been mulling a change here since:

commit 06fb5612c970b3af95aca3db5a955669b07537ca
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Wed Mar 19 15:23:12 2025 +1300

    Increase io_combine_limit range to 1MB.

Unfortunately uint16 still needs to worry about UINT16_MAX (effective_io_concurrency = 1000 times io_combine_limit = 128 gives 128,000 buffers + a few, still one bit short; before the above commit the cap was 32,000 buffers + a few, which is why int16 was enough when the code was written). Maybe we should just go directly to int, or uint32 if you insist, and stop worrying about overflow.
I think we agreed that 128,000 buffers = 1000MB of in-progress I/O seemed unrealistic and 32K ought to be enough for anybody™, but really we're talking about saving a handful of bytes in struct ReadStream in exchange for a risk of bugs. Will write a patch to try that soon...
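[Editor's note: the sizing arithmetic in the last two messages can be checked mechanically. This is a standalone sketch; max_queue_entries is an illustrative name, not a real function, and it models the queue as in-flight buffers plus the io_combine_limit - 1 overflow slots mentioned above.]

```c
/*
 * Back-of-envelope check: the queue must hold up to
 * effective_io_concurrency * io_combine_limit buffers, plus
 * io_combine_limit - 1 overflow slots for a multi-block I/O started
 * near the end of the circular queue.  With the old 256KB cap
 * (io_combine_limit = 32) that fit in int16; with the new 1MB cap
 * (io_combine_limit = 128) it no longer fits even in uint16.
 */
#include <assert.h>
#include <stdint.h>

int32_t
max_queue_entries(int32_t eic, int32_t io_combine_limit)
{
	return eic * io_combine_limit + (io_combine_limit - 1);
}
```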