Re: Streaming relation data out of order - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming relation data out of order
Date
Msg-id CA+hUKG+tiy_u5fpf_c0cNHcZ5ZQOZ7A7a48SHY8nrU7AhQQP6Q@mail.gmail.com
Whole thread Raw
In response to Re: Streaming relation data out of order  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Thu, Apr 10, 2025 at 7:35 AM Andres Freund <andres@anarazel.de> wrote:
> 1) Increase distance by * 2 + read_size on a miss
>
>    This would help us to increase with distance = 1, where * 2 only increases
>    distance by 1, just as -1 obviously reduces it by 1.
>
>    It'd also somewhat address the over-weighing of hits compared to misses.
>
> 2) Only reduce distance if we have no outstanding IOs

Interesting.  That's similar in some ways to one of my own attempts:
I had an extra thing called sustain, and that had to be counted down
to zero before distance started decrementing.  I didn't want the
distance itself to get higher than it already did by doubling once per
(combined) IO, I just wanted to defer the decay.  (Names from an
earlier abandoned attempt to use synthesiser terminology: attack,
decay, sustain, release; cute but useless.)  I tried a few different
recipes, but just adding read sizes to sustain works out pretty close
to your pleasingly simple rule 2: sustain until we can't see any IOs
in the lookahead window.  But I also tried adding a bit more for
longer sustain, sustain += read_size * 2 or whatever, so you don't
give up your window so easily on a run of hits, and it also achieve
escape velocity from the gravity of 1.  IDK, I hadn't formed any
strong opinions yet and need to test more stuff, hence lack of
concrete patches.  Thanks, this was helpful, I'll play around with it!

> > +/*
> > + * Increment buffer index by n, wrapping around at queue size.
> > + */
> > +static inline void
> > +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> > +{
> > +     Assert(*index >= 0);
> > +     Assert(*index < stream->queue_size);
> > +     Assert(n <= stream->io_combine_limit);
> > +
> > +     *index += n;
> > +     if (*index >= stream->queue_size)
> > +             *index -= stream->queue_size;
> > +
> > +     Assert(*index >= 0);
> > +     Assert(*index < stream->queue_size);
> > +}
>
> Not sure if that's it's possible to have a large enough queue_size, but this
> doesn't look like it's safe against *index + n > PG_INT16_MAX. The value would
> wrap into the negative.

No, because the queue is sized such that indexes into this overflow
space also fit under INT16_MAX:

    /*
     * If starting a multi-block I/O near the end of the queue, we might
     * temporarily need extra space for overflowing buffers before they are
     * moved to regular circular position.  This is the maximum extra space we
     * could need.
     */
    queue_overflow = io_combine_limit - 1;

The patch clearly lacks a comment to explain that mystery.  Will add.  Thanks!

> I think all these variables really ought to be unsigned. None of them ever
> could be negative afaict.

Maybe... I tend to default to signed variables when there isn't a good
reason to prefer unsigned.  But I have also been mulling a change here
since:

commit 06fb5612c970b3af95aca3db5a955669b07537ca
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Wed Mar 19 15:23:12 2025 +1300

    Increase io_combine_limit range to 1MB.

Unfortunately uint16 still needs to worry about UINT16_MAX
(effective_io_concurrency = 1000 * io_combine_limit = 128 = 128,000
buffers + a few, still one bit short; before the above commit the cap
was 32,000 buffers + a few, which is why int16 was enough when the
code was written).  Maybe we should just go directly to int, or uint32
if you insist, and stop worrying about overflow.  I think we agreed
that 128,000 buffers = 1000MB of in-progress I/O seemed unrealistic
and 32K ought to be enough for anybody™, but really we're talking
about saving a handful of bytes in struct ReadStream in exchange for a
risk of bugs.  Will write a patch to try that soon...



pgsql-hackers by date:

Previous
From: Sami Imseih
Date:
Subject: Re: n_ins_since_vacuum stats for aborted transactions
Next
From: Michael Paquier
Date:
Subject: Re: sync_standbys_defined read/write race on startup