Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKG+2mMiyJEDV-rNb5q__x9VJW5p8oJpGn9ZpovbyoVe1VA@mail.gmail.com
Whole thread Raw
In response to Re: Streaming I/O, vectored I/O (WIP)  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Streaming I/O, vectored I/O (WIP)
Re: Streaming I/O, vectored I/O (WIP)
Re: Streaming I/O, vectored I/O (WIP)
List pgsql-hackers
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the other
> way round?

Yeah.  People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >                                                          void *pgsr_private,
> >                                                          size_t per_buffer_data_size,
> >                                                          BufferAccessStrategy strategy,
> >                                                          BufferManagerRelation bmr,
> >                                                          ForkNumber forknum,
> >                                                          PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it.  Done.

> Do we really need the pg_ prefix in these?

Good question.  My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things.  It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff.  So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use  "sr"
or "streaming_read" or "stream".  I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private.  I find it clearer, as a way to indicate that the
provider of the callback "owns" it.  I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API...  Hmm.  Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()?  Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that.

(Hmm, that'd be a bit like the Windows native file API, where
ReadFile() is synchronous or asynchronous depending on flags.)

> >
> >       /*
> >        * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
> >        * When it is full, there is an empty element between head and tail.  Head
> >        * can also be empty (nblocks == 0), therefore we need two extra elements
> >        * for non-occupied ranges, on top of max_pinned_buffers to allow for the
> >        * maxmimum possible number of occupied ranges of the smallest possible
> >        * size of one.
> >        */
> >       size = max_pinned_buffers + 2;
>
> I didn't understand this explanation for why it's + 2.

I think the logic was right but the explanation was incomplete, sorry.
It needed one gap between head and tail because head == tail means
empty, and another because the tail item was still 'live' for one
extra call until you examined it one more time to notice that all the
buffers had been extracted.  In any case I have now deleted all that.
In this new version I don't need extra space between head and tail at
all, because empty is detected with stream->pinned_buffers == 0, and
tail moves forwards immediately when you consume buffers.

> >       /*
> >        * Skip the initial ramp-up phase if the caller says we're going to be
> >        * reading the whole relation.  This way we start out doing full-sized
> >        * reads.
> >        */
> >       if (flags & PGSR_FLAG_FULL)
> >               pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> >       else
> >               pgsr->distance = 1;
>
> Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

Right, done.

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented.  I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB.  Seems like a reasonable thing to have?  Could also
influence things like bulk write?  (The main problem I have with the
GUC currently is choosing a category, async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range.  So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses.  In this version, there
are no more range objects.  This feels better and occupies and touches
less memory.  See pictures below.

* The 'head range' is replaced by the pending_read_{blocknum,nblocks}
variables, which seems easier to understand.  Essentially the
callback's block numbers are run-length compressed into there until
they can't be, at which point we start a read and start forming a new
pending read.

* A micro-optimisation arranges for the zero slots to be reused over
and over again if we're doing distance = 1, to avoid rotating through
memory for no benefit; I don't yet know if that pays for itself, it's
just a couple of lines...

* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)

In pictures, the circular queue arrangement changed from
max_pinned_buffers * fat range objects like this, where only the
per-buffer data was outside the range object (because its size is
variable), and in the worst case all-hit case there were a lot of
ranges with only one buffer in them:

                            ranges                             buf/data

                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+
                    tail -> | 10..10 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 42..44 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 60..60 | buffers[MAX] | op +--+   |  ?  |
                            +--------+--------------+----+  |   +-----+
                    head -> |        |              |    |  |   |  ?  |
                            +--------+--------------+----+  |   +-----+
                            |        |              |    |  +-->|  ?  |
                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+

... to something that you might call a "columnar" layout, where ops
are kicked out to their own array of size max_ios (a much smaller
number), and the buffers, per-buffer data and indexes pointing to
optional I/O objects are in parallel arrays of size
max_pinned_buffers, like this:

                            buffers buf/data buf/io       ios (= ops)

                            +----+  +-----+  +---+        +--------+
                            |    |  |     |  |   |  +---->| 42..44 |
                            +----+  +-----+  +---+  |     +--------+
     oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
                            +----+  +-----+  +---+  | |   +--------+
                            | 42 |  |  ?  |  | 0 +--+ |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 43 |  |  ?  |  |   |    |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 44 |  |  ?  |  |   |    |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 60 |  |  ?  |  | 1 +----+
                            +----+  +-----+  +---+
       next_buffer_index -> |    |  |     |  |   |
                            +----+  +-----+  +---+

In other words, there is essentially no waste/padding now, and in the
all-hit case we stay in the zero'th elements of those arrays so they
can stay red hot.  Still working on validating this refactoring with
other patches and test scenarios.  I hope it's easier to understand,
and does a better job of explaining itself.

I'm also still processing a bunch of performance-related fixups mostly
for bufmgr.c sent by Andres off-list (things like: StartReadBuffer()
argument list is too wide, some things need inline, we should only
initialise the op if it will be needed, oh I squashed that last one
into the patch already), after he and Bilal studied some regressions
in cases with no I/O.  And thinking about Bilal's earlier message
(extra read even when we're going to zero, oops, he's quite right
about that) and a patch he sent me for that.  More on those soon.

Attachment

pgsql-hackers by date:

Previous
From: ShadowGhost
Date:
Subject: Extension for PostgreSQL WIP
Next
From: "David G. Johnston"
Date:
Subject: Re: Extension for PostgreSQL WIP