Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Msg-id: CA+hUKG+00TZ=POYk309j0pPSKcLe1vQOZ5=4iTsEZoU9dxQDoA@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Streaming I/O, vectored I/O (WIP); Re: Streaming I/O, vectored I/O (WIP)
List: pgsql-hackers
On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Thanks for posting a new version. I've included a review of 0004.

Thanks! I committed the patches up as far as smgr.c before the
holidays. The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001. Here's my response to your review of
that, which triggered quite a few changes. See also the new version of
the streaming_read.c patch, with a change list at the end. (I'll talk
about v5-0002, the SMgrRelation lifetime one, over on the separate
thread about that, where Heikki posted a better version.)

> ot -> to

Fixed.

> > -       if (found)
> > -               pgBufferUsage.shared_blks_hit++;
> > -       else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> > -                        mode == RBM_ZERO_ON_ERROR)
> > -               pgBufferUsage.shared_blks_read++;
>
> You've lost this test in your new version. You can do the same thing
> (avoid counting zeroed buffers as blocks read) by moving this
> pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
> where you know if you called ZeroBuffer() or CompleteReadBuffers().

Yeah, right. After thinking about that some more: that's true, and
that placement would be good, but only if we look just at this patch,
where we chop ReadBuffer() into two parts (PrepareReadBuffer() and
CompleteReadBuffers()) and then put them back together again. However,
soon we'll want to use the two functions separately, and we won't call
ReadBuffer[_common]().

New idea: PrepareReadBuffer() can continue to be in charge of bumping
{local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in
CompleteReadBuffers(). There is no counter for zeroed buffers, but if
there ever is one in the future, it can go into ZeroBuffer(). In this
version, I've moved that into CompleteReadBuffers(), along with a new
comment to explain a pre-existing deficiency in the whole scheme:
there is a race where you finish up counting a read but someone else
actually does the read, and also counts it. I'm trying to preserve the
existing bean-counting logic to the extent possible across this
refactoring.

> > +       }
> > +       else
> > +       {
> > +               bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
> > +                                                        strategy, found, allocated, io_context);
> > +               if (*found)
> > +                       pgBufferUsage.shared_blks_hit++;
> > +               else
> > +                       pgBufferUsage.shared_blks_read++;
> > +       }
> > +       if (bmr.rel)
> > +       {
> > +               pgstat_count_buffer_read(bmr.rel);
>
> This is double-counting reads. You've left the call in
> ReadBufferExtended() as well as adding this here. It should be fine
> to remove it from ReadBufferExtended(). Because you test bmr.rel,
> leaving the call here in PrepareReadBuffer() wouldn't have an effect
> on ReadBuffer_common() callers who don't pass a relation (like
> recovery). The other current callers of ReadBuffer_common() (by way
> of ExtendBufferedRelTo()) who do pass a relation are visibility map
> and freespace map extension, and I don't think we track relation
> stats for the VM and FSM.

Oh yeah. Right. Fixed.

> This does continue the practice of counting zeroed buffers as reads
> in table-level stats. But, that is the same as master.

Right. It is a little strange that pgstat_count_buffer_read() finishes
up in a different location than
pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to
this pre-existing difference in accounting policy. That generally
seems like a POLA failure, so I've added a comment to help us remember
about that, for another day.
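To make the intended division of labour concrete, here's a
hypothetical caller-side sketch; the argument lists of
PrepareReadBuffer() and CompleteReadBuffers() below are simplified
guesses for illustration, not the exact v5-0001 signatures:

    /*
     * Hypothetical illustration only: signatures are approximate.
     */
    static Buffer
    read_one_block(BufferManagerRelation bmr, ForkNumber forknum,
                   BlockNumber blocknum, BufferAccessStrategy strategy)
    {
        bool        found;
        Buffer      buffer;

        /* Pin the buffer; bumps {local,shared}_blks_hit on a cache hit. */
        buffer = PrepareReadBuffer(bmr, forknum, blocknum, strategy, &found);

        if (!found)
        {
            /*
             * Perform the I/O; bumps {local,shared}_blks_read here,
             * where the read actually happens (modulo the counting
             * race noted above).
             */
            CompleteReadBuffers(bmr, &buffer, forknum, blocknum, 1,
                                false /* zero_on_error */, strategy);
        }

        return buffer;
    }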
> > +       io_start = pgstat_prepare_io_time();
> > +       smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
> > +       pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
>
> I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
> will be BLCKSZ and multiplying that by the number of reads should
> produce the number of bytes read.

OK, thanks, fixed.
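With the fix, the hunk becomes, roughly (reconstructed from the quoted
lines above):

    io_start = pgstat_prepare_io_time();
    smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);

    /*
     * Pass io_buffers_len as cnt: op_bytes is BLCKSZ, so the stats
     * machinery can derive the bytes read as BLCKSZ * cnt.
     */
    pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
                            io_buffers_len);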
> > BufferDesc *
> > LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> > -                                bool *foundPtr)
> > +                                bool *foundPtr, bool *allocPtr)
> > {
> >         BufferTag       newTag;                 /* identity of requested block */
> >         LocalBufferLookupEnt *hresult;
> > @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >                 Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
> >
> >                 *foundPtr = PinLocalBuffer(bufHdr, true);
> > +               *allocPtr = false;
> ...
>
> I would prefer you use consistent naming for
> allocPtr/allocatedPtr/allocated. I also think that all the functions
> taking it as an output argument should explain what it is
> (BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit
> of digging around to figure it out. You have a nice comment about it
> above PrepareReadBuffer(). I think you may need to resign yourself to
> restating that bit (or some version of it) for all of the functions
> taking it as an argument.

I worked on addressing your complaints, but while doing so I had
second thoughts about even having that argument. The need for it came
up while working on streaming recovery. Without it, whenever there
were repeated references to the same block (as happens very often in
the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I
wanted to find a way to distinguish the *first* miss for a given
block, to be able to issue the advice just once before the actual read
happens. But now I see that other users of streaming reads probably
won't repeatedly stream the same block, and I am not proposing
streaming recovery for PG 17. Time to simplify. I decided to kick
allocatedPtr out and think about it some more, but I really hope we'll
be able to start real background I/O instead of issuing advice in the
PG 18 proposals, and that'll be negotiated via IO_IN_PROGRESS, so then
we'd never need allocatedPtr. Thus, removed.

> > #ifndef BUFMGR_H
> > #define BUFMGR_H
> >
> > +#include "pgstat.h"
>
> I don't know what we are supposed to do, but I would have included
> this in bufmgr.c (where I actually needed it) instead of including it
> here.

Fixed.

> > +#include "port/pg_iovec.h"
> >  #include "storage/block.h"
> >  #include "storage/buf.h"
> >  #include "storage/bufpage.h"
> > @@ -47,6 +49,8 @@ typedef enum
> >         RBM_ZERO_AND_CLEANUP_LOCK,      /* Like RBM_ZERO_AND_LOCK, but locks the page
> >                                                                          * in "cleanup" mode */
> >         RBM_ZERO_ON_ERROR,                      /* Read, but return an all-zeros page on error */
> > +       RBM_WILL_ZERO,                          /* Don't read from disk, caller will call
> > +                                                                        * ZeroBuffer() */
>
> It's confusing that this (RBM_WILL_ZERO) is part of this commit since
> it isn't used in this commit.

Yeah. Removed.

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(); they seem a little
grammatically clumsy.

I now also have a much simplified version of the streaming read patch.
The short version is that it has all the advanced features removed, so
that now it *only* does the clustering required to build up large
CompleteReadBuffers() calls. That's short and sweet, and enough for
pg_prewarm to demonstrate 128KB reads, a good first step. Then
WILLNEED advice and ramp-up are added as separate patches, for easier
review. I've got some more patches in development that would re-add an
"extended" multi-relation mode with a wider callback that can also
stream zeroed buffers, as required so far only by recovery -- but I
can propose that later.

Other improvements:

* Melanie didn't like the overloaded term "cluster" (off-list
  feedback). Now I use "block range" to describe a range of contiguous
  blocks (blocknum, nblocks).
* KK didn't like "prefetch" being used with various meanings (off-list
  feedback). Better words found.
* pgsr_private might as well be void *; uintptr_t is theoretically
  more general but doesn't seem to buy much for realistic uses.
* per_io_data might as well be called per_buffer_data (it's not per
  "I/O", and that isn't the level of abstraction for this API anyway,
  which is about blocks and buffers).
* I reordered some function arguments that jumped out after the above
  changes.
* Once I started down the path of using a flags field to control
  various policy stuff as discussed up-thread, it started to seem
  clearer that callers probably shouldn't directly control I/O depth,
  which was all a bit ad hoc and unfinished before. I think we'd
  likely want that to be centralised. New idea: it should be enough to
  be able to specify policies as required now and in the future with
  flags. Thoughts?

I've also included 4 of the WIP patches from earlier (based on an
obsolete version of the vacuum thing, sorry, I know you have a much
better one now), which can be mostly ignored. Here I just wanted to
share working code to drive vectored reads with different access
patterns, and also show the API interactions and how each caller might
set flag bits, as sketched below. For example, parallel seq scan uses
PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite
appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read
the whole relation so there is no point in ramping up, and the
vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based
on maintenance_io_concurrency instead of effective_io_concurrency.
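For a rough feel of the interaction, here's a hypothetical
pg_prewarm-style sketch. Only the PGSR_FLAG_* names, pgsr_private and
per_buffer_data come from this thread; the PgStreamingRead type, the
pg_streaming_read_* function names and the callback signature are my
shorthand, not necessarily what v5-0003 actually declares:

    /*
     * Hypothetical sketch; see v5-0003/v5-0004 for the real interface.
     */
    typedef struct PrewarmReadState
    {
        BlockNumber next_block;     /* next block to hand to the stream */
        BlockNumber nblocks;        /* relation size in blocks */
    } PrewarmReadState;

    /* Callback: supply the next block number, or end the stream. */
    static BlockNumber
    prewarm_next_block(PgStreamingRead *pgsr, void *pgsr_private,
                       void *per_buffer_data)
    {
        PrewarmReadState *state = (PrewarmReadState *) pgsr_private;

        if (state->next_block >= state->nblocks)
            return InvalidBlockNumber;      /* end of stream */
        return state->next_block++;
    }

    static void
    prewarm_relation(BufferManagerRelation bmr, ForkNumber forknum,
                     BlockNumber nblocks, BufferAccessStrategy strategy)
    {
        PrewarmReadState state = {0, nblocks};
        PgStreamingRead *pgsr;
        Buffer      buf;

        /* PGSR_FLAG_FULL: whole relation will be read, so no ramp-up. */
        pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL, &state,
                                              0,  /* no per-buffer data */
                                              strategy, bmr, forknum,
                                              prewarm_next_block);

        while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) !=
               InvalidBuffer)
            ReleaseBuffer(buf);     /* buffer is now pinned and warm */

        pg_streaming_read_free(pgsr);
    }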
Attachment
- v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v5-0003-Provide-API-for-streaming-reads-of-relations.patch
- v5-0004-Use-streaming-reads-in-pg_prewarm.patch
- v5-0005-Issue-advice-for-random-streaming-reads.patch
- v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patch
- v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patch
- v5-0008-WIP-Use-streaming-reads-in-vacuum.patch
- v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patch