Thread: Streaming I/O, vectored I/O (WIP)

Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
Hi,

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that.  This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".  That way,
the streaming infrastructure can look as far into the future as it
wants, and then:

 * systematically issue POSIX_FADV_WILLNEED for random access,
replacing patchy ad hoc advice
 * build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls

That's more efficient, and it goes faster.  It's better even on
systems without 'advice' and/or vectored I/O support, because some
I/Os can be merged into wider simple pread/pwrite calls, and various
other small efficiencies come from batching.
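
To make that concrete at the system call level, here is roughly what the
two techniques look like in plain POSIX terms (illustration only, not
code from the attached patches; error handling omitted):

#include <fcntl.h>
#include <sys/uio.h>

#define BLCKSZ 8192
#define MAX_COMBINE 16

/* Random access: tell the kernel we'll want this block soon. */
static void
hint_block(int fd, off_t blockno)
{
#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
#endif
}

/* Sequential access: read up to 16 consecutive blocks with one call. */
static ssize_t
read_block_run(int fd, off_t first_block, char (*pages)[BLCKSZ], int nblocks)
{
    struct iovec iov[MAX_COMBINE];

    if (nblocks > MAX_COMBINE)
        nblocks = MAX_COMBINE;
    for (int i = 0; i < nblocks; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }
    /* One preadv() instead of nblocks separate pread() calls. */
    return preadv(fd, iov, nblocks, first_block * BLCKSZ);
}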

The real goal, though, is to make it easier for later work to replace
the I/O subsystem with true asynchronous and concurrent I/O, as
required to get decent performance with direct I/O (and, at a wild
guess, the magic network smgr replacements that many of our colleagues
on this list work on).  Client code such as access methods wouldn't
need to change again to benefit from that, as it would be fully
insulated by the streaming abstraction.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.  That's the thing with
the most potential users throughout the source tree and AMs, and I've
attached some work-in-progress examples of half a dozen use cases.
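
To give a feel for the shape from a caller's point of view, here is a
rough sketch (the callback signature and the get_next/free functions are
as in the attached patches; the scan-state struct is invented for the
example, the constructor call is elided, and details like the
end-of-stream convention are approximate):

typedef struct MyScanState      /* hypothetical per-scan state */
{
    Relation    rel;
    BlockNumber next_block;
    BlockNumber nblocks;
} MyScanState;

/* Callback: say which block to read next; false means end of stream. */
static bool
my_scan_next_block(PgStreamingRead *pgsr, uintptr_t pgsr_private,
                   void *per_io_private,
                   BufferManagerRelation *bmr, ForkNumber *forkNum,
                   BlockNumber *blockNum, ReadBufferMode *mode)
{
    MyScanState *scan = (MyScanState *) pgsr_private;

    if (scan->next_block >= scan->nblocks)
        return false;

    *bmr = BMR_REL(scan->rel);
    *forkNum = MAIN_FORKNUM;
    *blockNum = scan->next_block++;
    *mode = RBM_NORMAL;
    return true;
}

/* Consumer: pull buffers out of the stream until it's exhausted. */
static void
my_scan(PgStreamingRead *pgsr)
{
    /* pgsr was constructed with my_scan_next_block as its callback */
    for (;;)
    {
        void       *per_io_private;
        Buffer      buf = pg_streaming_read_buffer_get_next(pgsr, &per_io_private);

        if (buf == InvalidBuffer)   /* assuming this signals end-of-stream */
            break;
        /* ... do something with the pinned buffer ... */
        ReleaseBuffer(buf);
    }
    pg_streaming_read_free(pgsr);
}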

=== 1. Vectored I/O through the layers ===

 * Provide vectored variants of FileRead() and FileWrite().
 * Provide vectored variants of smgrread() and smgrwrite().
 * Provide vectored variant of ReadBuffer().
 * Provide multi-block smgrprefetch().

=== 2. Streaming read API ===

 * Give SMgrRelation pointers a well-defined lifetime.
 * Provide basic streaming read API.

=== 3. Example users of streaming read API ===

 * Use streaming reads in pg_prewarm. [TM]
 * WIP: Use streaming reads in heapam scans. [AF]
 * WIP: Use streaming reads in vacuum. [AF]
 * WIP: Use streaming reads in nbtree vacuum scan. [AF]
 * WIP: Use streaming reads in bitmap heapscan. [MP]
 * WIP: Use streaming reads in recovery. [TM]

=== 4. Some less developed work on vectored writes ===

 * WIP: Provide vectored variant of FlushBuffer().
 * WIP: Use vectored writes in checkpointer.

All of these are WIP; those marked WIP above are double-WIP.  But
there's enough to demo the concept and discuss.  Here are some
assorted notes:

 * probably need to split block-count and I/O-count in stats system?
 * streaming needs to "ramp up", instead of going straight to big reads
 * the buffer pin limit is somewhat primitive
 * more study of buffer pool correctness required
 * 16 block/128KB size limit is not exactly arbitrary but not well
researched (by me at least)
 * various TODOs in user patches

A bit about where this code came from and how it relates to the "AIO"
project[1]:  The idea and terminology 'streaming I/O' are due to
Andres Freund.  This implementation of it is mine, and to keep this
mailing list fun, he hasn't reviewed it yet.  The example user patches
are by Andres, Melanie Plageman and myself, and were cherry picked
from the AIO branch, where they originally ran on top of Andres's
truly asynchronous 'streaming read', which is completely different
code.  It has (or will have) exactly the same API, but it does much
more, with much more infrastructure.  But the AIO branch is far too
much to propose at once.

We might have been a little influenced by a recent discussion on
pgsql-performance[2] that I could summarise as "why do you guys need
to do all this fancy AIO stuff, just give me bigger reads!".  That was
actually a bit of a special case, I think (something is wrong with
btrfs's prefetch heuristics?), but in conversation we realised that
converting parts of PostgreSQL over to a stream-oriented model could
be done independently of AIO, and could offer some nice incremental
benefits already.  So I worked on producing this code with an
identical API that just maps onto old-fashioned synchronous I/O
calls, except bigger and better.

The "example user" patches would be proposed separately in their own
threads after some more work, but I wanted to demonstrate the wide
applicability of this style of API in this preview.  Some of these
make use of the ability to attach a bit of extra data to each buffer
-- see Melanie's bitmap heapscan patch, for example.  In later
revisions I'll probably just pick one or two examples to work with for
a smaller core patch set, and then the rest can be developed
separately.  (We thought about btree scans too as a nice high value
area to tackle, but Tomas Vondra is hacking in that area and we didn't
want to step on his toes.)

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://www.postgresql.org/message-id/flat/218fa2e0-bc58-e469-35dd-c5cb35906064%40gmx.net

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 31/08/2023 07:00, Thomas Munro wrote:
> Currently PostgreSQL reads (and writes) data files 8KB at a time.
> That's because we call ReadBuffer() one block at a time, with no
> opportunity for lower layers to do better than that.  This thread is
> about a model where you say which block you'll want next with a
> callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as 
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

  13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by 
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

  4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I 
hoped that this would simplify it. I didn't look closely at that patch, 
so maybe it's simpler even though it's more code.

> There are more kinds of streaming I/O that would be useful, such as
> raw unbuffered files, and of course writes, and I've attached some
> early incomplete demo code for writes (just for fun), but the main
> idea I want to share in this thread is the idea of replacing lots of
> ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

> +    /* Avoid a slightly more expensive kernel call if there is no benefit. */
> +    if (iovcnt == 1)
> +        returnCode = pg_pread(vfdP->fd,
> +                              iov[0].iov_base,
> +                              iov[0].iov_len,
> +                              offset);
> +    else
> +        returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself? 
pg_preadv() is currently just a macro if the system provides preadv(), 
but it could be a "static inline" that does the above dance. I think 
that optimization is platform-dependent anyway, pread() might not be any 
faster on some OSs. In particular, if the system doesn't provide 
preadv() and we use the implementation in src/port/preadv.c, it's the 
same kernel call anyway.
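
Something like this is what I have in mind (just a sketch; the
no-preadv() fallback in src/port stays as it is):

#if HAVE_DECL_PREADV
static inline ssize_t
pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    /* Skip the vectored call when it can't buy us anything. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
    return preadv(fd, iov, iovcnt, offset);
}
#else
/* no preadv(): keep using src/port/preadv.c, which loops over pread() anyway */
extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
#endif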

> v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

> /*
>  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
>  * the returned buffer can be used immediately.  Otherwise, a physical read
>  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
>  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
>  * caller has the opportunity to coalesce reads of neighboring blocks into one
>  * CompleteReadBuffers() call.
>  *
>  * *found is set to true for a hit, and false for a miss.
>  *
>  * *allocated is set to true for a miss that allocates a buffer for the first
>  * time.  If there are multiple calls to PrepareReadBuffer() for the same
>  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
>  * read, then only the first such call will receive *allocated == true, which
>  * the caller might use to issue just one prefetch hint.
>  */
> Buffer
> PrepareReadBuffer(BufferManagerRelation bmr,
>                   ForkNumber forkNum,
>                   BlockNumber blockNum,
>                   BufferAccessStrategy strategy,
>                   bool *found,
>                   bool *allocated)
> 

If you decide you don't want to perform the read, after all, is there a 
way to abort it without calling CompleteReadBuffers()? Looking at the 
later patch that introduces the streaming read API, seems that it 
finishes all the reads, so I suppose we don't need an abort function. 
Does it all get cleaned up correctly on error?

> /*
>  * Convert an array of buffer address into an array of iovec objects, and
>  * return the number that were required.  'iov' must have enough space for up
>  * to PG_IOV_MAX elements.
>  */
> static int
> buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that nblocks 
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for 
up to PG_IOV_MAX elements, that's only because we also assume that 
nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to 
prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>         uintptr_t pgsr_private,
>         void *per_io_private,
>         BufferManagerRelation *bmr,
>         ForkNumber *forkNum,
>         BlockNumber *blockNum,
>         ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on 
each read. I see that you used that in the WAL replay prefetching, so I 
guess that makes sense.

> extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the 
WAL replay prefetching patch, and only after calling 
pg_streaming_read_reset(). Could pg_streaming_read_reset() call 
pg_streaming_read_prefetch() directly? Is there any need to "reset" the 
stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

> @@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
>   * If the WAL record contains a block reference with the given ID, *rlocator,
>   * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
>   * returns true.  Otherwise returns false.
> + *
> + * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
> + * of the pin is transferred to the caller.
>   */
>  bool
>  XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
> @@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
>      if (blknum)
>          *blknum = bkpb->blkno;
>      if (prefetch_buffer)
> +    {
>          *prefetch_buffer = bkpb->prefetch_buffer;
> +
> +        /*
> +         * Clear this flag is so that we can assert that redo records take
> +         * ownership of all buffers pinned by xlogprefetcher.c.
> +         */
> +        bkpb->prefetch_buffer = InvalidBuffer;
> +    }
>      return true;
>  }

Could these changes be committed independently of all the other changes?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Andres Freund
Date:
Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> I'm a bit disappointed and surprised by
> v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:
> 
>  4 files changed, 244 insertions(+), 78 deletions(-)
> 
> The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
> that this would simplify it. I didn't look closely at that patch, so maybe
> it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the
remainder could be simplified a lot - it's partially there because vacuumlazy
changed a lot over the last couple years and because a bit more refactoring is
needed.  I do think it's actually simpler in some ways - besides being more
efficient...


> > v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
> 
> No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid
ENOSPC for pages in the buffer pool the common pattern is to extend relations
with zeroes on disk, then populate those buffers in memory. It's possible that
you could use something like smgrextendv() when operating directly on the smgr
level - but then I suspect you're going to be better off to extend the
relation to the right size in one operation and then just use smgrwritev() to
write out the contents.


> > /*
> >  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> >  * the returned buffer can be used immediately.  Otherwise, a physical read
> >  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> >  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
> >  * caller has the opportunity to coalesce reads of neighboring blocks into one
> >  * CompleteReadBuffers() call.
> >  *
> >  * *found is set to true for a hit, and false for a miss.
> >  *
> >  * *allocated is set to true for a miss that allocates a buffer for the first
> >  * time.  If there are multiple calls to PrepareReadBuffer() for the same
> >  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> >  * read, then only the first such call will receive *allocated == true, which
> >  * the caller might use to issue just one prefetch hint.
> >  */
> > Buffer
> > PrepareReadBuffer(BufferManagerRelation bmr,
> >                   ForkNumber forkNum,
> >                   BlockNumber blockNum,
> >                   BufferAccessStrategy strategy,
> >                   bool *found,
> >                   bool *allocated)
> > 
> 
> If you decide you don't want to perform the read, after all, is there a way
> to abort it without calling CompleteReadBuffers()?

When would that be needed?


> Looking at the later patch that introduces the streaming read API, seems
> that it finishes all the reads, so I suppose we don't need an abort
> function. Does it all get cleaned up correctly on error?

I think it should.  The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.


> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> >         uintptr_t pgsr_private,
> >         void *per_io_private,
> >         BufferManagerRelation *bmr,
> >         ForkNumber *forkNum,
> >         BlockNumber *blockNum,
> >         ReadBufferMode *mode);
> 
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.


> > extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> > extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> > extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> > extern void pg_streaming_read_free(PgStreamingRead *pgsr);
> 
> Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
> replay prefetching patch, and only after calling pg_streaming_read_reset().
> Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
> Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas and I were having before...

Greetings,

Andres Freund



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Sep 28, 2023 at 9:13 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> > Looking at the later patch that introduces the streaming read API, seems
> > that it finishes all the reads, so I suppose we don't need an abort
> > function. Does it all get cleaned up correctly on error?
>
> I think it should.  The buffer error handling is one of the areas where I
> really would like to have some way of testing the various cases, it's easy to
> get things wrong, and basically impossible to write reliable tests for with
> our current infrastructure.

One thing to highlight is that this patch doesn't create a new state
in that case.  In master, we already have the concept of a buffer with
BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if
there is an I/O error.  Eventually another reader will try the I/O
again, or the buffer will fall out of the pool.  With this patch it's
the same, it's just a wider window: more kinds of errors might be
thrown in code between Prepare() and Complete() before we even have
BM_IO_IN_PROGRESS.  So there is nothing extra to clean up.  Right?

Yeah, it would be nice to test buffer pool logic directly.  Perhaps
with a C unit test framework[1] and pluggable smgr[2] we could mock up
cases like I/O errors...

> > > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> > >         uintptr_t pgsr_private,
> > >         void *per_io_private,
> > >         BufferManagerRelation *bmr,
> > >         ForkNumber *forkNum,
> > >         BlockNumber *blockNum,
> > >         ReadBufferMode *mode);
> >
> > I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> > each read. I see that you used that in the WAL replay prefetching, so I
> > guess that makes sense.
>
> Yea, that's the origin - I don't like it, but I don't really have a better
> idea.

Another idea I considered was that streams could be associated with a
single relation, but recovery could somehow manage a set of them.
From a certain point of view, that makes sense (we could be redoing
work that was created by multiple concurrent streams at 'do' time, and
with the approach shown here some clustering opportunities available
at do time are lost at redo time), but it's not at all clear that it's
worth the overheads or complexity, and I couldn't immediately figure
out how to do it.  But I doubt there would ever be any other users of
a single stream with multiple relations, and I agree that this is
somehow not quite satisfying...  Perhaps we should think about that
some more...

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > +     /* Avoid a slightly more expensive kernel call if there is no benefit. */
> > +     if (iovcnt == 1)
> > +             returnCode = pg_pread(vfdP->fd,
> > +                                                       iov[0].iov_base,
> > +                                                       iov[0].iov_len,
> > +                                                       offset);
> > +     else
> > +             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>
> How about pushing down this optimization to pg_preadv() itself?
> pg_preadv() is currently just a macro if the system provides preadv(),
> but it could be a "static inline" that does the above dance. I think
> that optimization is platform-dependent anyway, pread() might not be any
> faster on some OSs. In particular, if the system doesn't provide
> preadv() and we use the implementation in src/port/preadv.c, it's the
> same kernel call anyway.

Done.  I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already!  I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

> I don't see anything in the callers (mdreadv() and mdwritev()) to
> prevent them from passing nblocks > PG_IOV_MAX.

The outer loop in md*v() copes with segment boundaries and also makes
sure lengthof(iov) AKA PG_IOV_MAX isn't exceeded (though that couldn't
happen with the current caller).
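
Roughly this shape, paraphrasing rather than quoting the patch:

while (nblocks > 0)
{
    struct iovec iov[PG_IOV_MAX];
    int         iovcnt;
    int         nblocks_this_segment;

    /* Don't cross a segment boundary, and don't overflow iov[]. */
    nblocks_this_segment =
        Min(nblocks, (int) (RELSEG_SIZE - (blocknum % RELSEG_SIZE)));
    nblocks_this_segment = Min(nblocks_this_segment, PG_IOV_MAX);

    iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment);

    /* ... one FileReadV()/FileWriteV() call for this chunk ... */

    nblocks -= nblocks_this_segment;
    buffers += nblocks_this_segment;
    blocknum += nblocks_this_segment;
}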

> in streaming_read.h:
>
> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> >         uintptr_t pgsr_private,
> >         void *per_io_private,
> >         BufferManagerRelation *bmr,
> >         ForkNumber *forkNum,
> >         BlockNumber *blockNum,
> >         ReadBufferMode *mode);
>
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.

I've added some ramp-up logic.  The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet.  Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192             <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072    <--- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to
better ideas.  This way inserts prefetching stalls but ramps up
quickly and is soon out of the way.  I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size?  Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.

A small detour:  While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem.  One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls.  We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access.  But the logic can't see
that another process is making holes in this process's sequence.  The
two obvious solutions are (1) pass in a flag at the start saying "I
promise this is sequential even if it doesn't look like it, no hints
please" and (2) invent "shared" (cross-process) streaming reads, and
teach all the parallel seq scan processes to get their buffers from
there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now.  So perhaps I should implement (1), which would be another
reason to add a flags argument.  It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.

I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure.  The rest are rebased and in my
public Github branch streaming-read (repo macdice/postgres) if anyone
is interested (don't mind the red CI failures, they're just saying I
ran out of monthly CI credits on the 29th, so close...)

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 28/11/2023 14:17, Thomas Munro wrote:
> On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> +     /* Avoid a slightly more expensive kernel call if there is no benefit. */
>>> +     if (iovcnt == 1)
>>> +             returnCode = pg_pread(vfdP->fd,
>>> +                                                       iov[0].iov_base,
>>> +                                                       iov[0].iov_len,
>>> +                                                       offset);
>>> +     else
>>> +             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>>
>> How about pushing down this optimization to pg_preadv() itself?
>> pg_preadv() is currently just a macro if the system provides preadv(),
>> but it could be a "static inline" that does the above dance. I think
>> that optimization is platform-dependent anyway, pread() might not be any
>> faster on some OSs. In particular, if the system doesn't provide
>> preadv() and we use the implementation in src/port/preadv.c, it's the
>> same kernel call anyway.
> 
> Done.  I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already!  I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

LGTM. I think this 0001 patch is ready for commit, independently of the 
rest of the patches.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:

> +/* Filename components */
> +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> +

These seem out of place, we already have them in common/file_utils.h. 
Other than that, 
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and 
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look 
good to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 28/11/2023 14:17, Thomas Munro wrote:
> On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> in streaming_read.h:
>>
>>> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>>>          uintptr_t pgsr_private,
>>>          void *per_io_private,
>>>          BufferManagerRelation *bmr,
>>>          ForkNumber *forkNum,
>>>          BlockNumber *blockNum,
>>>          ReadBufferMode *mode);
>>
>> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
>> each read. I see that you used that in the WAL replay prefetching, so I
>> guess that makes sense.
> 
> In this version I have introduced an alternative simple callback.
> It's approximately what we had already tried out in an earlier version
> before I started streamifying recovery, but in this version you can
> choose, so recovery can opt for the wider callback.

Ok. Having two APIs is a bit redundant, but because most callers would prefer 
the simpler API, that's probably a good tradeoff.

> I've added some ramp-up logic.  The idea is that after we streamify
> everything in sight, we don't want to penalise users that don't really
> need more than one or two blocks, but don't know that yet.  Here is
> how the system calls look when you do pg_prewarm():
> 
> pread64(32, ..., 8192, 0) = 8192             <--- start with just one block
> pread64(32, ..., 16384, 8192) = 16384
> pread64(32, ..., 32768, 24576) = 32768
> pread64(32, ..., 65536, 57344) = 65536
> pread64(32, ..., 131072, 122880) = 131072    <--- soon reading 16 blocks at a time
> pread64(32, ..., 131072, 253952) = 131072
> pread64(32, ..., 131072, 385024) = 131072

> I guess it could be done in quite a few different ways and I'm open to
> better ideas.  This way inserts prefetching stalls but ramps up
> quickly and is soon out of the way.  I wonder if we would want to make
> that a policy that a caller can disable, if you want to skip the
> ramp-up and go straight for the largest possible I/O size?  Then I
> think we'd need a 'flags' argument to the streaming read constructor
> functions.

I think a 'flags' argument and a way to opt-out of the slow start would 
make sense. pg_prewarm in particular knows that it will read the whole 
relation.

> A small detour:  While contemplating how this interacts with parallel
> sequential scan, which also has a notion of ramping up, I noticed
> another problem.  One parallel seq scan process does this:
> 
> fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 35127296) = 131072
> preadv(32, [...], 2, 35258368) = 131072
> fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 36175872) = 131072
> preadv(32, [...], 2, 36306944) = 131072
> ...
> 
> We don't really want those fadvise() calls.  We don't get them with
> parallelism disabled, because streaming_read.c is careful not to
> generate advice for sequential workloads based on ancient wisdom from
> this mailing list, re-confirmed on recent Linux: WILLNEED hints
> actually get in the way of Linux's own prefetching and slow you down,
> so we only want them for truly random access.  But the logic can't see
> that another process is making holes in this process's sequence.

Hmm, aside from making the sequential pattern invisible to this process, 
are we defeating Linux's logic too, just by performing the reads from 
multiple processes? The processes might issue the reads to the kernel 
out-of-order.

How bad is the slowdown when you issue WILLNEED hints on sequential access?

> The two obvious solutions are (1) pass in a flag at the start saying
> "I promise this is sequential even if it doesn't look like it, no
> hints please" and (2) invent "shared" (cross-process) streaming
> reads, and teach all the parallel seq scan processes to get their
> buffers from there.
> 
> Idea (2) is interesting to think about but even if it is a useful idea
> (not sure) it is certainly overkill just to solve this little problem
> for now.  So perhaps I should implement (1), which would be another
> reason to add a flags argument.  It's not a perfect solution though
> because some more 'data driven' parallel scans (indexes, bitmaps, ...)
> have a similar problem that is less amenable to top-down kludgery.

(1) seems fine to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

> I've included just the pg_prewarm example user for now while we
> discuss the basic infrastructure.  The rest are rebased and in my
> public Github branch streaming-read (repo macdice/postgres) if anyone
> is interested (don't mind the red CI failures, they're just saying I
> ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a
user. Then we can start new threads for the various streaming read users
(e.g. vacuum, sequential scan, bitmapheapscan).

> From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sat, 22 Jul 2023 17:31:54 +1200
> Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().
> 
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index f7c67d504c..8ae3a72053 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>          if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
>              flags |= EB_LOCK_FIRST;
>  
> -        return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
> -                                 forkNum, strategy, flags);
> +        *hit = false;
> +
> +        return ExtendBufferedRel(bmr, forkNum, strategy, flags);
>      }
>  
> -    TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> -                                       smgr->smgr_rlocator.locator.spcOid,
> -                                       smgr->smgr_rlocator.locator.dbOid,
> -                                       smgr->smgr_rlocator.locator.relNumber,
> -                                       smgr->smgr_rlocator.backend);
> +    buffer = PrepareReadBuffer(bmr,
> +                               forkNum,
> +                               blockNum,
> +                               strategy,
> +                               hit,
> +                               &allocated);
> +
> +    /* At this point we do NOT hold any locks. */
> +
> +    if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
> +    {
> +        /* if we just want zeroes and a lock, we're done */
> +        ZeroBuffer(buffer, mode);
> +    }
> +    else if (!*hit)
> +    {
> +        /* we might need to perform I/O */
> +        CompleteReadBuffers(bmr,
> +                            &buffer,
> +                            forkNum,
> +                            blockNum,
> +                            1,
> +                            mode == RBM_ZERO_ON_ERROR,
> +                            strategy);
> +    }
> +
> +    return buffer;
> +}
> +
> +/*
> + * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> + * the returned buffer can be used immediately.  Otherwise, a physical read
> + * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> + * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

> + * caller has the opportunity to coalesce reads of neighboring blocks into one
> + * CompleteReadBuffers() call.
> + *
> + * *found is set to true for a hit, and false for a miss.
> + *
> + * *allocated is set to true for a miss that allocates a buffer for the first
> + * time.  If there are multiple calls to PrepareReadBuffer() for the same
> + * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> + * read, then only the first such call will receive *allocated == true, which
> + * the caller might use to issue just one prefetch hint.
> + */
> +Buffer
> +PrepareReadBuffer(BufferManagerRelation bmr,
> +                  ForkNumber forkNum,
> +                  BlockNumber blockNum,
> +                  BufferAccessStrategy strategy,
> +                  bool *found,
> +                  bool *allocated)
> +{
> +    BufferDesc *bufHdr;
> +    bool        isLocalBuf;
> +    IOContext    io_context;
> +    IOObject    io_object;
>  
> +    Assert(blockNum != P_NEW);
> +
> +    if (bmr.rel)
> +    {
> +        bmr.smgr = RelationGetSmgr(bmr.rel);
> +        bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
> +    }
> +
> +    isLocalBuf = SmgrIsTemp(bmr.smgr);
>      if (isLocalBuf)
>      {
> -        /*
> -         * We do not use a BufferAccessStrategy for I/O of temporary tables.
> -         * However, in some cases, the "strategy" may not be NULL, so we can't
> -         * rely on IOContextForStrategy() to set the right IOContext for us.
> -         * This may happen in cases like CREATE TEMPORARY TABLE AS...
> -         */
>          io_context = IOCONTEXT_NORMAL;
>          io_object = IOOBJECT_TEMP_RELATION;
> -        bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
> -        if (found)
> -            pgBufferUsage.local_blks_hit++;
> -        else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> -                 mode == RBM_ZERO_ON_ERROR)
> -            pgBufferUsage.local_blks_read++;
>      }
>      else
>      {
> -        /*
> -         * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
> -         * not currently in memory.
> -         */
>          io_context = IOContextForStrategy(strategy);
>          io_object = IOOBJECT_RELATION;
> -        bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
> -                             strategy, &found, io_context);
> -        if (found)
> -            pgBufferUsage.shared_blks_hit++;
> -        else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> -                 mode == RBM_ZERO_ON_ERROR)
> -            pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing
(avoid counting zeroed buffers as blocks read) by moving this
pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
where you know if you called ZeroBuffer() or CompleteReadBuffers().

>      }
>  
> -    /* At this point we do NOT hold any locks. */
> +    TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> +                                       bmr.smgr->smgr_rlocator.locator.spcOid,
> +                                       bmr.smgr->smgr_rlocator.locator.dbOid,
> +                                       bmr.smgr->smgr_rlocator.locator.relNumber,
> +                                       bmr.smgr->smgr_rlocator.backend);
>  
> -    /* if it was already in the buffer pool, we're done */
> -    if (found)
> +    ResourceOwnerEnlarge(CurrentResourceOwner);
> +    if (isLocalBuf)
> +    {
> +        bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
> +        if (*found)
> +            pgBufferUsage.local_blks_hit++;
> +        else
> +            pgBufferUsage.local_blks_read++;

See comment above.

> +    }
> +    else
> +    {
> +        bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
> +                             strategy, found, allocated, io_context);
> +        if (*found)
> +            pgBufferUsage.shared_blks_hit++;
> +        else
> +            pgBufferUsage.shared_blks_read++;
> +    }
> +    if (bmr.rel)
> +    {
> +        pgstat_count_buffer_read(bmr.rel);

This is double-counting reads. You've left the call in
ReadBufferExtended() as well as adding this here. It should be fine to
remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
the call here in PrepareReadBuffer() wouldn't have an effect on
ReadBuffer_common() callers who don't pass a relation (like recovery).
The other current callers of ReadBuffer_common() (by way of
ExtendBufferedRelTo()) who do pass a relation are visibility map and
freespace map extension, and I don't think we track relation stats for
the VM and FSM.

This does continue the practice of counting zeroed buffers as reads in
table-level stats. But, that is the same as master.

> -     * if we have gotten to this point, we have allocated a buffer for the
> -     * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
> -     * if it's a shared buffer.
> -     */
> -    Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));    /* spinlock not needed */
> +/*
> + * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
> + * cover a cluster of neighboring block numbers.
> + *
> + * Typically this performs one physical vector read covering the block range,
> + * but if some of the buffers have already been read in the meantime by any
> + * backend, zero or multiple reads may be performed.
> + */
> +void
> +CompleteReadBuffers(BufferManagerRelation bmr,
> +                    Buffer *buffers,
> +                    ForkNumber forknum,
> +                    BlockNumber blocknum,
> +                    int nblocks,
> +                    bool zero_on_error,
> +                    BufferAccessStrategy strategy)
> +{
...
> -        pgstat_count_io_op_time(io_object, io_context,
> -                                IOOP_READ, io_start, 1);
> +        /* We found a buffer that we have to read in. */
> +        io_buffers[0] = buffers[i];
> +        io_pages[0] = BufferGetBlock(buffers[i]);
> +        io_first_block = blocknum + i;
> +        io_buffers_len = 1;
>  
> -        /* check for garbage data */
> -        if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
> -                                    PIV_LOG_WARNING | PIV_REPORT_STAT))
> +        /*
> +         * How many neighboring-on-disk blocks can we can scatter-read into
> +         * other buffers at the same time?
> +         */
> +        while ((i + 1) < nblocks &&
> +               CompleteReadBuffersCanStartIO(buffers[i + 1]))
> +        {
> +            /* Must be consecutive block numbers. */
> +            Assert(BufferGetBlockNumber(buffers[i + 1]) ==
> +                   BufferGetBlockNumber(buffers[i]) + 1);
> +
> +            io_buffers[io_buffers_len] = buffers[++i];
> +            io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
> +        }
> +
> +        io_start = pgstat_prepare_io_time();
> +        smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
> +        pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);

I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
will be BLCKSZ and multiplying that by the number of reads should
produce the number of bytes read.
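
I.e. something like:

    pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
                            io_buffers_len);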

> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 4efb34b75a..ee9307b612 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -116,7 +116,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
>   */
>  BufferDesc *
>  LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> -                 bool *foundPtr)
> +                 bool *foundPtr, bool *allocPtr)
>  {
>      BufferTag    newTag;            /* identity of requested block */
>      LocalBufferLookupEnt *hresult;
> @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>          Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
>  
>          *foundPtr = PinLocalBuffer(bufHdr, true);
> +        *allocPtr = false;
>      }
>      else
>      {
> @@ -170,6 +171,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>          pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>  
>          *foundPtr = false;
> +        *allocPtr = true;
>      }

I would prefer you use consistent naming for
allocPtr/allocatedPtr/allocated. I also think that all the functions
taking it as an output argument should explain what it is
(BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
digging around to figure it out. You have a nice comment about it above
PrepareReadBuffer(). I think you may need to resign yourself to
restating that bit (or some version of it) for all of the functions
taking it as an argument.

>  
> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 41e26d3e20..e29ca85077 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -14,6 +14,8 @@
>  #ifndef BUFMGR_H
>  #define BUFMGR_H
>  
> +#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this
in bufmgr.c (where it's actually needed) instead of including it here.

> +#include "port/pg_iovec.h"
>  #include "storage/block.h"
>  #include "storage/buf.h"
>  #include "storage/bufpage.h"
> @@ -47,6 +49,8 @@ typedef enum
>      RBM_ZERO_AND_CLEANUP_LOCK,    /* Like RBM_ZERO_AND_LOCK, but locks the page
>                                   * in "cleanup" mode */
>      RBM_ZERO_ON_ERROR,            /* Read, but return an all-zeros page on error */

> +    RBM_WILL_ZERO,                /* Don't read from disk, caller will call
> +                                 * ZeroBuffer() */

It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
isn't used in this commit.

>      RBM_NORMAL_NO_LOG,            /* Don't log page as invalid during WAL
>                                   * replay; otherwise same as RBM_NORMAL */
>  } ReadBufferMode;

- Melanie



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> LGTM. I think this 0001 patch is ready for commit, independently of the
> rest of the patches.

Done.

> In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
>
> > +/* Filename components */
> > +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> > +
>
> These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase.  Fixed.

> Other than that,
> v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
> v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
> good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
callers believe that short writes set errno: they error out with a
message including %m.  We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier).  In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error.  Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point).  I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change.  The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short.  Thoughts?
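
Spelled out, FileWriteV() ends up with roughly this shape (a sketch of
the idea, not the exact patch):

    errno = 0;
    returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);

    /*
     * Preserve the traditional promise that a short write sets errno,
     * without having to compute the total request size: if the kernel
     * didn't set errno, blame ENOSPC unconditionally.  When the write
     * wasn't short the caller never looks at errno, so the bogus value
     * does no harm.
     */
    if (returnCode >= 0 && errno == 0)
        errno = ENOSPC;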

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 29/11/2023 21:39, Thomas Munro wrote:
> One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> callers believe that short writes set errno: they error out with a
> message including %m.  We have historically set errno = ENOSPC inside
> FileWrite() if the write size was unexpectedly small AND the kernel
> didn't set errno to a non-zero value (having set it to zero ourselves
> earlier).  In FileWriteV(), I didn't want to do that because it is
> expensive to compute the total write size from the vector array and we
> managed to measure an effect due to that in some workloads.
> 
> Note that the smgr patch actually handles short writes by continuing,
> instead of raising an error.  Short writes do already occur in the
> wild on various systems for various rare technical reasons other than
> ENOSPC I have heard (imagine transient failure to acquire some
> temporary memory that the kernel chooses not to wait for, stuff like
> that, though certainly many people and programs believe they should
> not happen[1]), and it seems like a good idea to actually handle them
> as our write sizes increase and the probability of short writes might
> presumably increase.

Maybe we should bite the bullet and always retry short writes in 
FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an 
extra argument? Presumably it is cheap for the callers to calculate at 
the same time that they build the iovec array?

> With the previous version of the patch, we'd have to change a couple
> of other callers not to believe that short writes are errors and set
> errno (callers are inconsistent on this point).  I don't really love
> that we have "fake" system errors but I also want to stay focused
> here, so in this new version V3 I tried a new approach: I realised I
> can just always set errno without needing the total size, so that
> (undocumented) aspect of the interface doesn't change.  The point
> being that it doesn't matter if you clobber errno with a bogus value
> when the write was non-short.  Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 29/11/2023 21:39, Thomas Munro wrote:
> > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > callers believe that short writes set errno: they error out with a
> > message including %m.  We have historically set errno = ENOSPC inside
> > FileWrite() if the write size was unexpectedly small AND the kernel
> > didn't set errno to a non-zero value (having set it to zero ourselves
> > earlier).  In FileWriteV(), I didn't want to do that because it is
> > expensive to compute the total write size from the vector array and we
> > managed to measure an effect due to that in some workloads.
> >
> > Note that the smgr patch actually handles short writes by continuing,
> > instead of raising an error.  Short writes do already occur in the
> > wild on various systems for various rare technical reasons other than
> > ENOSPC I have heard (imagine transient failure to acquire some
> > temporary memory that the kernel chooses not to wait for, stuff like
> > that, though certainly many people and programs believe they should
> > not happen[1]), and it seems like a good idea to actually handle them
> > as our write sizes increase and the probability of short writes might
> > presumably increase.
>
> Maybe we should bite the bullet and always retry short writes in
> FileWriteV(). Is that what you meant by "handling them"?
> If the total size is expensive to calculate, how about passing it as an
> extra argument? Presumably it is cheap for the callers to calculate at
> the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there.  If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC".  What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

> > With the previous version of the patch, we'd have to change a couple
> > of other callers not to believe that short writes are errors and set
> > errno (callers are inconsistent on this point).  I don't really love
> > that we have "fake" system errors but I also want to stay focused
> > here, so in this new version V3 I tried a new approach: I realised I
> > can just always set errno without needing the total size, so that
> > (undocumented) aspect of the interface doesn't change.  The point
> > being that it doesn't matter if you clobber errno with a bogus value
> > when the write was non-short.  Thoughts?
>
> Feels pretty ugly, but I don't see anything outright wrong with that.

Cool.  I would consider cleaning up all the callers and get rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence desire to preserve existing behaviour.



Re: Streaming I/O, vectored I/O (WIP)

From
Andres Freund
Date:
Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 29/11/2023 21:39, Thomas Munro wrote:
> > > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > > callers believe that short writes set errno: they error out with a
> > > message including %m.  We have historically set errno = ENOSPC inside
> > > FileWrite() if the write size was unexpectedly small AND the kernel
> > > didn't set errno to a non-zero value (having set it to zero ourselves
> > > earlier).  In FileWriteV(), I didn't want to do that because it is
> > > expensive to compute the total write size from the vector array and we
> > > managed to measure an effect due to that in some workloads.
> > >
> > > Note that the smgr patch actually handles short writes by continuing,
> > > instead of raising an error.  Short writes do already occur in the
> > > wild on various systems for various rare technical reasons other than
> > > ENOSPC I have heard (imagine transient failure to acquire some
> > > temporary memory that the kernel chooses not to wait for, stuff like
> > > that, though certainly many people and programs believe they should
> > > not happen[1]), and it seems like a good idea to actually handle them
> > > as our write sizes increase and the probability of short writes might
> > > presumably increase.
> >
> > Maybe we should bite the bullet and always retry short writes in
> > FileWriteV(). Is that what you meant by "handling them"?
> > If the total size is expensive to calculate, how about passing it as an
> > extra argument? Presumably it is cheap for the callers to calculate at
> > the same time that they build the iovec array?
> 
> It's cheap for md.c, because it already has nblocks_this_segment.
> That's one reason I chose to put the retry there.  If we push it down
> to fd.c in order to be able to help other callers, you're right that
> we could pass in the total size (and I guess assert that it's
> correct), but that is sort of annoyingly redundant and further from
> the interface we're wrapping.

> There is another problem with pushing it down to fd.c, though.
> Suppose you try to write 8192 bytes, and the kernel says "you wrote
> 4096 bytes" so your loop goes around again with the second half the
> data and now the kernel says "-1, ENOSPC".  What are you going to do?
> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> so you'd either have to return -1, ENOSPC (converting short writes
> into actual errors, a lie because you did write some data), or return
> 4096 (and possibly also set errno = ENOSPC as we have always done).
> So you can't really handle this problem at this level, can you?
> Unless you decide that fd.c should get into the business of raising
> errors for I/O failures, which would be a bit of a departure.
> 
> That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Greetings,

Andres Freund



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > Maybe we should bite the bullet and always retry short writes in
> > > FileWriteV(). Is that what you meant by "handling them"?
> > > If the total size is expensive to calculate, how about passing it as an
> > > extra argument? Presumably it is cheap for the callers to calculate at
> > > the same time that they build the iovec array?

> > There is another problem with pushing it down to fd.c, though.
> > Suppose you try to write 8192 bytes, and the kernel says "you wrote
> > 4096 bytes" so your loop goes around again with the second half the
> > data and now the kernel says "-1, ENOSPC".  What are you going to do?
> > fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> > so you'd either have to return -1, ENOSPC (converting short writes
> > into actual errors, a lie because you did write some data), or return
> > 4096 (and possibly also set errno = ENOSPC as we have always done).
> > So you can't really handle this problem at this level, can you?
> > Unless you decide that fd.c should get into the business of raising
> > errors for I/O failures, which would be a bit of a departure.
> >
> > That's why I did the retry higher up in md.c.
>
> I think that's the right call. I think for AIO we can't do retry handling
> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
> buy us that much in md.c anyway, we still need to handle the cross segment
> case and such, from what I can tell?

Heikki, what do you think about this:  we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno.  Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful?  I wouldn't want them for
md.c though because I already know the size.
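
To make the tradeoff concrete, here is a plain-POSIX sketch of the kind of
retry loop under discussion (illustration only, not the actual
pg_pwritev_with_retry() code; the function name is made up): it loops until
the whole iovec array has been transferred, walks the array forward after a
short write -- note that it scribbles on the caller's iovec array to do so --
and falls back to the traditional ENOSPC guess if the kernel reports no
progress and no error.

#define _GNU_SOURCE             /* for pwritev() on glibc */
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Illustrative retry loop: write the whole iovec array or fail with -1 and
 * errno set.  Adjusts the caller's iov array as it walks forward.
 */
static int
pwritev_all_sketch(int fd, struct iovec *iov, int iovcnt, off_t offset)
{
    while (iovcnt > 0)
    {
        ssize_t     n = pwritev(fd, iov, iovcnt, offset);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;          /* real error, e.g. ENOSPC */
        }
        if (n == 0)
        {
            /* No progress and no errno: the traditional ENOSPC guess. */
            errno = ENOSPC;
            return -1;
        }
        offset += n;

        /* Drop iovecs that were written completely... */
        while (iovcnt > 0 && (size_t) n >= iov->iov_len)
        {
            n -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ... and trim the one that was written partially, if any. */
        if (iovcnt > 0)
        {
            iov->iov_base = (char *) iov->iov_base + n;
            iov->iov_len -= (size_t) n;
        }
    }
    return 0;
}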



Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 09/12/2023 02:41, Thomas Munro wrote:
> On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
>>> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> Maybe we should bite the bullet and always retry short writes in
>>>> FileWriteV(). Is that what you meant by "handling them"?
>>>> If the total size is expensive to calculate, how about passing it as an
>>>> extra argument? Presumably it is cheap for the callers to calculate at
>>>> the same time that they build the iovec array?
> 
>>> There is another problem with pushing it down to fd.c, though.
>>> Suppose you try to write 8192 bytes, and the kernel says "you wrote
>>> 4096 bytes" so your loop goes around again with the second half the
>>> data and now the kernel says "-1, ENOSPC".  What are you going to do?
>>> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
>>> so you'd either have to return -1, ENOSPC (converting short writes
>>> into actual errors, a lie because you did write some data), or return
>>> 4096 (and possibly also set errno = ENOSPC as we have always done).
>>> So you can't really handle this problem at this level, can you?
>>> Unless you decide that fd.c should get into the business of raising
>>> errors for I/O failures, which would be a bit of a departure.
>>>
>>> That's why I did the retry higher up in md.c.
>>
>> I think that's the right call. I think for AIO we can't do retry handling
>> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
>> buy us that much in md.c anyway, we still need to handle the cross segment
>> case and such, from what I can tell?
> 
> Heikki, what do you think about this:  we could go with the v3 fd.c
> and md.c patches, but move adjust_iovec_for_partial_transfer() into
> src/common/file_utils.c, so that at least that slightly annoying part
> of the job is available for re-use by future code that faces the same
> problem?

Ok, works for me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Sat, Dec 9, 2023 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok, works for me.

I finished up making a few more improvements:

1.  I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

2.  FileReadV/FileWriteV patch:

 * further simplification of the traditional ENOSPC 'guess'
 * unconstify() changed to raw cast (pending [1])
 * fixed the DO_DB()-wrapped debugging code

3.  smgrreadv/smgrwritev patch:

 * improved ENOSPC handling
 * improve description of EOF and ENOSPC handling
 * fixed the sizes reported in dtrace static probes
 * fixed some words in the docs about that
 * changed error messages to refer to "blocks %u..%u"

4.  The smgrprefetch-with-nblocks patch has no changes; it hasn't drawn any
comments, hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W%2BfyDSw%40mail.gmail.com

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 11/12/2023 11:12, Thomas Munro wrote:
> 1.  I eventually figured out how to generalise
> compute_remaining_iovec() (as I now call it) so that the existing
> pg_pwritev_with_retry() in file_utils.c could also use it, so that's
> now done in a patch of its own.

In compute_remaining_iovec():
> 'source' and 'destination' may point to the same array, in which
> case it is adjusted in-place; otherwise 'destination' must have enough
> space for 'iovcnt' elements.
Is there any use case for not adjusting it in place? 
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove 
the 'const' and document that it scribbles on it?

> I'm planning to commit these fairly soon.

+1

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Mon, Dec 11, 2023 at 10:28 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 11/12/2023 11:12, Thomas Munro wrote:
> > 1.  I eventually figured out how to generalise
> > compute_remaining_iovec() (as I now call it) so that the existing
> > pg_pwritev_with_retry() in file_utils.c could also use it, so that's
> > now done in a patch of its own.
>
> In compute_remaining_iovec():
> > 'source' and 'destination' may point to the same array, in which
> > case it is adjusted in-place; otherwise 'destination' must have enough
> > space for 'iovcnt' elements.
> Is there any use case for not adjusting it in place?
> pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
> the 'const' and document that it scribbles on it?

I guess I just wanted to preserve pg_pwritev_with_retry()'s existing
prototype, primarily because it matches standard pwritev()/writev().
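
For readers following along, here is a small illustrative sketch of the
contract quoted above (this is not the committed function; the name and exact
signature here are assumptions for illustration): given the bytes already
transferred, it fills 'destination' with whatever still needs to be
transferred, and 'source' and 'destination' may be the same array for
in-place adjustment.

#include <stddef.h>
#include <sys/uio.h>

/*
 * Sketch only: compute the iovecs still to be transferred after a short
 * read or write.  'source' and 'destination' may point to the same array,
 * in which case it is adjusted in place; otherwise 'destination' must have
 * room for 'iovcnt' elements.  Returns the number of remaining iovecs.
 */
static int
remaining_iovec_sketch(struct iovec *destination,
                       const struct iovec *source,
                       int iovcnt,
                       size_t transferred)
{
    int         remaining = 0;

    for (int i = 0; i < iovcnt; i++)
    {
        if (transferred >= source[i].iov_len)
        {
            /* This element was transferred in full; skip it. */
            transferred -= source[i].iov_len;
            continue;
        }

        /* Keep the untransferred tail of this element and the rest. */
        destination[remaining].iov_base =
            (char *) source[i].iov_base + transferred;
        destination[remaining].iov_len = source[i].iov_len - transferred;
        transferred = 0;
        remaining++;
    }
    return remaining;
}

A caller that wants to keep a const source array can copy it once into a
scratch array and then adjust that scratch array in place on later
iterations, which is how a const pwritev()-style prototype can be preserved.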



Re: Streaming I/O, vectored I/O (WIP)

From
Cédric Villemain
Date:
Le 11/12/2023 à 10:12, Thomas Munro a écrit :
> 3.  smgrreadv/smgrwritev patch:
> 
>   * improved ENOSPC handling
>   * improve description of EOF and ENOSPC handling
>   * fixed the sizes reported in dtrace static probes
>   * fixed some words in the docs about that
>   * changed error messages to refer to "blocks %u..%u"
> 
> 4.  smgrprefetch-with-nblocks patch has no change, hasn't drawn any
> comments hopefully because it is uncontroversial.
> 
> I'm planning to commit these fairly soon.


Thanks, very useful additions.
Not sure what you already have planned to come next...

I have 2 small patches here:
* to use range prefetch in pg_prewarm (smgrprefetch only at the moment, 
using smgrreadv to come next).
* to support nblocks=0 in smgrprefetch (posix_fadvise supports a len=0 
to apply the flag from the offset to the end of the file).
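
For reference, the posix_fadvise() behaviour relied on by the second patch
is just this (a tiny standalone sketch, not the patch itself; the function
name is made up): POSIX defines a zero length as "all data following the
offset".

#include <fcntl.h>
#include <sys/types.h>

/*
 * Sketch: advise the kernel that everything from 'offset' to the end of
 * the file will be needed soon.  len == 0 means "to end of file".
 */
static int
advise_willneed_to_eof(int fd, off_t offset)
{
    return posix_fadvise(fd, offset, 0, POSIX_FADV_WILLNEED);
}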

Should I add them to the commitfest?
---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D



Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
I've written a new version of the vacuum streaming read user on top of
the rebased patch set [1]. It differs substantially from Andres' and
includes several refactoring patches that can apply on top of master.
As such, I've proposed those in a separate thread [2]. I noticed macOS
and Windows fail to build on CI for my branch with the streaming read
code. I haven't had a chance to investigate -- but I must have done
something wrong during the rebase.

- Melanie

[1] https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read
[2] https://www.postgresql.org/message-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Thanks for posting a new version. I've included a review of 0004.

Thanks!  I committed the patches up as far as smgr.c before the
holidays.  The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001.  Here's my response to your review of
that, which triggered quite a few changes.

See also new version of the streaming_read.c patch, with change list
at end.  (I'll talk about v5-0002, the SMgrRelation lifetime one, over
on the separate thread about that where Heikki posted a better
version.)

> ot -> to

Fixed.

> > -             if (found)
> > -                     pgBufferUsage.shared_blks_hit++;
> > -             else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> > -                              mode == RBM_ZERO_ON_ERROR)
> > -                     pgBufferUsage.shared_blks_read++;
>
> You've lost this test in your new version. You can do the same thing
> (avoid counting zeroed buffers as blocks read) by moving this
> pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
> where you know if you called ZeroBuffer() or CompleteReadBuffers().

Yeah, right.

After thinking about that some more, that's true and that placement
would be good, but only if we look just at this patch, where we are
chopping ReadBuffer() into two parts (PrepareReadBuffer() and
CompleteReadBuffers()), and then putting them back together again.
However, soon we'll want to use the two functions separately, and we
won't call ReadBuffer[_common]().

New idea: PrepareReadBuffer() can continue to be in charge of bumping
{local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in
CompleteReadBuffers().  There is no counter for zeroed buffers, but if
there ever is in the future, it can go into ZeroBuffer().

In this version, I've moved that into CompleteReadBuffers(), along
with a new comment to explain a pre-existing deficiency in the whole
scheme: there is a race where you finish up counting a read but
someone else actually does the read, and also counts it.  I'm trying
to preserve the existing bean counting logic to the extent possible
across this refactoring.

> > +     }
> > +     else
> > +     {
> > +             bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
> > +                                                      strategy, found, allocated, io_context);
> > +             if (*found)
> > +                     pgBufferUsage.shared_blks_hit++;
> > +             else
> > +                     pgBufferUsage.shared_blks_read++;
> > +     }
> > +     if (bmr.rel)
> > +     {
> > +             pgstat_count_buffer_read(bmr.rel);
>
> This is double-counting reads. You've left the call in
> ReadBufferExtended() as well as adding this here. It should be fine to
> remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
> the call here in PrepareReadBuffer() wouldn't have an effect on
> ReadBuffer_common() callers who don't pass a relation (like recovery).
> The other current callers of ReadBuffer_common() (by way of
> ExtendBufferedRelTo()) who do pass a relation are visibility map and
> freespace map extension, and I don't think we track relation stats for
> the VM and FSM.

Oh yeah.  Right.  Fixed.

> This does continue the practice of counting zeroed buffers as reads in
> table-level stats. But, that is the same as master.

Right.  It is a little strange that pgstat_count_buffer_read()
finishes up in a different location than
pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to
this pre-existing difference in accounting policy.  That generally
seems like POLA failure, so I've added a comment to help us remember
about that, for another day.

> > +             io_start = pgstat_prepare_io_time();
> > +             smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
> > +             pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
>
> I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
> will be BLCKSZ and multiplying that by the number of reads should
> produce the number of bytes read.

OK, thanks, fixed.

> >  BufferDesc *
> >  LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> > -                              bool *foundPtr)
> > +                              bool *foundPtr, bool *allocPtr)
> >  {
> >       BufferTag       newTag;                 /* identity of requested block */
> >       LocalBufferLookupEnt *hresult;
> > @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >               Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
> >
> >               *foundPtr = PinLocalBuffer(bufHdr, true);
> > +             *allocPtr = false;
> ...
>
> I would prefer you use consistent naming for
> allocPtr/allocatedPtr/allocated. I also think that all the functions
> taking it as an output argument should explain what it is
> (BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
> digging around to figure it out. You have a nice comment about it above
> PrepareReadBuffer(). I think you may need to resign yourself to
> restating that bit (or some version of it) for all of the functions
> taking it as an argument.

I worked on addressing your complaints, but while doing so I had
second thoughts about even having that argument.  The need for it came
up while working on streaming recovery.  Without it, whenever there
were repeated references to the same block (as happens very often in
the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I
wanted to find a way to distinguish the *first* miss for a given
block, to be able to issue the advice just once before the actual read
happens.

But now I see that other users of streaming reads probably won't
repeatedly stream the same block, and I am not proposing streaming
recovery for PG 17.  Time to simplify.  I decided to kick allocatedPtr
out to think about it some more, but I really hope we'll be able to start
a real background I/O instead of issuing advice in PG 18 proposals,
and that'll be negotiated via IO_IN_PROGRESS, so then we'd never need
allocatedPtr.

Thus, removed.

> >  #ifndef BUFMGR_H
> >  #define BUFMGR_H
> >
> > +#include "pgstat.h"
>
> I don't know what we are supposed to do, but I would have included this
> in bufmgr.c (where I actually needed it) instead of including it here.

Fixed.

> > +#include "port/pg_iovec.h"
> >  #include "storage/block.h"
> >  #include "storage/buf.h"
> >  #include "storage/bufpage.h"
> > @@ -47,6 +49,8 @@ typedef enum
> >       RBM_ZERO_AND_CLEANUP_LOCK,      /* Like RBM_ZERO_AND_LOCK, but locks the page
> >                                                                * in "cleanup" mode */
> >       RBM_ZERO_ON_ERROR,                      /* Read, but return an all-zeros page on error */
>
> > +     RBM_WILL_ZERO,                          /* Don't read from disk, caller will call
> > +                                                              * ZeroBuffer() */
>
> It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
> isn't used in this commit.

Yeah.  Removed.

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

I now also have a much simplified version of the streaming read patch.
The short version is that it has all advanced features removed, so
that now it *only* does the clustering required to build up large
CompleteReadBuffers() calls.  That's short and sweet, and enough for
pg_prewarm to demonstrate 128KB reads, a good first step.

Then WILLNEED advice, and then ramp-up are added as separate patches,
for easier review.  I've got some more patches in development that
would re-add "extended" multi-relation mode with wider callback that
can also stream zeroed buffers, as required so far only by recovery --
but I can propose that later.

Other improvements:

* Melanie didn't like the overloaded term "cluster" (off-list
feedback).  Now I use "block range" to describe a range of contiguous
blocks (blocknum, nblocks).
* KK didn't like "prefetch" being used with various meanings (off-list
feedback).  Better words found.
* pgsr_private might as well be void *; uintptr_t is theoretically
more general but doesn't seem to buy much for realistic use.
* per_io_data might as well be called per_buffer_data (it's not per
"I/O" and that isn't the level of abstraction for this API anyway
which is about blocks and buffers).
* I reordered some function arguments that jumped out after the above changes.
* Once I started down the path of using a flags field to control
various policy stuff as discussed up-thread, it started to seem
clearer that callers probably shouldn't directly control I/O depth,
which was all a bit ad-hoc and unfinished before.  I think we'd likely
want that to be centralised.  New idea: it should be enough to be able
to specify policies as required now and in future with flags.
Thoughts?

I've also included 4 of the WIP patches from earlier (based on an
obsolete version of the vacuum thing, sorry I know you have a much
better one now), which can be mostly ignored.  Here I just wanted to
share working code to drive vectored reads with different access
patterns, and also show the API interactions and how they might each
set flag bits.  For example, parallel seq scan uses
PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite
appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read
the whole relation so there is no point in ramping up, and the
vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based
on maintenance_io_concurrency instead of effective_io_concurrency.
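
To give a feel for the shape of the API from a user's point of view, here is
a hypothetical pg_prewarm-style sketch.  The stream construction and
next-buffer function names below are placeholders, not the actual WIP API,
and the signatures are assumptions; only the concepts shown -- a block number
callback that returns InvalidBlockNumber at the end, a pgsr_private pointer,
per-stream flags like PGSR_FLAG_FULL, and buffers pulled from the stream one
at a time -- come from the patches in this thread.

/* Placeholder sketch, not the actual streaming read API. */
#include "postgres.h"
#include "storage/block.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

typedef struct PrewarmStreamState
{
    BlockNumber next_block;     /* next block to hand to the stream */
    BlockNumber nblocks;        /* total blocks in the relation fork */
} PrewarmStreamState;

/*
 * Block number callback: the streaming machinery calls this to learn which
 * block it should read next, and stops when it sees InvalidBlockNumber.
 */
static BlockNumber
prewarm_stream_next_block(void *pgsr_private, void *per_buffer_data)
{
    PrewarmStreamState *state = (PrewarmStreamState *) pgsr_private;

    if (state->next_block >= state->nblocks)
        return InvalidBlockNumber;
    return state->next_block++;
}

/*
 * Consumer loop: pull pinned, valid buffers out of the stream until it is
 * exhausted.  StreamingRead and the streaming_read_*() calls are invented
 * names standing in for whatever the WIP API calls them.
 */
static void
prewarm_fork_sketch(Relation rel, ForkNumber forknum)
{
    PrewarmStreamState state;
    StreamingRead *stream;      /* hypothetical handle type */

    state.next_block = 0;
    state.nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);

    /* PGSR_FLAG_FULL: we intend to read the whole relation, skip ramp-up. */
    stream = streaming_read_begin(rel, forknum, PGSR_FLAG_FULL,
                                  prewarm_stream_next_block, &state);

    for (;;)
    {
        Buffer      buf = streaming_read_next_buffer(stream);

        if (!BufferIsValid(buf))
            break;              /* stream exhausted */
        ReleaseBuffer(buf);     /* pg_prewarm only wants the page cached */
    }

    streaming_read_end(stream);
}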

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 10/01/2024 06:13, Thomas Munro wrote:
> Bikeshedding call: I am open to better suggestions for the names
> PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> grammatically clumsy.

How will these functions work in the brave new async I/O world? I assume 
PrepareReadBuffer() will initiate the async I/O, and 
CompleteReadBuffers() will wait for it to complete. How about 
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and 
WaitBufferRead()?

About the signature of those functions: Does it make sense for 
CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector 
of buffers? If StartReadBuffer() initiates the async I/O immediately, is 
there any benefit to batching the waiting?

If StartReadBuffer() starts the async I/O, the idea that you can call 
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think 
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for 
you in RBM_ZERO_* modes.


Putting all that together, I propose:

/*
  * Initiate reading a block from disk to the buffer cache.
  *
  * XXX: Until we have async I/O, this just allocates the buffer in the buffer
  * cache. The actual I/O happens in WaitReadBuffer().
  */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
                ForkNumber forkNum,
                BlockNumber blockNum,
                BufferAccessStrategy strategy,
                ReadBufferMode mode,
                bool *foundPtr);

/*
  * Wait for a read that was started earlier with StartReadBuffer() to finish.
  *
  * XXX: Until we have async I/O, this is the function that actually performs
  * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
  * will try to perform all of them in one syscall. Subsequent calls to
  * WaitReadBuffer(), for those other buffers, will finish quickly.
  */
void
WaitReadBuffer(Buffer buf);


I'm not sure how well this fits with the streaming read API. The 
streaming read code performs grouping of adjacent blocks to one 
CompleteReadBuffers() call. If WaitReadBuffer() does the batching, 
that's not really required. But does that make sense with async I/O? 
With async I/O, will you need a vectorized version of StartReadBuffer() too?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 10/01/2024 06:13, Thomas Munro wrote:
> > Bikeshedding call: I am open to better suggestions for the names
> > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> > grammatically clumsy.
>
> How will these functions work in the brave new async I/O world? I assume
> PrepareReadBuffer() will initiate the async I/O, and
> CompleteReadBuffers() will wait for it to complete. How about
> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
> WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

 * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
 * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
 * CompleteReadBuffers() -> waits, completes as necessary

In the proposed synchronous version, the middle step is missing, but
streaming_read.c directly calls smgrprefetch() instead.  I thought
about shoving that inside a prosthetic StartReadBuffers() function,
but I backed out of simulating asynchronous I/O too fancifully.  The
streaming read API is where we really want to stabilise a nice API, so
we can move things around behind it if required.

A bit of analysis of the one block -> nblocks change and the
synchronous -> asynchronous change:

Two things are new in a world with nblocks > 1.  (1) We needed to be
able to set BM_IO_IN_PROGRESS on more than one block at a time, but
commit 12f3867f already provided that, and (2) someone else might come
along and read a block in the middle of our range, effectively
chopping our range into subranges.  That's true also in master but
when nblocks == 1 that's all-or-nothing, and now we have partial
cases.  In the proposed synchronous code, CompleteReadBuffers() claims
as many contiguous BM_IO_IN_PROGRESS flags as it can in the range, and
then loops to process the rest, skipping over any blocks that are already
done.  Further down in md.c, you might also cross a segment boundary.
So that's two different reasons why a single call to
CompleteReadBuffers() might finish up generating zero or more than one
I/O system call, though very often it's one.
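
For illustration, the segment clamping is conceptually just this (a
simplified sketch, not md.c itself; RELSEG_SIZE is the real
blocks-per-segment constant, but the function name is made up):

#include "postgres.h"
#include "storage/block.h"

/*
 * Sketch of the "nblocks_this_segment" idea: md.c stores a relation fork in
 * RELSEG_SIZE-block segment files, so one vectored I/O cannot cross a
 * segment boundary and must be clamped like this.
 */
static BlockNumber
blocks_in_this_segment_sketch(BlockNumber blocknum, BlockNumber nblocks)
{
    BlockNumber remaining = RELSEG_SIZE - (blocknum % RELSEG_SIZE);

    return Min(nblocks, remaining);
}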

Hmm, while spelling that out, I noticed an obvious problem and
improvement to make to that part of v5.  If backend #1 is trying to
read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but
backend #2 comes along and starts reading block 102 first, backend
#1's StartBufferIO() call would wait for 102's I/O CV while it still
holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third
backend #3 that wants to read block 101 even though no I/O is in
progress for that block yet!  At least that's deadlock free (because
always in block order), but it seems like undesirable lock chaining.
Here is my proposed improvement: StartBufferIO() gains a nowait flag.
For the head block we wait, but while trying to build a larger range
we don't.  We'll try 102 again in the next loop, with a wait.  Here is
a small fixup for that.

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs).  One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary.  I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one <handle
thing>.

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet.  Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged.  While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal lacks
the whole subsystem that could do that by magic.)

> About the signature of those functions: Does it make sense for
> CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
> of buffers? If StartReadBuffer() initiates the async I/O immediately, is
> there any benefit to batching the waiting?
>
> If StartReadBuffer() starts the async I/O, the idea that you can call
> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
> you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time
now.  I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the idea that it probably shouldn't like
this:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL.  In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended().  But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode.  But that feels just wrong
-- we'd be using RBM but implementing them only partially.

Another way to put it is that ReadBufferMode actually conflates a
bunch of different behaviour that applies at different times that we
are now separating, and recovery reveals this most clearly because it
doesn't have all the information needed while looking ahead.  It might
be possible to shove more information in the WAL to fix the
information problem, but it seemed more natural to me to separate the
aspects of ReadBufferMode, because that isn't the only problem:
groupwise processing of locks doesn't even make sense.

So I teased a couple of those aspects out into separate flags, for example:

The streaming read interface has two variants: the "simple" implicit
RBM_NORMAL, single relation, single fork version that is used in most
client examples and probably everything involving the executor, and
the "extended" version (like the earlier versions in this thread,
removed for now based on complaints that most early uses don't use it,
will bring it back separately later with streaming recovery patches).
In the extended version, the streaming callback can set *will_zero =
true, which is about all the info that recovery can figure out from
the WAL anyway, and then XLogReadBufferForRedoExtended() will later
call ZeroBuffer() because at that time we have the ReadBufferMode.

The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.

> Putting all that together, I propose:
>
> /*
>   * Initiate reading a block from disk to the buffer cache.
>   *
>   * XXX: Until we have async I/O, this just allocates the buffer in the buffer
>   * cache. The actual I/O happens in WaitReadBuffer().
>   */
> Buffer
> StartReadBuffer(BufferManagerRelation bmr,
>                                 ForkNumber forkNum,
>                                 BlockNumber blockNum,
>                                 BufferAccessStrategy strategy,
>                                 ReadBufferMode mode,
>                                 bool *foundPtr);
>
> /*
>   * Wait for a read that was started earlier with StartReadBuffer() to finish.
>   *
>   * XXX: Until we have async I/O, this is the function that actually performs
>   * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
>   * will try to perform all of them in one syscall. Subsequent calls to
>   * WaitReadBuffer(), for those other buffers, will finish quickly.
>   */
> void
> WaitReadBuffer(Buffer buf);

I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O.  In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, and when you try to wait, if it hasn't
already been submitted to the kernel/bgworkers yet it will be, and
before that merging will happen.  So you get bigger I/Os without
having to say so explicitly.
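
To illustrate that merging idea in isolation (this is not code from the aio
branch; the struct and function here are invented for the example), adjacent
pending reads on the same file can be folded together before submission:

#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Made-up representation of one pending read in a backend's list. */
typedef struct PendingReadSketch
{
    int         fd;
    off_t       offset;
    size_t      length;
} PendingReadSketch;

/*
 * If 'next' starts exactly where 'prev' ends in the same file, fold it into
 * 'prev' so that one larger I/O is submitted instead of two.
 */
static bool
try_merge_pending_sketch(PendingReadSketch *prev,
                         const PendingReadSketch *next)
{
    if (prev->fd == next->fd &&
        prev->offset + (off_t) prev->length == next->offset)
    {
        prev->length += next->length;
        return true;
    }
    return false;
}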

For this synchronous version (and hopefully soon a more efficient
improved version in the AIO branch), we want to take advantage of the
client's pre-existing and more cheaply obtained knowledge of ranges.

> I'm not sure how well this fits with the streaming read API. The
> streaming read code performs grouping of adjacent blocks to one
> CompleteReadBuffers() call. If WaitReadBuffer() does the batching,
> that's not really required. But does that make sense with async I/O?

> With async I/O, will you need a vectorized version of StartReadBuffer() too?

I think so, yes.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 11/01/2024 05:19, Thomas Munro wrote:
> On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 10/01/2024 06:13, Thomas Munro wrote:
>>> Bikeshedding call: I am open to better suggestions for the names
>>> PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
>>> grammatically clumsy.
>>
>> How will these functions work in the brave new async I/O world? I assume
>> PrepareReadBuffer() will initiate the async I/O, and
>> CompleteReadBuffers() will wait for it to complete. How about
>> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
>> WaitBufferRead()?
> 
> What we have imagined so far is that the asynchronous version would
> probably have three steps, like this:
> 
>   * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
>   * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
>   * CompleteReadBuffers() -> waits, completes as necessary

Ok. It feels surprising to have three steps. I understand that you need 
two steps, one to start the I/O and another to wait for them to finish, 
but why do you need separate Prepare and Start steps? What can you do in 
between them? (You explained that. I'm just saying that that's my 
initial reaction when seeing that API. It is surprising.)

>> If StartReadBuffer() starts the async I/O, the idea that you can call
>> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
>> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
>> you in RBM_ZERO_* modes.
> 
> Yeah, good thoughts, and topics that have occupied me for some time
> now.  I also thought that StartReadBuffer() should take
> ReadBufferMode, but I came to the idea that it probably shouldn't like
> this:
> 
> I started working on all this by trying to implement the most
> complicated case I could imagine, streaming recovery, and then working
> back to the easy cases that just do scans with RBM_NORMAL.  In
> recovery, we can predict that a block will be zeroed using WAL flags,
> and pre-existing cross-checks at redo time that enforce that the flags
> and redo code definitely agree on that, but we can't predict which
> exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
> RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
> get_cleanup_lock=true, as the comment warns them not to, but I
> digress).
> 
> That's OK, because we can't take locks while looking ahead in recovery
> anyway (the redo routine carefully controls lock order/protocol), so
> the code to actually do the locking needs to be somewhere near the
> output end of the stream when the redo code calls
> XLogReadBufferForRedoExtended().  But if you try to use RBM_XXX in
> these interfaces, it begins to look pretty funny: the streaming
> callback needs to be able to say which ReadBufferMode, but anywhere
> near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
> need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
> the zero modes to use, and then the redo routine needs to pass in the
> RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
> XLogReadBufferForRedoExtended() and it does it in a separate step
> anyway, so we are ignoring ReadBufferMode.  But that feels just wrong
> -- we'd be using RBM but implementing them only partially.

I see. When you're about to zero the page, there's not much point in 
splitting the operation into Prepare/Start/Complete stages anyway. 
You're not actually doing any I/O. Perhaps it's best to have a separate 
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" 
function that does the same as 
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

> The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
> right time and makes sense to process as a batch, so it becomes a
> flag.

+1

>> Putting all that together, I propose:
>>
>> /*
>>    * Initiate reading a block from disk to the buffer cache.
>>    *
>>    * XXX: Until we have async I/O, this just allocates the buffer in the buffer
>>    * cache. The actual I/O happens in WaitReadBuffer().
>>    */
>> Buffer
>> StartReadBuffer(BufferManagerRelation bmr,
>>                                  ForkNumber forkNum,
>>                                  BlockNumber blockNum,
>>                                  BufferAccessStrategy strategy,
>>                                  ReadBufferMode mode,
>>                                  bool *foundPtr);
>>
>> /*
>>    * Wait for a read that was started earlier with StartReadBuffer() to finish.
>>    *
>>    * XXX: Until we have async I/O, this is the function that actually performs
>>    * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
>>    * will try to perform all of them in one syscall. Subsequent calls to
>>    * WaitReadBuffer(), for those other buffers, will finish quickly.
>>    */
>> void
>> WaitReadBuffer(Buffer buf);
> 
> I'm confused about where the extra state lives that would allow the
> communication required to build a larger I/O.  In the AIO branch, it
> does look a little more like that, but there is more magic state and
> machinery hiding behind the curtain: the backend's pending I/O list
> builds up a chain of I/Os, and when you try to wait, if it hasn't
> already been submitted to the kernel/bgworkers yet it will be, and
> before that merging will happen.  So you get bigger I/Os without
> having to say so explicitly.

Yeah, I was thinking that there would be a global variable that holds a 
list of operations started with StartReadBuffer().

> For this synchronous version (and hopefully soon a more efficient
> improved version in the AIO branch), we want to take advantage of the
> client's pre-existing and more cheaply obtained knowledge of ranges.

Ok.

> In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
> take place in StartReadBuffers() instead, which would be responsible
> for kicking off asynchronous I/Os (= asking a background worker to
> call pread(), or equivalent fancy kernel async APIs).  One question is
> what it does if it finds a block in the middle that chops the read up,
> or for that matter a segment boundary.  I don't think we have to
> decide now, but the two options seem to be that it starts one single
> I/O and reports its size, making it the client's problem to call again
> with the rest, or that it starts more than one I/O and they are
> somehow chained together so that the caller doesn't know about that
> and can later wait for all of them to finish using just one <handle
> thing>.
> 
> (The reason why I'm not 100% certain how it will look is that the
> real, working code in the aio branch right now doesn't actually expose
> a vector/nblocks bufmgr interface at all, yet.  Andres's original
> prototype had a single-block Start*(), Complete*() design, but a lower
> level of the AIO system notices if pending read operations are
> adjacent and could be merged.  While discussing all this we decided it
> was a bit strange to have lower code deal with allocating, chaining
> and processing lots of separate I/O objects in shared memory, when
> higher level code could often work in bigger ranges up front, and then
> interact with the AIO subsystem with many fewer objects and steps.
> Also, the present simple and lightweight synchronous proposal lacks
> the whole subsystem that could do that by magic.)

Hmm, let's sketch out what this would look like with the approach that 
you always start one I/O and report the size:

/*
  * Initiate reading a range of blocks from disk to the buffer cache.
  *
  * If the pages were already found in the buffer cache, returns true.
  * Otherwise false, and the caller must call WaitReadBufferRange() to
  * wait for the I/O to finish, before accessing the buffers.
  *
  * 'buffers' is a caller-supplied array large enough to hold (endBlk -
  * startBlk) buffers. It is filled with the buffers that the pages are
  * read into.
  *
  * This always starts a read of at least one block, and tries to
  * initiate one read I/O for the whole range if possible. But if the
  * read cannot be performed as a single I/O, a partial read is started,
  * and *endBlk is updated to reflect the range for which the read was
  * started. The caller can make another call to read the rest of the
  * range. A partial read can occur if some, but not all, of the pages
  * are already in the buffer cache, or because the range crosses a
  * segment boundary.
  *
  * XXX: Until we have async I/O, this just allocates the buffers in the
  * buffer cache. And perhaps calls smgrprefetch(). The actual I/O
  * happens in WaitReadBufferRange().
  */
bool
StartReadBufferRange(BufferManagerRelation bmr,
                      ForkNumber forkNum,
                      BlockNumber startBlk,
                      BlockNumber *endBlk,
                      BufferAccessStrategy strategy,
                      Buffer *buffers);

/*
  * Wait for a read that was started earlier with StartReadBufferRange()
  * to finish.
  *
  * XXX: Until we have async I/O, this is the function that actually performs
  * the I/O. StartReadBufferRange already checked that the pages can be
  * read in one preadv() syscall. However, it's possible that another
  * backend performed the read for some of the pages in between. In that
  * case this will perform multiple syscalls, after all.
  */
void
WaitReadBufferRange(Buffer *buffers, int nbuffers, bool zero_on_error);

/*
  * Allocates a buffer for the given page in the buffer cache, and locks
  * the page. No I/O is initiated. The caller must initialize it and mark
  * the buffer dirty before releasing the lock.
  *
  * This is equivalent to ReadBuffer(RBM_ZERO_AND_LOCK) or
  * ReadBuffer(RBM_ZERO_AND_CLEANUP_LOCK).
  */
Buffer
ZeroBuffer(BufferManagerRelation bmr,
           ForkNumber forkNum,
           BlockNumber blockNum,
           BufferAccessStrategy strategy,
           bool cleanup_lock);

This range-oriented API fits the callers pretty well: the streaming read 
API works with block ranges already. There is no need for separate 
Prepare and Start steps.

One weakness is that if StartReadBufferRange() finds that the range is 
"chopped up", it needs to return and throw away the work it had to do to 
look up the next buffer. So in the extreme case that every other block 
in the range is in the buffer cache, each call would look up two buffers 
in the buffer cache, startBlk and startBlk + 1, but only return one 
buffer to the caller.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok. It feels surprising to have three steps. I understand that you need
> two steps, one to start the I/O and another to wait for them to finish,
> but why do you need separate Prepare and Start steps? What can you do in
> between them? (You explained that. I'm just saying that that's my
> initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well.  First, some more
detail about how a two-step version would work:

* we only want to start one I/O in a StartReadBuffers() call, because
otherwise it is hard/impossible for the caller to cap I/O concurrency
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/
("H" = hit, "I" = miss, I/O) in one call
* in the asynchronous version, "I" in that pattern means we got
BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that it's not valid, not
BM_IO_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS
until the later Complete/Wait call (and then the answer might change,
but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that
StartReadBuffers() didn't accept
* that's actually quite easy, you just use the rest to create a new
range in the next slot
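
To make the /^H*I*H*/ shape concrete, here is a toy standalone sketch
(illustration only, not code from the patch) of how long a prefix a single
such call could take while starting at most one I/O:

#include <stdbool.h>

/*
 * Toy sketch: given a hit/miss pattern (true = buffer cache hit), return
 * the length of the longest prefix of the form hits, then one contiguous
 * run of misses (the single I/O), then more hits.
 */
static int
one_io_prefix_sketch(const bool *hit, int n)
{
    int         i = 0;

    while (i < n && hit[i])     /* leading hits */
        i++;
    while (i < n && !hit[i])    /* the single run of misses */
        i++;
    while (i < n && hit[i])     /* trailing hits */
        i++;
    return i;
}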

Previously I thought the requirement to deal with buffering future
stuff that StartReadBuffers() couldn't accept yet was a pain, and life
became so much simpler once I deleted all that and exposed
PrepareReadBuffer() to the calling code.  Perhaps I just hadn't done a
good enough job of that.

The other disadvantage you reminded me of was the duplicate buffer
lookup in certain unlucky patterns, which I had forgotten about in my
previous email.  But I guess it's not fatal to the idea and there is a
potential partial mitigation.  (See below).

A third thing was the requirement for communication between
StartReadBuffers() and CompleteReadBuffers(), for which I originally had an
"opaque" object that the caller had to keep around to hold private
state.  It seemed nice to go back to talking just about buffer
numbers, but that's not really an argument for anything...

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out...  more soon.

> I see. When you're about to zero the page, there's not much point in
> splitting the operation into Prepare/Start/Complete stages anyway.
> You're not actually doing any I/O. Perhaps it's best to have a separate
> "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
> function that does the same as
> ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page
generates I/O if it has to evict a dirty buffer.  Nothing in this code
does anything fancy about that, but imagine some hypothetical future
thing that manages to do that asynchronously -- then we might want to
take advantage of the ability to stream even a zeroed page, ie doing
something ahead of time?  Just a thought for another day, and perhaps
that is just an argument for including it in the streaming read API,
but it doesn't mean that the bufmgr.c API can't be as you say.

> One weakness is that if StartReadBufferRange() finds that the range is
> "chopped up", it needs to return and throw away the work it had to do to
> look up the next buffer. So in the extreme case that every other block
> in the range is in the buffer cache, each call would look up two buffers
> in the buffer cache, startBlk and startBlk + 1, but only return one
> buffer to the caller.

Yeah, right.  This was one of the observations that influenced my
PrepareReadBuffer() three-step thing that I'd forgotten.  To spell
that out with an example, suppose the buffer pool contains every odd
numbered block.  Successive StartReadBuffers() calls would process
"HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do
anything with for a block we'll look up in the buffer pool again in
the next call.  With the PrepareReadBuffer() design, that miss just
starts a new range and we don't have to look it up again.  Hmm, I
suppose that could be mitigated somewhat with ReadRecentBuffer() if we
can find somewhere decent to store it.

BTW it was while thinking about and testing cases like that that I
found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/
extremely useful.  It can kick out every second page or any other
range-chopping scenario you can express in a WHERE clause.  I would
quite like to get that tool into the tree...



Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

Thanks for working on this!

On Wed, 10 Jan 2024 at 07:14, Thomas Munro <thomas.munro@gmail.com> wrote:

> Thanks!  I committed the patches up as far as smgr.c before the
> holidays.  The next thing to commit would be the bufmgr.c one for
> vectored ReadBuffer(), v5-0001.  Here's my response to your review of
> that, which triggered quite a few changes.
>
> See also new version of the streaming_read.c patch, with change list
> at end.  (I'll talk about v5-0002, the SMgrRelation lifetime one, over
> on the separate thread about that where Heikki posted a better
> version.)

I have a couple of comments / questions.

0001-Provide-vectored-variant-of-ReadBuffer:

- Do we need to pass the hit variable to ReadBuffer_common()? I think
it can be just declared in the ReadBuffer_common() now.


0003-Provide-API-for-streaming-reads-of-relations:

- Do we need to re-think about getting a victim buffer logic?
StrategyGetBuffer() function errors if it can not find any unpinned
buffers, this can be more common in the async world since we pin
buffers before completing the read (while looking ahead).

- If the returned block from the callback is an invalid block,
pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
be cases like the returned block being an invalid block but we should
continue to read after this invalid block?

- max_pinned_buffers and pinned_buffers_trigger variables are set in
the initialization part (in the
pg_streaming_read_buffer_alloc_internal() function) then they do not
change. In some cases there could be no acquirable buffers to pin
while initializing the pgsr (LimitAdditionalPins() set
max_pinned_buffers to 1) but while the read is continuing there could
be chances to create larger reads (other consecutive reads are
finished while this read is continuing). Do you think that trying to
reset max_pinned_buffers and pinned_buffers_trigger to have higher
values after the initialization to have larger reads make sense?

+        /* Is there a head range that we can't extend? */
+        head_range = &pgsr->ranges[pgsr->head];
+        if (head_range->nblocks > 0 &&
+            (!need_complete ||
+             !head_range->need_complete ||
+             head_range->blocknum + head_range->nblocks != blocknum))
+        {
+            /* Yes, time to start building a new one. */
+            head_range = pg_streaming_read_new_range(pgsr);

- I think if both need_complete and head_range->need_complete are
false, we can extend the head range regardless of the consecutiveness
of the blocks.


0006-Allow-streaming-reads-to-ramp-up-in-size:

- ramp_up_pin_limit variable is declared as an int but we do not check
the overflow while doubling it. This could be a problem in longer
reads.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> 0001-Provide-vectored-variant-of-ReadBuffer:
>
> - Do we need to pass the hit variable to ReadBuffer_common()? I think
> it can be just declared in the ReadBuffer_common() now.

Right, thanks!  Done, in the version I'll post shortly.

> 0003-Provide-API-for-streaming-reads-of-relations:
>
> - Do we need to re-think about getting a victim buffer logic?
> StrategyGetBuffer() function errors if it can not find any unpinned
> buffers, this can be more common in the async world since we pin
> buffers before completing the read (while looking ahead).

Hmm, well that is what the pin limit machinery is supposed to be the
solution to.  It has always been possible to see that error if
shared_buffers is too small, so that your backends can't pin what they
need to make progress because there are too many other backends, the
whole buffer pool is pinned and there is nothing available to
steal/evict.  Here, sure, we pin more stuff per backend, but not more
than a "fair share", that is, Buffers / max backends, so it's not any
worse, is it?  Well maybe it's marginally worse in some case, for
example if a query that uses many streams has one pinned buffer per
stream (which we always allow) where before we'd have acquired and
released pins in a slightly different sequence or whatever, but there
is always going to be a minimum shared_buffers that will work at all
for a given workload and we aren't changing it by much if at all here.
If you're anywhere near that limit, your performance must be so bad
that it'd only be a toy setting anyway.  Does that sound reasonable?

Note that this isn't the first to use multi-pin logic or that limit
mechanism: that was the new extension code that shipped in 16.  This
will do that more often, though.
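
To spell out the arithmetic, the "fair share" idea amounts to something like
this (a simplified standalone sketch of the existing LimitAdditionalPins()
idea, not the actual code; names are made up):

/*
 * Simplified sketch of the per-backend pin cap: each backend may pin about
 * shared_buffers / max_backends buffers, but is always allowed at least one.
 */
static int
additional_pins_allowed_sketch(int nbuffers, int max_backends,
                               int already_pinned)
{
    int         fair_share = nbuffers / max_backends;

    if (fair_share < 1)
        fair_share = 1;
    if (already_pinned >= fair_share)
        return 0;
    return fair_share - already_pinned;
}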

> - If the returned block from the callback is an invalid block,
> pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
> be cases like the returned block being an invalid block but we should
> continue to read after this invalid block?

Yeah, I think there will be, and I think we should do it with some
kind of reset/restart function.  I don't think we need it for the
current users so I haven't included it yet (there is a maybe-related
discussion about reset for another reason, I think Melanie has an
idea about that), but I think something like that will be useful for
future stuff like streaming recovery, where you can run out of WAL to
read but more will come via the network soon.

> - max_pinned_buffers and pinned_buffers_trigger variables are set in
> the initialization part (in the
> pg_streaming_read_buffer_alloc_internal() function) then they do not
> change. In some cases there could be no acquirable buffers to pin
> while initializing the pgsr (LimitAdditionalPins() set
> max_pinned_buffers to 1) but while the read is continuing there could
> be chances to create larger reads (other consecutive reads are
> finished while this read is continuing). Do you think that trying to
> reset max_pinned_buffers and pinned_buffers_trigger to have higher
> values after the initialization to have larger reads make sense?

That sounds hard!  You're right that in the execution of a query there
might well be cases like that (inner and outer scan of a hash join
don't actually run at the same time, likewise for various other plan
shapes), and something that would magically and dynamically balance
resource usage might be ideal, but I don't know where to begin.
Concretely, as long as your buffer pool is measured in gigabytes and
your max backends is measured in hundreds, the per backend pin limit
should actually be fairly hard to hit anyway, as it would be in the
thousands.  So I don't think it is as important as other resource
usage balance problems that we also don't attempt (memory, CPU, I/O
bandwidth).

> +        /* Is there a head range that we can't extend? */
> +        head_range = &pgsr->ranges[pgsr->head];
> +        if (head_range->nblocks > 0 &&
> +            (!need_complete ||
> +             !head_range->need_complete ||
> +             head_range->blocknum + head_range->nblocks != blocknum))
> +        {
> +            /* Yes, time to start building a new one. */
> +            head_range = pg_streaming_read_new_range(pgsr);
>
> - I think if both need_complete and head_range->need_complete are
> false, we can extend the head range regardless of the consecutiveness
> of the blocks.

Yeah, I think we can experiment with ideas like that.  Not done yet
but I'm thinking about it -- more shortly.

> 0006-Allow-streaming-reads-to-ramp-up-in-size:
>
> - ramp_up_pin_limit variable is declared as an int but we do not check
> the overflow while doubling it. This could be a problem in longer
> reads.

But it can't get very high, because eventually it exceeds
max_pinned_buffers, which is anchored to the ground by various small
limits.
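
To spell that out, the doubling stays bounded along these lines (a standalone
sketch using field names borrowed from the patch, with everything else
simplified):

typedef struct RampSketch
{
    int         ramp_up_pin_limit;      /* current look-ahead distance */
    int         max_pinned_buffers;     /* hard cap from the pin limit logic */
} RampSketch;

/*
 * Sketch: double the ramp-up limit after each read, but never past
 * max_pinned_buffers, so the int stays small and cannot overflow.
 */
static void
ramp_up_sketch(RampSketch *pgsr)
{
    if (pgsr->ramp_up_pin_limit < pgsr->max_pinned_buffers)
    {
        pgsr->ramp_up_pin_limit *= 2;
        if (pgsr->ramp_up_pin_limit > pgsr->max_pinned_buffers)
            pgsr->ramp_up_pin_limit = pgsr->max_pinned_buffers;
    }
}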



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Fri, Jan 12, 2024 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Ok. It feels surprising to have three steps. I understand that you need
> > two steps, one to start the I/O and another to wait for them to finish,
> > but why do you need separate Prepare and Start steps? What can you do in
> > between them? (You explained that. I'm just saying that that's my
> > initial reaction when seeing that API. It is surprising.)
[...]
> OK, I'm going to try the two-step version (again) with interfaces
> along the lines you sketched out...  more soon.

Here's the two-step version.  The streaming_read.c API is unchanged, but
the bufmgr.c API now has only the following extra functions:

  bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
  WaitReadBuffers(ReadBuffersOperation *op)

That is, the PrepareReadBuffer() step is gone.

StartReadBuffers() updates *nblocks to the number actually processed,
which is always at least one.  If it returns true, then you must call
WaitReadBuffers().  When it finds a 'hit' (doesn't need I/O), then
that one final (or only) buffer is processed, but no more.
StartReadBuffers() always conceptually starts 0 or 1 I/Os.  Example:
if you ask for 16 blocks, and it finds two misses followed by a hit,
it'll set *nblocks = 3, smgrprefetch() the 2 missing blocks, and
smgrreadv() them in WaitReadBuffers().  The caller can't really tell that
the third block was a hit.  The only case it can distinguish is if the
first one was a hit, and then it returns false and sets *nblocks = 1.

This arrangement, where the results include the 'boundary' block that
ends the readable range, avoids the double-lookup problem we discussed
upthread.  I think it should probably also be able to handle multiple
consecutive 'hits' at the start of a sequence, but in this version I
kept it simpler.  It couldn't ever handle more than one after an I/O
range though, because it can't guess if the one after will be a hit or
a miss.  If it turned out to be a miss, we don't want to start a
second I/O, so unless we decide that we're happy unpinning and
re-looking-up next time, it's better to give up then.  Hence the idea
of including the hit as a bonus block on the end.

It took me a long time but I eventually worked my way around to
preferring this way over the 3 step version.  streaming_read.c now has
to do a bit more work including sometimes 'ungetting' a block (ie
deferring one that the callback has requested until next time), to
resolve some circularities that come up with flow control.  But I
suspect you'd probably finish up having to deal with 'short' reads
anyway, because in the asynchronous future, in a three-step version,
the StartReadBuffers() (as 2nd step) might also be short when it fails
to get enough BM_IO_IN_PROGRESS flags, so you have to deal with some
version of these problems anyway.  Thoughts?

I am still thinking about how to improve the coding in
streaming_read.c, ie to simplify and beautify the main control loop
and improve the flow control logic.  And looking for interesting test
cases to hit various conditions in it and try to break it.  And trying
to figure out how this read-coalescing and parallel seq scan's block
allocator might interfere with each other to produce non-ideal patterns
of system calls.

Here are some example strace results generated by a couple of simple
queries.  See CF #4426 for pg_buffercache_invalidate().

=== Sequential scan example ===

create table big as select generate_series(1, 10000000);

select count(*) from big;

pread64(81, ...) = 8192   <-- starts small
pread64(81, ...) = 16384
pread64(81, ...) = 32768
pread64(81, ...) = 65536
pread64(81, ...) = 131072 <-- fully ramped up size reached
preadv(81, ...) = 131072  <-- more Vs seen as buffers fill up/fragments
preadv(81, ...) = 131072
...repeating...
preadv(81, ...) = 131072
preadv(81, ...) = 131072
pread64(81, ...) = 8192   <-- end fragment

create table small as select generate_series(1, 100000);

select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('small')
   and relblocknumber % 3 != 0;  -- <-- kick out every 3rd block

select count(*) from small;

preadv(88, ...) = 16384  <-- just the 2-block fragments we need to load
preadv(88, ...) = 16384
preadv(88, ...) = 16384

=== Bitmap heapscan example ===

create table heap (i int primary key);
insert into heap select generate_series(1, 1000000);

select bool_and(pg_buffercache_invalidate(bufferid))
 from pg_buffercache
where relfilenode = pg_relation_filenode('heap');

select count(i) from heap where i in (10, 1000, 10000, 100000) or i in
(20, 200, 2000, 200000);

pread64(75, ..., 8192, 0) = 8192
fadvise64(75, 32768, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 65536, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 32768) = 8192
pread64(75, ..., 8192, 65536) = 8192
fadvise64(75, 360448, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 3620864, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 7241728, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 360448) = 8192
pread64(75, ..., 8192, 3620864) = 8192
pread64(75, ..., 8192, 7241728) = 8192

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Robert Haas
Date:
On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's the 2 step version.  The streaming_read.c API is unchanged, but
> the bufmgr.c API now has only the following extra functions:
>
>   bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
>   WaitReadBuffers(ReadBuffersOperation *op)

I wonder if there are semi-standard names that people use for this
kind of API. Somehow I like "start" and "finish" or "start" and
"complete" better than "start" and "wait". But I don't really know
what's best. If there's a usual practice, it'd be good to adhere to
it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Tue, Feb 27, 2024 at 5:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's the 2 step version.  The streaming_read.c API is unchanged, but
> > the bufmgr.c API now has only the following extra functions:
> >
> >   bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
> >   WaitReadBuffers(ReadBuffersOperation *op)
>
> I wonder if there are semi-standard names that people use for this
> kind of API. Somehow I like "start" and "finish" or "start" and
> "complete" better than "start" and "wait". But I don't really know
> what's best. If there's a usual practice, it'd be good to adhere to
> it.

I think "complete" has a subtly different meaning, which is why I
liked Heikki's suggestion.  One way to look at it is that only one
backend can "complete" a read, but any number of backends can "wait".
But I don't have strong views on this.  It feels like this API is
relatively difficult to use directly anyway, so almost all users will
go through the streaming read API.

Here's a new version with two main improvements.  (Note that I'm only
talking about 0001-0003 here, the rest are useful for testing but are
just outdated versions of patches that have their own threads.)

1.  Melanie discovered small regressions in all-cached simple scans.
Here's a better look-ahead distance control algorithm that addresses
that.  First let me state the updated goals of the algorithm:

 A) for all-hit scans, pin very few buffers, since that can't help if
we're not doing any I/O
 B) for all-miss sequential scans, pin only as many buffers as it
takes to build full-sized I/Os, since fully sequential scans are left
to the OS to optimise for now (ie no "advice")
 C) for all-miss random scans, pin as many buffers as it takes to
reach our I/O concurrency level

In all cases, respect the per-backend pin limit as a last resort limit
(roughly NBuffers / max_connections), but we're now actively trying
*not* to use so many.

For patterns in between the A, B, C extremes, do something in between.
The way the new algorithm attempts to classify the scan adaptively
over time is as follows:

 * look ahead distance starts out at one block (behaviour A)
 * every time we start an I/O, we double the distance until we reach
the max pin limit (behaviour C), or if we're not issuing "advice"
because sequential access is detected, until we reach the
MAX_BUFFERS_PER_TRANSFER limit (behaviour B)
 * every time we get a hit, we decrement the distance by one (we move
slowly back to behaviour A)
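
Expressed as a sketch (names illustrative; the real patch also handles
the initial ramp-up and clamps against the pin limit):

    if (hit)
    {
        /* Move slowly back towards behaviour A. */
        if (distance > 1)
            distance--;
    }
    else                        /* we just started an I/O */
    {
        if (issuing_advice)
            distance = Min(distance * 2, max_pinned_buffers);         /* -> C */
        else
            distance = Min(distance * 2, MAX_BUFFERS_PER_TRANSFER);   /* -> B */
    }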

Query to observe a system transitioning A->B->A->B, when doing a full
scan that has ~50 contiguous blocks already in shared buffers
somewhere in the middle:

create extension pg_prewarm;
create extension pg_buffercache;
set max_parallel_workers_per_gather = 0;

create table t (i int);
insert into t select generate_series(1, 100000);
select pg_prewarm('t');
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('t')
   and (relblocknumber between 0 and 100 or relblocknumber > 150);

select count(*) from t;

pread(31,...) = 8192 (0x2000)      <--- start small (A)
preadv(31,...) = 16384 (0x4000)    <--- ramp up...
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 49152 (0xc000)    <--- end of misses, decay to A
pread(31,...) = 8192 (0x2000)      <--- start small again (A)
preadv(31,...) = 16384 (0x4000)
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000)  <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 122880 (0x1e000)  <-- end of relation

Note that this never pins more than 16 buffers, because it's case B,
not case C in the description above.  There is no benefit in looking
further ahead if you're relying on the kernel's sequential
prefetching.

The fully cached regression Melanie reported now stays entirely in A.
The previous coding would ramp up to high look-ahead distance for no
good reason and never ramp down, so it was wasteful and slightly
slower than master.

2.  Melanie and I noticed, while discussing the pre-existing ad hoc
prefetching in bitmap heap scan, that this thing should respect the
{effective,maintenance}_io_concurrency setting of the containing
tablespace.  Note special cases to avoid problems while stream-reading
pg_tablespace itself and pg_database in backend initialisation.
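
Concretely, that means choosing the stream's I/O depth from the
containing tablespace at stream creation time, something like the
following sketch (the get_tablespace_*_io_concurrency() functions
already exist; the flag and variable names here are illustrative):

    /*
     * The special cases mentioned above bypass this lookup while reading
     * pg_tablespace and pg_database during backend initialisation.
     */
    if (flags & STREAMING_READ_MAINTENANCE)
        max_ios = get_tablespace_maintenance_io_concurrency(tablespace_oid);
    else
        max_ios = get_tablespace_io_concurrency(tablespace_oid);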

There is a third problem that I'm still working on: the behaviour for
very small values of effective_io_concurrency isn't quite right, as
discussed in detail by Tomas and Melanie on the bitmap heapscan
thread.  The attached makes 0 do approximately the right thing (though
I'm hoping to make it less special), but other small numbers aren't
quite right yet -- 1 is still issuing a useless fadvise at the wrong
time, and 2 is working in groups of N at a time instead of
interleaving as you might perhaps expect.  These are probably
side-effects of my focusing on coalescing large reads and losing sight
of the small cases.  I need a little more adaptivity and generality in
the algorithm at the small end, not least because 1 is the default
value.  I'll share a patch to improve that very soon.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Dilip Kumar
Date:
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
Hi Thomas,

I am planning to review this patch set, so started going through 0001,
I have a question related to how we are issuing smgrprefetch in
StartReadBuffers()

+ if (operation->io_buffers_len > 0)
+ {
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
  {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ /*
+ * In theory we should only do this if PrepareReadBuffers() had to
+ * allocate new buffers above.  That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity.  Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
  }

 This is always issuing smgrprefetch starting with the input blockNum,
shouldn't we pass the first blockNum which we did not find in the
 Buffer pool?  So basically in the loop above this call where we are
doing PrepareReadBuffer() we should track the first blockNum for which
 the found is not true and pass that blockNum into the smgrprefetch()
as a first block right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am planning to review this patch set, so started going through 0001,
> I have a question related to how we are issuing smgrprefetch in
> StartReadBuffers()

Thanks!

> + /*
> + * In theory we should only do this if PrepareReadBuffers() had to
> + * allocate new buffers above.  That way, if two calls to
> + * StartReadBuffers() were made for the same blocks before
> + * WaitReadBuffers(), only the first would issue the advice.
> + * That'd be a better simulation of true asynchronous I/O, which
> + * would only start the I/O once, but isn't done here for
> + * simplicity.  Note also that the following call might actually
> + * issue two advice calls if we cross a segment boundary; in a
> + * true asynchronous version we might choose to process only one
> + * real I/O at a time in that case.
> + */
> + smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
>   }
>
>  This is always issuing smgrprefetch starting with the input blockNum,
> shouldn't we pass the first blockNum which we did not find in the
>  Buffer pool?  So basically in the loop above this call where we are
> doing PrepareReadBuffer() we should track the first blockNum for which
>  the found is not true and pass that blockNum into the smgrprefetch()
> as a first block right?

I think you'd be right if StartReadBuffers() were capable of
processing a sequence consisting of a hit followed by misses, but
currently it always gives up after the first hit.  That is, it always
processes some number of misses (0-16) and then at most one hit.  So
for now the variable would always turn out to be the same as blockNum.

The reason is that I wanted to allow "full sized" read system calls
to form.  If you said "hey please read these 16 blocks" (I'm calling
that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
then it could only form a read of 14 blocks, but there might be more
blocks that could be read after those.  We would have some arbitrary
shorter read system calls, when we wanted to make them all as big as
possible.  So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15
and it says "only read 1", and you call again and says "read 16!"
(assuming 2 more were readable after the original range we started
with).  Then physical reads are maximised.  Maybe there is some nice
way to solve that, but I thought this way was the simplest (and if
there is some instruction-cache-locality/tight-loop/perf reason why we
should work harder to find ranges of hits, it could be for later).
Does that make sense?



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Tue, Mar 12, 2024 at 7:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> possible.  So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15

Oops, typo worth correcting: s/15/16/.  Point being that the caller is
interested in more blocks after the original 16, so it uses 16 again
when it calls back (because that's the size of the Buffer array it
provides).



Re: Streaming I/O, vectored I/O (WIP)

From
Dilip Kumar
Date:
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I think you'd be right if StartReadBuffers() were capable of
> processing a sequence consisting of a hit followed by misses, but
> currently it always gives up after the first hit.  That is, it always
> processes some number of misses (0-16) and then at most one hit.  So
> for now the variable would always turn out to be the same as blockNum.
>
Okay, then shouldn't this "if (found)" block immediately break the
loop, so that when we hit a block we just return that block?  What you
explained makes sense, but with the current code, if the first few
blocks are hits followed by misses, then we will issue the
smgrprefetch() for the initial hit blocks as well.

+ if (found)
+ {
+ /*
+ * Terminate the read as soon as we get a hit.  It could be a
+ * single buffer hit, or it could be a hit that follows a readable
+ * range.  We don't want to create more than one readable range,
+ * so we stop here.
+ */
+ actual_nblocks = operation->nblocks = *nblocks = i + 1;    (Dilip: I
think we should break after this?)
+ }
+ else
+ {
+ /* Extend the readable range to cover this block. */
+ operation->io_buffers_len++;
+ }
+ }

> The reason is that I wanted to allows "full sized" read system calls
> to form.  If you said "hey please read these 16 blocks" (I'm calling
> that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
> then it could only form a read of 14 blocks, but there might be more
> blocks that could be read after those.  We would have some arbitrary
> shorter read system calls, when we wanted to make them all as big as
> possible.  So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15
> and it says "only read 1", and you call again and says "read 16!"
> (assuming 2 more were readable after the original range we started
> with).  Then physical reads are maximised.  Maybe there is some nice
> way to solve that, but I thought this way was the simplest (and if
> there is some instruction-cache-locality/tight-loop/perf reason why we
> should work harder to find ranges of hits, it could be for later).
> Does that make sense?

Understood, I think this makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Tue, Mar 12, 2024 at 11:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> + actual_nblocks = operation->nblocks = *nblocks = i + 1;
> (Dilip: I think we should break after this?)

On the next loop iteration, i < actual_nblocks is false, so the loop
terminates.  But yeah, that was a bit obscure, so I have added an
explicit break.
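
With the explicit break, the hunk quoted above becomes roughly the
following (a sketch, not the exact committed text):

    if (found)
    {
        /*
         * Terminate the read as soon as we get a hit.  It could be a
         * single buffer hit, or it could be a hit that follows a readable
         * range.  We don't want to create more than one readable range,
         * so we stop here.
         */
        actual_nblocks = operation->nblocks = *nblocks = i + 1;
        break;          /* the explicit break mentioned above */
    }
    else
    {
        /* Extend the readable range to cover this block. */
        operation->io_buffers_len++;
    }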

Here is also a new version of the streaming_read.c patch.  This
is based on feedback from the bitmap heap scan thread, where Tomas and
Melanie noticed some problems when comparing effective_io_concurrency
= 0 and 1 against master.  The attached random.sql exercises a bunch
of random scans with different settings, and random.txt shows the
resulting system calls, unpatched vs patched.  Looking at it that way,
I was able to tweak the coding until I had the same behaviour as
master, except with fewer system calls wherever possible due to
coalescing or suppressing sequential advice.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
I am planning to push the bufmgr.c patch soon.  At that point the new
API won't have any direct callers yet, but the traditional
ReadBuffer() family of functions will internally reach
StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
ZeroBuffer() or nothing as appropriate.  Any more thoughts or
objections?  Naming, semantics, correctness of buffer protocol,
sufficiency of comments, something else?



Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I am planning to push the bufmgr.c patch soon.  At that point the new
> API won't have any direct callers yet, but the traditional
> ReadBuffer() family of functions will internally reach
> StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
> ZeroBuffer() or nothing as appropriate.  Any more thoughts or
> objections?  Naming, semantics, correctness of buffer protocol,
> sufficiency of comments, something else?

+    if (StartReadBuffers(bmr,
+                         &buffer,
+                         forkNum,
+                         blockNum,
+                         &nblocks,
+                         strategy,
+                         flags,
+                         &operation))
+        WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers when 'mode !=
RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK' or am I
missing something?

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer function does
with a comment.

+    if (nblocks == 0)
+        return;                    /* nothing to do */
It is guaranteed that nblocks will be bigger than 0. Can't we just use
Assert(operation->io_buffers_len > 0);?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
Some quick comments:

On 12/03/2024 15:02, Thomas Munro wrote:
> src/backend/storage/aio/streaming_read.c
> src/include/storage/streaming_read.h

Standard file header comments missing.

It would be nice to have a comment at the top of streaming_read.c, 
explaining at a high level how the circular buffer, lookahead and all 
that works. Maybe even some diagrams.

For example, what is head and what is tail? Before reading the code, I 
assumed that 'head' was the next block range to return in 
pg_streaming_read_buffer_get_next(). But I think it's actually the other 
way round?

> /*
>  * Create a new streaming read object that can be used to perform the
>  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
>  * Internally, it generates larger vectored reads where possible by looking
>  * ahead.
>  */
> PgStreamingRead *
> pg_streaming_read_buffer_alloc(int flags,
>                                void *pgsr_private,
>                                size_t per_buffer_data_size,
>                                BufferAccessStrategy strategy,
>                                BufferManagerRelation bmr,
>                                ForkNumber forknum,
>                                PgStreamingReadBufferCB next_block_cb)

I'm not a fan of the name, especially the 'alloc' part. Yeah, most of 
the work it does is memory allocation. But I'd suggest something like 
'pg_streaming_read_begin' instead.

Do we really need the pg_ prefix in these?

> Buffer
> pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', 
for a shorter name.


> 
>     /*
>      * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
>      * When it is full, there is an empty element between head and tail.  Head
>      * can also be empty (nblocks == 0), therefore we need two extra elements
>      * for non-occupied ranges, on top of max_pinned_buffers to allow for the
>      * maxmimum possible number of occupied ranges of the smallest possible
>      * size of one.
>      */
>     size = max_pinned_buffers + 2;

I didn't understand this explanation for why it's + 2.

>     /*
>      * Skip the initial ramp-up phase if the caller says we're going to be
>      * reading the whole relation.  This way we start out doing full-sized
>      * reads.
>      */
>     if (flags & PGSR_FLAG_FULL)
>         pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
>     else
>         pgsr->distance = 1;

Should this be "Max(MAX_BUFFERS_PER_TRANSFER, 
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than 
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So 
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the other
> way round?

Yeah.  People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >                                                          void *pgsr_private,
> >                                                          size_t per_buffer_data_size,
> >                                                          BufferAccessStrategy strategy,
> >                                                          BufferManagerRelation bmr,
> >                                                          ForkNumber forknum,
> >                                                          PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it.  Done.

> Do we really need the pg_ prefix in these?

Good question.  My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things.  It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff.  So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use  "sr"
or "streaming_read" or "stream".  I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private.  I find it clearer, as a way to indicate that the
provider of the callback "owns" it.  I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm.  The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers.  I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions?  What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.  Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API...  Hmm.  Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()?  Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that.

(Hmm, that'd be a bit like the Windows native file API, where
ReadFile() is synchronous or asynchronous depending on flags.)

> >
> >       /*
> >        * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
> >        * When it is full, there is an empty element between head and tail.  Head
> >        * can also be empty (nblocks == 0), therefore we need two extra elements
> >        * for non-occupied ranges, on top of max_pinned_buffers to allow for the
> >        * maxmimum possible number of occupied ranges of the smallest possible
> >        * size of one.
> >        */
> >       size = max_pinned_buffers + 2;
>
> I didn't understand this explanation for why it's + 2.

I think the logic was right but the explanation was incomplete, sorry.
It needed one gap between head and tail because head == tail means
empty, and another because the tail item was still 'live' for one
extra call until you examined it one more time to notice that all the
buffers had been extracted.  In any case I have now deleted all that.
In this new version I don't need extra space between head and tail at
all, because empty is detected with stream->pinned_buffers == 0, and
tail moves forwards immediately when you consume buffers.

> >       /*
> >        * Skip the initial ramp-up phase if the caller says we're going to be
> >        * reading the whole relation.  This way we start out doing full-sized
> >        * reads.
> >        */
> >       if (flags & PGSR_FLAG_FULL)
> >               pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> >       else
> >               pgsr->distance = 1;
>
> Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

Right, done.

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented.  I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB.  Seems like a reasonable thing to have?  Could also
influence things like bulk write?  (The main problem I have with the
GUC currently is choosing a category, async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range.  So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses.  In this version, there
are no more range objects.  This feels better and occupies and touches
less memory.  See pictures below.

* The 'head range' is replaced by the pending_read_{blocknum,nblocks}
variables, which seems easier to understand.  Essentially the
callback's block numbers are run-length compressed into there until
they can't be, at which point we start a read and start forming a new
pending read (see the sketch just after this list).

* A micro-optimisation arranges for the zero slots to be reused over
and over again if we're doing distance = 1, to avoid rotating through
memory for no benefit; I don't yet know if that pays for itself, it's
just a couple of lines...

* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)
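
The run-length compression mentioned above looks roughly like this
(field and helper names are illustrative, not the exact patch):

    if (stream->pending_read_nblocks > 0 &&
        blocknum == stream->pending_read_blocknum + stream->pending_read_nblocks &&
        stream->pending_read_nblocks < buffer_io_size)
    {
        /* Consecutive with the pending read and it still has room: extend. */
        stream->pending_read_nblocks++;
    }
    else
    {
        /* Can't extend: start what we have and begin a new pending read. */
        if (stream->pending_read_nblocks > 0)
            read_stream_start_pending_read(stream);   /* hypothetical helper */
        stream->pending_read_blocknum = blocknum;
        stream->pending_read_nblocks = 1;
    }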

In pictures, the circular queue arrangement changed from
max_pinned_buffers * fat range objects like this, where only the
per-buffer data was outside the range object (because its size is
variable), and in the worst case all-hit case there were a lot of
ranges with only one buffer in them:

                            ranges                             buf/data

                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+
                    tail -> | 10..10 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 42..44 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 60..60 | buffers[MAX] | op +--+   |  ?  |
                            +--------+--------------+----+  |   +-----+
                    head -> |        |              |    |  |   |  ?  |
                            +--------+--------------+----+  |   +-----+
                            |        |              |    |  +-->|  ?  |
                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+

... to something that you might call a "columnar" layout, where ops
are kicked out to their own array of size max_ios (a much smaller
number), and the buffers, per-buffer data and indexes pointing to
optional I/O objects are in parallel arrays of size
max_pinned_buffers, like this:

                            buffers buf/data buf/io       ios (= ops)

                            +----+  +-----+  +---+        +--------+
                            |    |  |     |  |   |  +---->| 42..44 |
                            +----+  +-----+  +---+  |     +--------+
     oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
                            +----+  +-----+  +---+  | |   +--------+
                            | 42 |  |  ?  |  | 0 +--+ |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 43 |  |  ?  |  |   |    |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 44 |  |  ?  |  |   |    |   |        |
                            +----+  +-----+  +---+    |   +--------+
                            | 60 |  |  ?  |  | 1 +----+
                            +----+  +-----+  +---+
       next_buffer_index -> |    |  |     |  |   |
                            +----+  +-----+  +---+

In other words, there is essentially no waste/padding now, and in the
all-hit case we stay in the zero'th elements of those arrays so they
can stay red hot.  Still working on validating this refactoring with
other patches and test scenarios.  I hope it's easier to understand,
and does a better job of explaining itself.

I'm also still processing a bunch of performance-related fixups mostly
for bufmgr.c sent by Andres off-list (things like: StartReadBuffer()
argument list is too wide, some things need inline, we should only
initialise the op if it will be needed, oh I squashed that last one
into the patch already), after he and Bilal studied some regressions
in cases with no I/O.  And thinking about Bilal's earlier message
(extra read even when we're going to zero, oops, he's quite right
about that) and a patch he sent me for that.  More on those soon.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
[...]
> Other names improved in this version IMHO: pgsr_private ->
> callback_private.  I find it clearer, as a way to indicate that the
> provider of the callback "owns" it.  I also reordered the arguments:
> now it's streaming_read_buffer_begin(..., callback, callback_private,
> per_buffer_data_size), to keep those three things together.

I haven't reviewed the whole patch, but as I was rebasing
bitmapheapscan streaming read user, I found callback_private confusing
because it seems like it is a private callback, not private data
belonging to the callback. Perhaps call it callback_private_data? Also
maybe mention what it is for in the comment above
streaming_read_buffer_begin() and in the StreamingRead structure
itself.

- Melanie



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I haven't reviewed the whole patch, but as I was rebasing
> bitmapheapscan streaming read user, I found callback_private confusing
> because it seems like it is a private callback, not private data
> belonging to the callback. Perhaps call it callback_private_data? Also

WFM.

> maybe mention what it is for in the comment above
> streaming_read_buffer_begin() and in the StreamingRead structure
> itself.

Yeah.  I've tried to improve the comments on all three public
functions.  I also moved the three public functions _begin(), _next(),
_end() to be next to each other after the static helper functions.

Working on perf regression/tuning reports today, more soon...

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 24/03/2024 15:02, Thomas Munro wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
>> for a shorter name.
> 
> Hmm.  The idea of 'buffer' appearing in a couple of names is that
> there are conceptually other kinds of I/O that we might want to
> stream, like raw files or buffers other than the buffer pool, maybe
> even sockets, so this would be part of a family of similar interfaces.
> I think it needs to be clear that this variant gives you buffers.  I'm
> OK with removing "get" but I guess it would be better to keep the
> words in the same order across the three functions?  What about these?
> 
> streaming_read_buffer_begin();
> streaming_read_buffer_next();
> streaming_read_buffer_end();
> 
> Tried like that in this version.  Other ideas would be to make
> "stream" the main noun, buffered_read_stream_begin() or something.
> Ideas welcome.

Works for me, although "streaming_read_buffer" is a pretty long prefix. 
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be 
"STREAMING_READ_BUFFER_MAINTENANCE" as well.

Maybe "buffer_stream_next()"?

> Here are some other changes:
> 
> * I'm fairly happy with the ABC adaptive distance algorithm so far, I
> think, but I spent more time tidying up the way it is implemented.  I
> didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
> so I created a new dense array stream->buffers that behaved as a
> second circular queue.
> 
> * The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
> become the GUC that it always wanted to be: buffer_io_size defaulting
> to 128kB.  Seems like a reasonable thing to have?  Could also
> influence things like bulk write?  (The main problem I have with the
> GUC currently is choosing a category, async resources is wrong....)
> 
> * By analogy, it started to look a bit funny that each range had room
> for a ReadBuffersOperation, and we had enough ranges for
> max_pinned_buffers * 1 block range.  So I booted that out to another
> dense array, of size max_ios.
> 
> * At the same time, Bilal and Andres had been complaining privately
> about 'range' management overheads showing up in perf and creating a
> regression against master on fully cached scans that do nothing else
> (eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
> and no CPU work with the page, a somewhat extreme case but a
> reasonable way to isolate the management costs); having made the above
> change, it suddenly seemed obvious that I should make the buffers
> array the 'main' circular queue, pointing off to another place for
> information required for dealing with misses.  In this version, there
> are no more range objects.  This feels better and occupies and touches
> less memory.  See pictures below.

+1 for all that. Much better!

> * Various indexes and sizes that couldn't quite fit in uint8_t but
> couldn't possibly exceed a few thousand because they are bounded by
> numbers deriving from range-limited GUCs are now int16_t (while I was
> looking for low hanging opportunities to reduce memory usage...)

Is int16 enough though? It seems so, because:

     max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and 
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.

If someone changes those constants though, int16 might overflow and fail 
in weird ways. I'd suggest being more careful here and explicitly clamp 
max_pinned_buffers at PG_INT16_MAX or have a static assertion or 
something. (I think it needs to be somewhat less than PG_INT16_MAX, 
because of the extra "overflow buffers" stuff and some other places 
where you do arithmetic.)

>     /*
>      * We gave a contiguous range of buffer space to StartReadBuffers(), but
>      * we want it to wrap around at max_pinned_buffers.  Move values that
>      * overflowed into the extra space.  At the same time, put -1 in the I/O
>      * slots for the rest of the buffers to indicate no I/O.  They are covered
>      * by the head buffer's I/O, if there is one.  We avoid a % operator.
>      */
>     overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
>     if (overflow > 0)
>     {
>         memmove(&stream->buffers[0],
>                 &stream->buffers[stream->max_pinned_buffers],
>                 sizeof(stream->buffers[0]) * overflow);
>         for (int i = 0; i < overflow; ++i)
>             stream->buffer_io_indexes[i] = -1;
>         for (int i = 1; i < nblocks - overflow; ++i)
>             stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
>     }
>     else
>     {
>         for (int i = 1; i < nblocks; ++i)
>             stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
>     }

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler 
to initialize the array to -1 in streaming_read_buffer_begin(), and 
reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(), 
after the WaitReadBuffers() call. In other words, except when an I/O is 
in progress, keep all the elements at -1, even the elements that are not 
currently in use.

Alternatively, you could remember the first buffer that the I/O applies 
to in the 'ios' array. In other words, instead of pointing from buffer 
to the I/O that it depends on, point from the I/O to the buffer that 
depends on it. The last attached patch implements that approach. I'm not 
wedded to it, but it feels a little simpler.


>         if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
>         {
>             /* Distance ramps up fast (behavior C). */
>             ...
>         }
>         else
>         {
>             /* No advice; move towards full I/O size (behavior B). */
>             ...
>         }

The comment on ReadBuffersOperation says "Declared in public header only 
to allow inclusion in other structs, but contents should not be 
accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too 
long. Perhaps the solution is to redefine ReadBuffersOperation so that 
it consists of two parts: 1st part is filled in by the caller, and 
contains the arguments, and 2nd part is private to bufmgr.c. The 
signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

That would make it OK to read the 'flags' field. It would also allow 
reusing the same ReadBuffersOperation struct for multiple I/Os for the 
same relation; you only need to change the changing parts of the struct 
on each operation.
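
For illustration, roughly this shape (field list abbreviated and partly
hypothetical):

    typedef struct ReadBuffersOperation
    {
        /* Filled in by the caller before StartReadBuffers(). */
        BufferManagerRelation bmr;
        ForkNumber  forknum;
        BufferAccessStrategy strategy;
        Buffer     *buffers;
        BlockNumber blocknum;
        int         nblocks;
        int         flags;

        /* The rest is private workspace owned by bufmgr.c. */
        int         io_buffers_len;
    } ReadBuffersOperation;

    bool StartReadBuffers(ReadBuffersOperation *operation);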


In the attached patch set, the first three patches are your v9 with no 
changes. The last patch refactors away 'buffer_io_indexes' like I 
mentioned above. The others are fixes for some other trivial things that 
caught my eye.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Is int16 enough though? It seems so, because:
>
>      max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
>
> and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
> buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.
>
> If someone changes those constants though, int16 might overflow and fail
> in weird ways. I'd suggest being more careful here and explicitly clamp
> max_pinned_buffers at PG_INT16_MAX or have a static assertion or
> something. (I think it needs to be somewhat less than PG_INT16_MAX,
> because of the extra "overflow buffers" stuff and some other places
> where you do arithmetic.)

Clamp added.
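
For illustration, the clamp is of this general form (the exact safety
margin is a detail of the patch):

    max_pinned_buffers = Min(max_pinned_buffers,
                             PG_INT16_MAX - io_combine_limit - 1);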

> >       /*
> >        * We gave a contiguous range of buffer space to StartReadBuffers(), but
> >        * we want it to wrap around at max_pinned_buffers.  Move values that
> >        * overflowed into the extra space.  At the same time, put -1 in the I/O
> >        * slots for the rest of the buffers to indicate no I/O.  They are covered
> >        * by the head buffer's I/O, if there is one.  We avoid a % operator.
> >        */
> >       overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> >       if (overflow > 0)
> >       {
> >               memmove(&stream->buffers[0],
> >                               &stream->buffers[stream->max_pinned_buffers],
> >                               sizeof(stream->buffers[0]) * overflow);
> >               for (int i = 0; i < overflow; ++i)
> >                       stream->buffer_io_indexes[i] = -1;
> >               for (int i = 1; i < nblocks - overflow; ++i)
> >                       stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >       }
> >       else
> >       {
> >               for (int i = 1; i < nblocks; ++i)
> >                       stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> >       }
>
> Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
> to initialize the array to -1 in streaming_read_buffer_begin(), and
> reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
> after the WaitReadBuffers() call. In other words, except when an I/O is
> in progress, keep all the elements at -1, even the elements that are not
> currently in use.

Yeah that wasn't nice and I had already got as far as doing exactly
that ↑ on my own, but your second idea ↓ is better!

> Alternatively, you could remember the first buffer that the I/O applies
> to in the 'ios' array. In other words, instead of pointing from buffer
> to the I/O that it depends on, point from the I/O to the buffer that
> depends on it. The last attached patch implements that approach. I'm not
> wedded to it, but it feels a little simpler.

Yeah, nice improvement.

> >               if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> >               {
> >                       /* Distance ramps up fast (behavior C). */
> >                       ...
> >               }
> >               else
> >               {
> >                       /* No advice; move towards full I/O size (behavior B). */
> >                       ...
> >               }
>
> The comment on ReadBuffersOperation says "Declared in public header only
> to allow inclusion in other structs, but contents should not be
> accessed", but here you access the 'flags' field.
>
> You also mentioned that the StartReadBuffers() argument list is too
> long. Perhaps the solution is to redefine ReadBuffersOperation so that
> it consists of two parts: 1st part is filled in by the caller, and
> contains the arguments, and 2nd part is private to bufmgr.c. The
> signature for StartReadBuffers() would then be just:
>
> bool StartReadBuffers(ReadBuffersOperation *operation);

Yeah.  I had already got as far as doing this on the regression
hunting expedition, but I kept some arguments for frequently changing
things, eg blocknum.  It means that the stuff that never changes is in
there, and the stuff that changes each time doesn't have to be written
to memory at all.

> That would make it OK to read the 'flags' field. It would also allow
> reusing the same ReadBuffersOperation struct for multiple I/Os for the
> same relation; you only need to change the changing parts of the struct
> on each operation.

Right.  Done.

> In the attached patch set, the first three patches are your v9 with no
> changes. The last patch refactors away 'buffer_io_indexes' like I
> mentioned above. The others are fixes for some other trivial things that
> caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better
name for the GUC.  We came up with "io_combine_limit".  It's easier to
document a general purpose limit than to explain what "buffer_io_size"
does (particularly since the set of affected features will grow over
time but starts so small).  I'll feel better about using it to control
bulk writes too, with that name.

I collapsed the various memory allocations into one palloc.  The
buffer array is now a buffers[FLEXIBLE_ARRAY_MEMBER].

I got rid of "finished" (now represented by distance == 0, I was
removing branches and variables).  I got rid of "started", which can
now be deduced (used for suppressing advice when you're calling
_next() because you need a block and we need to read it immediately),
see the function argument suppress_advice.

Here is a new proposal for the names, updated in v10:

read_stream_begin_relation()
read_stream_next_buffer()
read_stream_end()

I think we'll finish up with different 'constructor' functions for
different kinds of streams.  For example I already want one that can
provide a multi-relation callback for use by recovery (shown in v1).
Others might exist for raw file access, etc.  The defining
characteristic of this one is that it accesses one specific
relation/fork.  Well, _relfork() might be more accurate but less easy
on the eye.  I won't be surprised if people not following this thread
have ideas after commit; it's certainly happened before something gets
renamed in beta and I won't mind a bit if that happens...
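
To show how those names read to a user, here is a minimal hypothetical
caller that streams every block of one fork, pg_prewarm-style.  The
ReadStream struct name, the READ_STREAM_FULL flag, the callback
signature and the argument order are assumptions based on the
discussion upthread, not necessarily the exact v10 patch.

    typedef struct
    {
        BlockNumber next;
        BlockNumber nblocks;
    } AllBlocksState;

    static BlockNumber
    all_blocks_cb(ReadStream *stream, void *callback_private_data,
                  void *per_buffer_data)
    {
        AllBlocksState *state = callback_private_data;

        if (state->next >= state->nblocks)
            return InvalidBlockNumber;      /* end of stream */
        return state->next++;
    }

    static void
    read_whole_fork(Relation rel)
    {
        AllBlocksState state = {0, RelationGetNumberOfBlocks(rel)};
        ReadStream *stream;
        Buffer      buf;

        stream = read_stream_begin_relation(READ_STREAM_FULL, /* whole-rel hint */
                                            NULL,             /* no strategy */
                                            rel,
                                            MAIN_FORKNUM,
                                            all_blocks_cb,
                                            &state,
                                            0);       /* no per-buffer data */

        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
            ReleaseBuffer(buf);             /* just pin and unpin each block */

        read_stream_end(stream);
    }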

I fixed a thinko in the new ReadBuffer() implementation mentioned
before (thanks to Bilal for pointing this out): it didn't handle the
RBM_ZERO_XXX flags properly.  Well, it did work and the tests passed,
but it performed a useless read first.  I probably got mixed up when I
removed the extended interface which was capable of streaming zeroed
buffers but this simple one isn't and it wasn't suppressing the read
as it would need to.  Later I'll propose to add that back in for
recovery.

I fixed a recently added thinko in the circular queue: I mistakenly
thought I didn't need a spare gap between head and tail anymore
because now we never compare them, since we track the number of pinned
buffers instead, but after read_stream_next_buffer() returns, users of
per-buffer data need that data to remain valid until the next call.
So the recent refactoring work didn't survive contact with Melanie's
BHS work, which uses that.  We need to "wipe" the previous (spare)
one, which can't possibly be in use yet, and if anyone ever sees 0x7f
in their per-buffer data, it will mean that they illegally accessed
the older value after a new call to read_stream_next_buffer().  Fixed,
with comments to clarify.
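
(Roughly: memset the spare slot's per-buffer data to 0x7f; the helper
and field names below are hypothetical.)

    memset(get_per_buffer_data(stream, spare_index),    /* hypothetical */
           0x7f,
           stream->per_buffer_data_size);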

Retesting with Melanie's latest BHS patch set and my random.sql
(upthread) gives the same system call trace as before.  The intention
of that was to demonstrate what exact sequences
effective_io_concurrency values give.  Now you can also run that with
different values of the new io_combine_limit.  If you run it with
io_combine_limit = '8kB', it looks almost exactly like master ie no
big reads allowed; the only difference is that read_stream.c refuses
to issue advice for strictly sequential reads:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(84,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(84,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(84,0xb0000,0x2000,...)
posix_fadvise(93,0xb0000,0x2000,...)   pread(84,...,8192,0xb0000) = 8192
pread(93,...,8192,0xb0000) = 8192      pread(84,...,8192,0xb2000) = 8192
posix_fadvise(93,0xb2000,0x2000,...)   posix_fadvise(84,0x108000,0x2000,...)
pread(93,...,8192,0xb2000) = 8192      pread(84,...,8192,0x108000) = 8192
posix_fadvise(93,0x108000,0x2000,...)  pread(84,...,8192,0x10a000) = 8192

You wouldn't normally see that though as the default io_combine_limit
would just merge those adjacent reads after the first one triggers
ramp-up towards behaviour C:

 effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(80,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(80,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(80,0xb0000,0x4000,...)
posix_fadvise(93,0xb0000,0x2000,...)   preadv(80,...,2,0xb0000) = 16384
pread(93,...,8192,0xb0000) = 8192      posix_fadvise(80,0x108000,0x4000,...)
posix_fadvise(93,0xb2000,0x2000,...)   preadv(80,...,2,0x108000) = 16384
pread(93,...,8192,0xb2000) = 8192      posix_fadvise(80,0x160000,0x4000,...)

I spent most of the past few days trying to regain some lost
performance.  Thanks to Andres for some key observations and help!
That began with reports from Bilal and Melanie (possibly related to
things Tomas had seen too, not sure) of regressions in all-cached
workloads, which I already improved a bit with the ABC algorithm that
minimised pinning for this case.  That is, if there's no recent I/O so
we reach what I call behaviour A, it should try to do as little magic
as possible.  But it turns out that wasn't enough!  It is very hard to
beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
millions of already-cached blocks, if you have to do exactly the same
work AND extra instructions for management.

There were two layers to the solution in this new version:  First, I
now have a special case in read_stream_next_buffer(), a sort of
open-coded specialisation for behaviour A with no per-buffer data, and
that got rid of most of the regression, but some remained.  Next,
Andres pointed out that ReadBuffer() itself, even though it is now
implemented on top of StartReadBuffers(nblocks = 1), was still beating
my special case code that calls StartReadBuffers(nblocks = 1), even
though it looks about the same, because bufmgr.c was able to inline
and specialise the latter for one block.  To give read_stream.c
that power from its home inside another translation unit, we needed
to export a special case singular StartReadBuffer() (investigation and
patch by Andres, added as co-author).  It just calls the plural
function with nblocks = 1, but it gets inlined.  So now the special
case for behaviour A drills through both layers, and hopefully now
there is no measurable regression... need to test a bit more.  Of
course we can't *beat* the old code in this case, yet, but...
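
The shape of that export is roughly the following (a simplified sketch;
StartReadBuffersImpl() here stands for the shared static inline
implementation that also backs StartReadBuffers()):

    /*
     * Single-block entry point.  Calling the shared inline implementation
     * with nblocks = 1 lets bufmgr.c specialise it for the one-block case,
     * which ReadBuffer() and the read stream fast path both rely on.
     */
    bool
    StartReadBuffer(ReadBuffersOperation *operation,
                    Buffer *buffer,
                    BlockNumber blocknum,
                    int flags)
    {
        int         nblocks = 1;

        return StartReadBuffersImpl(operation, buffer, blocknum, &nblocks, flags);
    }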

(We speculate that a future tree-based buffer mapping table might
allow efficient lookup for a range of block numbers in one go, and
then it could be worth paying the book-keeping costs to find ranges.
Perhaps behaviour A and the associated special case code could then be
deleted, as you'd probably want to use multi-block magic all the time,
for both I/O and mapping table lookups.  Or something like that?)

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
On Wed, Mar 27, 2024 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I got rid of "finished" (now represented by distance == 0, I was
> removing branches and variables).  I got rid of "started", which can
> now be deduced (used for suppressing advice when you're calling
> _next() because you need a block and we need to read it immediately),
> see the function argument suppress_advice.

I started rebasing the sequential scan streaming read user over this
new version, and this change (finished now represented with distance
== 0) made me realize that I'm not sure what to set distance to on
rescan.

For sequential scan, I added a little reset function to the streaming
read API (read_stream_reset()) that just releases all the buffers.
Previously, it set finished to true before releasing the buffers (to
indicate it was done) and then set it back to false after. Now, I'll
set distance to 0 before releasing the buffers and !0 after. I could
just restore whatever value distance had before I set it to 0. Or I
could set it to 1. But, thinking about it, are we sure we want to ramp
up in the same way on rescans? Maybe we want to use some information
from the previous scan to determine what to set distance to? Maybe I'm
overcomplicating it...

> Here is a new proposal for the names, updated in v10:
>
> read_stream_begin_relation()
> read_stream_next_buffer()
> void read_stream_end()

Personally, I'm happy with these.

- Melanie



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Mar 28, 2024 at 9:43 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> For sequential scan, I added a little reset function to the streaming
> read API (read_stream_reset()) that just releases all the buffers.
> Previously, it set finished to true before releasing the buffers (to
> indicate it was done) and then set it back to false after. Now, I'll
> set distance to 0 before releasing the buffers and !0 after. I could
> just restore whatever value distance had before I set it to 0. Or I
> could set it to 1. But, thinking about it, are we sure we want to ramp
> up in the same way on rescans? Maybe we want to use some information
> from the previous scan to determine what to set distance to? Maybe I'm
> overcomplicating it...

I think 1 is good, as a rescan is even more likely to find the pages
in cache, and if that turns out to be wrong it'll very soon adjust.



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Mar 28, 2024 at 10:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I think 1 is good, as a rescan is even more likely to find the pages
> in cache, and if that turns out to be wrong it'll very soon adjust.

Hmm, no, I take that back, it probably won't be (in cache) due to the
strategy/ring...  I see your point now... when I had a separate flag,
the old distance was remembered across resets, but now I'm zapping it.  I was
trying to minimise the number of variables that have to be tested in
the fast path by consolidating.  Hmm, it is signed -- would it be too
weird if we used a negative number for "finished", so we can just flip
it on reset?
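
Just to make that idea concrete, it would amount to something like this
(sketch only):

    /* when the callback reports end-of-stream, remember the old distance: */
    stream->distance = -stream->distance;

    /* ... and on reset, flip it back to resume where we left off: */
    if (stream->distance < 0)
        stream->distance = -stream->distance;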



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Mon, Mar 25, 2024 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >       /*
> > >        * Skip the initial ramp-up phase if the caller says we're going to be
> > >        * reading the whole relation.  This way we start out doing full-sized
> > >        * reads.
> > >        */
> > >       if (flags & PGSR_FLAG_FULL)
> > >               pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> > >       else
> > >               pgsr->distance = 1;
> >
> > Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> > pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> > MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> > perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?
>
> Right, done.

BTW I forgot to mention that in v10 I changed my mind and debugged my
way back to the original coding, which now looks like this:

    /*
     * Skip the initial ramp-up phase if the caller says we're going to be
     * reading the whole relation.  This way we start out assuming we'll be
     * doing full io_combine_limit sized reads (behavior B).
     */
    if (flags & READ_STREAM_FULL)
        stream->distance = Min(max_pinned_buffers, io_combine_limit);
    else
        stream->distance = 1;

It's not OK for distance to exceed max_pinned_buffers.  But if
max_pinned_buffers is huge, remember that the goal here is to reach
'behaviour B', meaning wide read calls but no unnecessary extra
look-ahead beyond what is needed for that, so we also don't want to
exceed io_combine_limit.  Therefore we want the minimum of those two
numbers.  In practice on a non-toy system, that's always going to be
io_combine_limit.  But I'm not sure how many users of READ_STREAM_FULL
there will be, and I am starting to wonder if it's a good name for the
flag, or even generally very useful.  It's sort of saying "I expect to
do I/O, and it'll be sequential, and I won't give up until the end".
But how many users can really make those claims?  pg_prewarm is unusual
in that it contains an explicit assumption that the cache is cold and
we want to warm it up.  But maybe we should just let the adaptive
algorithm do its thing.  It only takes a few reads to go from 1 ->
io_combine_limit.

Thinking harder, if we're going to keep this and not just be fully
adaptive, perhaps there should be a flag READ_STREAM_COLD, where you
hint that the data is not expected to be cached, and you'd combine
that with the _SEQUENTIAL hint.  pg_prewarm hints _COLD | _SEQUENTIAL.
Then the initial distance would be something that uses the flag
combinations to select initial behaviour A, B or C (and we'll quickly
adjust if you're wrong):

    if (!(flags & READ_STREAM_COLD))
        stream->distance = 1;
    else if (flags & READ_STREAM_SEQUENTIAL)
        stream->distance = Min(max_pinned_buffers, io_combine_limit);
    else
        stream->distance = max_pinned_buffers;

But probably almost all users, especially in the executor, haven't
really got much of a clue what they're going to do, so they'd use the
initial starting position of 1 (A) and we'd soon figure it out.  Maybe
overengineering for pg_prewarm is a waste of time and we should just
delete the flag instead and hard code 1.



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Thu, Mar 28, 2024 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... In practice on a non-toy system, that's always going to be
> io_combine_limit.  ...

And to be more explicit about that: you're right that we initialise
max_pinned_buffers such that it's usually at least io_combine_limit,
but then if you have a very small buffer pool it gets clobbered back
down again by LimitAdditionalPins() and may finish up as low as 1.
You're not allowed to pin more than 1/Nth of the whole buffer pool,
where N is approximately max connections (well it's not exactly that
but that's the general idea).  So it's a degenerate case, but it can
happen that max_pinned_buffers is lower than io_combine_limit and then
it's important not to set distance higher or you'd exceed the allowed
limits (or more likely the circular data structure would implode).



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
New version with some cosmetic/comment changes, and Melanie's
read_stream_reset() function merged, as required by her sequential
scan user patch.  I tweaked it slightly: it might as well share code
with read_stream_end().  I think setting distance = 1 is fine for now,
and we might later want to adjust that as we learn more about more
interesting users of _reset().
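
For reference, the reset boils down to something like this (a sketch,
not necessarily the exact merged code):

    void
    read_stream_reset(ReadStream *stream)
    {
        Buffer      buffer;

        /* Stop looking ahead. */
        stream->distance = 0;

        /* Unpin anything that wasn't consumed. */
        while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
            ReleaseBuffer(buffer);

        /* Start off like a freshly initialised stream. */
        stream->distance = 1;
    }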

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right.  In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.
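
In code terms, the corrected condition looks something like this (a
sketch with approximate field names; the surrounding I/O depth checks
are omitted):

    /*
     * Start the pending read if it has reached full size, or if it has
     * reached the current look-ahead distance and nothing is pinned yet
     * (so waiting can't make it grow), or if the stream is finished
     * (distance == 0) and we just need to flush what we have.
     */
    if (stream->pending_read_nblocks == io_combine_limit ||
        (stream->pending_read_nblocks == stream->distance &&
         stream->pinned_buffers == 0) ||
        stream->distance == 0)
        read_stream_start_pending_read(stream, suppress_advice);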

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Small bug fix: the condition in the final test at the end of
> read_stream_look_ahead() wasn't quite right.  In general when looking
> ahead, we don't need to start a read just because the pending read
> would bring us up to stream->distance if submitted now (we'd prefer to
> build it all the way up to size io_combine_limit if we can), but if
> that condition is met AND we have nothing pinned yet, then there is no
> chance for the read to grow bigger by a pinned buffer being consumed.
> Fixed, comment updated.

Oops, I sent the wrong/unfixed version.  This version has the fix
described above.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
> I spent most of the past few days trying to regain some lost
> performance.  Thanks to Andres for some key observations and help!
> That began with reports from Bilal and Melanie (possibly related to
> things Tomas had seen too, not sure) of regressions in all-cached
> workloads, which I already improved a bit with the ABC algorithm that
> minimised pinning for this case.  That is, if there's no recent I/O so
> we reach what I call behaviour A, it should try to do as little magic
> as possible.  But it turns out that wasn't enough!  It is very hard to
> beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
> millions of already-cached blocks, if you have to do exactly the same
> work AND extra instructions for management.

I got a little nerd-sniped by that, and did some micro-benchmarking of 
my own. I tested essentially this, with small values of 'nblocks' so 
that all pages are in cache:

    for (int i = 0; i < niters; i++)
    {
        for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
        {
            buf = ReadBuffer(rel, blkno);
            ReleaseBuffer(buf);
        }
    }

The results look like this (lower is better, test program and script 
attached):

master (213c959a29):            8.0 s
streaming-api v13:              9.5 s

This test exercises just the ReadBuffer() codepath, to check if there is 
a regression there. It does not exercise the new streaming APIs.

So looks like the streaming API patches add some overhead to the simple 
non-streaming ReadBuffer() case. This is a highly narrow 
micro-benchmark, of course, so even though this is a very 
performance-sensitive codepath, we could perhaps accept a small 
regression there. In any real workload, you'd at least need to take the 
buffer lock and read something from the page.

But can we do better? Aside from performance, I was never quite happy 
with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having 
one common struct like BufferManagerRelation that is used in all the 
functions, instead of having separate Relation and SMgrRelation variants 
of every function. But those macros feel a bit hacky and we are not 
consistently using them in all the functions. Why is there no 
ReadBuffer() variant that takes a BufferManagerRelation?

The attached patch expands the use of BufferManagerRelations. The 
principle now is that before calling any bufmgr function, you first 
initialize a BufferManagerRelation struct, and pass that to the 
function. The initialization is done by the InitBMRForRel() or 
InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros. 
They are full-blown functions now because they do more setup upfront 
than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes 
the 'smgr' field, so that you don't need to repeat this pattern in all 
the other functions:

-       /* Make sure our bmr's smgr and persistent are populated. */
-       if (bmr.smgr == NULL)
-       {
-               bmr.smgr = RelationGetSmgr(bmr.rel);
-               bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-       }

Initializing the BufferManagerRelation is still pretty cheap, so it's 
feasible to call it separately for every ReadBuffer() call. But you can 
also reuse it across calls, if you read multiple pages in a loop, for 
example. That saves a few cycles.
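
As a usage sketch of that reuse pattern (the parameter lists below are
guesses, since the real API is only in the attached patch):

    BufferManagerRelation bmr;

    /* Set up once: resolves smgr, persistence etc. up front. */
    InitBMRForRel(&bmr, rel);

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        /* Hypothetical BMR-taking ReadBuffer() variant. */
        Buffer      buf = ReadBufferBMR(&bmr, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL, NULL);

        ReleaseBuffer(buf);
    }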

The microbenchmark results with these changes:

master (213c959a29):            8.0 s
streaming-api v13:              9.5 s
bmr-refactor:                   8.4 s
bmr-refactor, InitBMR once:     7.7 s

The difference between the "bmr-refactor" and "initBMR once" is that in 
the "initBMR once" test, I modified the benchmark to call 
InitBMRForRel() just once, outside the loop. So that shows the benefit 
of reusing the BufferManagerRelation. This refactoring seems to make
the performance regression smaller, even if you don't take advantage of
reusing the BufferManagerRelation.

This also moves things around a little in ReadBuffer_common() (now 
called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls 
PinBufferForBlock() directly. I tried doing that before the other 
refactorings, but that alone didn't seem to make much difference. Not 
sure if it's needed, it's perhaps an orthogonal refactoring, but it's 
included here nevertheless.

What do you think? The first three attached patches are your v13 patches 
unchanged. The fourth is the micro-benchmark I used. The last patch is 
the interesting one.


PS. To be clear, I'm happy with your v13 streaming patch set as it is. I 
don't think this BufferManagerRelation refactoring is a show-stopper.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> master (213c959a29):            8.0 s
> streaming-api v13:              9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which is
defaulting to -O3 by default, and others are using make which gives
you -O2 by default, but at -O2, GCC doesn't inline that
StartReadBuffer specialisation that is used in the "fast path", and
possibly more.  Some of that gap is closed by using
pg_attribute_always_inline.  Clang fails to inline at any level.  So I
should probably use the "always" macro there because that is the
intention.  Still processing the rest of your email...
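
Concretely, that's a one-line change of roughly this shape (sketch; the
exact function it lands on may differ):

-static inline bool
+static pg_attribute_always_inline bool
 StartReadBuffersImpl(ReadBuffersOperation *operation, Buffer *buffers,
                      BlockNumber blockNum, int *nblocks, int flags)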



Re: Streaming I/O, vectored I/O (WIP)

From
Heikki Linnakangas
Date:
On 29/03/2024 09:01, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> master (213c959a29):            8.0 s
>> streaming-api v13:              9.5 s
> 
> Hmm, that's not great, and I think I know one factor that has
> confounded my investigation and the conflicting reports I have
> received from a couple of people: some are using meson, which is
> defaulting to -O3 by default, and others are using make which gives
> you -O2 by default, but at -O2, GCC doesn't inline that
> StartReadBuffer specialisation that is used in the "fast path", and
> possibly more.  Some of that gap is closed by using
> pg_attribute_always_inline.  Clang fails to inline at any level.  So I
> should probably use the "always" macro there because that is the
> intention.  Still processing the rest of your email...

Ah yeah, I also noticed that the inlining didn't happen with some 
compilers and flags. I use a mix of gcc and clang and meson and autoconf 
in my local environment.

The above micro-benchmarks were with meson and gcc -O3. GCC version:

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
1.  I tried out Tomas's suggestion ALTER TABLESPACE ts SET
(io_combine_limit = ...).  I like it, it's simple and works nicely.
Unfortunately we don't have support for units like '128kB' in
reloptions.c, so for now it requires a number of blocks.  That's not
great, and we should probably fix that before merging it, so I'm
leaving that patch (v14-0004) separate, hopefully for later.

2. I also tried Tomas's suggestion of inventing a way to tell
PostgreSQL what the OS readahead window size is.  That allows for
better modelling of whether the kernel is going to consider an access
pattern to be sequential.  Again, the ALTER TABLESPACE version would
probably need unit support.  This is initially just for
experimentation as it came up in discussions of BHS behaviour.  I was
against this idea originally, as it seemed like more complexity to
have to explain and tune, and I'm not sure if it's worth the
trouble... but in fact we are already in the business of second-guessing kernel
logic, so there doesn't seem to be any good reason not to do it a bit
better if we can.  Do you think I have correctly understood what Linux
is doing?  The name I came up with is: effective_io_readahead_window.
I wanted to make clear that it's a property of the system we are
telling it about, not to be confused with our own look-ahead concept
or whatever.  Better names very welcome.  This is also in a separate
patch (v14-0005), left for later.

3.  Another question I wondered about while retesting: does this need
to be so low?  I don't think so, so I've added a patch for that.

src/include/port/pg_iovec.h:#define PG_IOV_MAX Min(IOV_MAX, 32)

Now that I'm not using an array full of arrays of that size, I don't
care so much how big we make that 32 (= 256kB @ 8kB), which clamps
io_combine_limit.  I think 128 (= 1MB @ 8kB) might be a decent
arbitrary number.  Sometimes we use it to size stack arrays, so I
don't want to make it insanely large, but 128 should be fine.  I think
it would be good to be able to at least experiment with up to 1MB (I'm
not saying it's a good idea to do it, who knows?, just that there
isn't a technical reason why not to allow it AFAIK).  FWIW every
system on our target list that has p{read,write}v has IOV_MAX == 1024
(I checked {Free,Net,Open}BSD, macOS, illumos and Linux), so the
Min(IOV_MAX, ...) really only clamps the systems where
pg_{read,write}v fall back to loop-based emulation (Windows, Solaris)
which is fine.

PG_IOV_MAX also affects the routine that initialises new WAL files.  I
don't currently see a downside to doing that in 1MB chunks, as there
was nothing sacred about the previous arbitrary number and the code
deals with short writes by retrying as it should.
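
Concretely, the proposal is just to change the line quoted above to
(sketch):

    #define PG_IOV_MAX Min(IOV_MAX, 128)    /* = 1MB @ 8kB block size */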

4.  I agree with Heikki's complaints about the BMR interface.  It
should be made more consistent and faster.  I didn't want to make all
of those changes touching AMs etc a dependency though, so I spent some
time trying to squeeze out regressions using some of these clues about
calling conventions, likely hints, memory access and batching.  I'm
totally open to improvements and refactoring of that stuff later!

Attached is the version with the best results I've managed to get.  My
test is GCC -O3, pg_prewarm of a table of 220_000_000 integers =
7.6GB, which sometimes comes out around the same (~250ms) on master
and on streaming pg_prewarm v14 on a random cloud ARM box I'm testing
with, but not always; sometimes it's ~5-7ms more.  (Unfortunately I don't
have access to good benchmarking equipment right now, better numbers
welcome.)  Two new ideas:

* give fast path mode a single flag, instead of testing all the
conditions for every block
* give fast path mode a little buffer of future block numbers, so it
can call the callback in batches

I'd tried that batch-calling thing before, and results were
inconclusive, but I think sometimes it helps a bit.  Note that it
replaces the 'unget' thing from before and it is possibly a tiny bit
nicer anyway.
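
To make the batch-calling idea a bit more concrete, here is a sketch
with entirely hypothetical names (the real structure in read_stream.c
differs in detail):

    #define BLOCKNUM_BUFFER_SIZE 16     /* arbitrary for this sketch */

    typedef struct StreamSketch
    {
        BlockNumber blocknums[BLOCKNUM_BUFFER_SIZE];
        int         blocknums_count;    /* how many entries are valid */
        int         blocknums_next;     /* next entry to hand out */
        /* ... the rest of the stream state ... */
    } StreamSketch;

    /*
     * Fast path helper: return the next block number, refilling the small
     * buffer by calling the user callback in batches rather than once per
     * block, which amortises the indirect call overhead.
     */
    static BlockNumber
    sketch_next_blocknum(StreamSketch *stream,
                         BlockNumber (*callback) (void *cb_private),
                         void *cb_private)
    {
        if (stream->blocknums_next == stream->blocknums_count)
        {
            stream->blocknums_count = 0;
            stream->blocknums_next = 0;
            while (stream->blocknums_count < BLOCKNUM_BUFFER_SIZE)
            {
                BlockNumber blocknum = callback(cb_private);

                if (blocknum == InvalidBlockNumber)
                    break;              /* end of stream */
                stream->blocknums[stream->blocknums_count++] = blocknum;
            }
            if (stream->blocknums_count == 0)
                return InvalidBlockNumber;
        }
        return stream->blocknums[stream->blocknums_next++];
    }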

I'm a bit stumped about how to improve this further -- if anyone has
any ideas for further improvements I'm all ears.

Zooming back out of micro-benchmark mode, it must be pretty hard to
see in a real workload that actually does something with the buffers,
like a sequential scan.  Earlier complaints about all-cached
sequential scan regressions were resolved many versions ago AFAIK by
minimising pin count in that case.   I just tried Melanie's streaming
sequential scan patch, with a simple SELECT COUNT(*) WHERE i = -1,
with the same all-cached table of 220 million integers.  Patched
consistently comes out ahead for all-in-kernel-cache, none-in-PG-cache:
~14.7s -> ~14.4s, and all-in-PG-cache: ~13.5s -> ~13.3s (which I don't
have an explanation for).  I don't claim any of that is particularly
scientific, I just wanted to note that single digit numbers of
milliseconds of regression while pinning a million pages is clearly
lost in the noise of other effects once you add in real query
execution.  That's half a dozen nanoseconds per page if I counted
right.

So, I am finally starting to think we should commit this, and decide
which user patches are candidates.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface.  Heikki didn't like it much, and in the end,
neither did I.  I have now removed it, and it seems much better.  No
other significant changes, just parameter types and inlining details.
For example:

 * read_stream_begin_relation() now takes a Relation, like its name says
 * StartReadBuffers()'s operation takes smgr and optional rel
 * ReadBuffer_common() takes smgr and optional rel

ReadBuffer() (which calls ReadBuffer_common() which calls
StartReadBuffer() as before) now shows no regression in a tight loop
over ~1 million already-in-cache pages (something Heikki had observed
before and could only completely fix with a change that affected all
callers).  The same test using read_stream.c is still slightly slower,
~1 million already-in-cache pages going from 301ms -> 308ms, which seems
acceptable to me and could perhaps be chased down with more study of
inlining/specialisation.  As mentioned before, it doesn't seem to be
measurable once you actually do something with the pages.

In some ways BMR was better than the "fake RelationData" concept
(another attempt at wrestling with the relation vs storage duality,
that is, the online vs recovery duality).  But in other ways it was
worse: a weird inconsistent mixture of pass-by-pointer and
pass-by-value interfaces that required several code paths to handle it
being only partially initialised, which turned out to be wasted cycles
implicated in regressions, despite which it is not even very nice to
use anyway.  I'm sure it could be made to work better, but I'm not yet
sure it's really needed.  In later work for recovery I will need to
add a separate constructor read_stream_begin_smgr_something() anyway
for other reasons (multi-relation streaming, different callback) and
perhaps also a separate StartReadBuffersSmgr() if it saves measurable
cycles to strip out branches.  Maybe it was all just premature
pessimisation.

So this is the version I'm going to commit shortly, barring objections.

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread.  They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloptions.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR:  5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
On Tue, Apr 2, 2024 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Here are the remaining patches discussed in this thread.  They give
> tablespace-specific io_combine_limit, effective_io_readahead_window
> (is this useful?), and up-to-1MB io_combine_limit (is this useful?).
> I think the first two would probably require teaching reloptions.c how
> to use guc.c's parse_int() and unit flags, but I won't have time to
> look at that for this release so I'll just leave these here.
>
> On the subject of guc.c, this is a terrible error message... did I do
> something wrong?
>
> postgres=# set io_combine_limit = '42MB';
> ERROR:  5376 8kB is outside the valid range for parameter
> "io_combine_limit" (1 .. 32)

Well, GUC_UNIT_BLOCKS interpolates the block limit into the error
message string (get_config_unit_name()). But, I can't imagine this
error message is clear for any of the GUCs using GUC_UNIT_BLOCKS. I
would think some combination of the two would be helpful, like "43008
kB (5376 blocks) is outside of the valid range for parameter". The
user can check what their block size is. I don't think we need to
interpolate and print the block size in the error message.

On another note, since io_combine_limit, when specified as a size,
rounds up to the nearest multiple of the block size, it might be worth
mentioning this in the io_combine_limit docs at some point. I checked
the docs for another GUC_UNIT_BLOCKS GUC, backend_flush_after, and it
alludes to this.

- Melanie



Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface.  Heikki didn't like it much, and in the end,
> neither did I.  I have now removed it, and it seems much better.  No
> other significant changes, just parameter types and inlining details.
> For example:
>
>  * read_stream_begin_relation() now takes a Relation, like its name says
>  * StartReadBuffers()'s operation takes smgr and optional rel
>  * ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Melanie Plageman
Date:
On Sun, Apr 7, 2024 at 1:33 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, like its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but
> SMgrRelations. So, I created another constructor for the read streams
> which use SMgrRelations instead of Relations. Related patch is
> attached.

This patch LGTM

- Melanie



Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > neither did I.  I have now removed it, and it seems much better.  No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> >  * read_stream_begin_relation() now takes a Relation, like its name says
> >  * StartReadBuffers()'s operation takes smgr and optional rel
> >  * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but
> SMgrRelations. So, I created another constructor for the read streams
> which use SMgrRelations instead of Relations. Related patch is
> attached.

After sending this, I realized that I forgot to add a persistence value
to the new constructor. While working on it I also realized that the
current code sets the persistence in the PinBufferForBlock() function,
and this function is called for each block, which can be costly. So, I
moved setting the persistence out of the PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.
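
For context, the new constructor has roughly this shape (name and
parameter list approximate):

    /*
     * Sketch: like read_stream_begin_relation(), but taking an
     * SMgrRelation plus an explicit persistence, for callers that have no
     * Relation (e.g. recovery and other smgr-only code paths).
     */
    ReadStream *
    read_stream_begin_smgr_relation(int flags,
                                    BufferAccessStrategy strategy,
                                    SMgrRelation smgr,
                                    char smgr_persistence,
                                    ForkNumber forknum,
                                    ReadStreamBlockNumberCB callback,
                                    void *callback_private_data,
                                    size_t per_buffer_data_size);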

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> > >
> > > I had been planning to commit v14 this morning but got cold feet with
> > > the BMR-based interface.  Heikki didn't like it much, and in the end,
> > > neither did I.  I have now removed it, and it seems much better.  No
> > > other significant changes, just parameter types and inlining details.
> > > For example:
> > >
> > >  * read_stream_begin_relation() now takes a Relation, like its name says
> > >  * StartReadBuffers()'s operation takes smgr and optional rel
> > >  * ReadBuffer_common() takes smgr and optional rel
> >
> > Read stream objects can be created only using Relations now. There
> > could be read stream users which do not have a Relation but
> > SMgrRelations. So, I created another constructor for the read streams
> > which use SMgrRelations instead of Relations. Related patch is
> > attached.
>
> After sending this, I realized that I forgot to add a persistence value
> to the new constructor. While working on it I also realized that the
> current code sets the persistence in the PinBufferForBlock() function,
> and this function is called for each block, which can be costly. So, I
> moved setting the persistence out of the PinBufferForBlock() function.
>
> Setting persistence outside of the PinBufferForBlock() function (0001)
> and creating the new constructor that uses SMgrRelations (0002) are
> attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
David Rowley
Date:
I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.
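
For anyone skimming, the reason the clamp is a no-op is that the
narrowing already happened at assignment time.  A sketch, where
max_ios_source stands for whatever wider expression feeds it:

    int16       max_ios = max_ios_source;      /* value silently truncated here */

    max_ios = Min(max_ios, PG_INT16_MAX);       /* too late: always a no-op */

    /* Declaring it int keeps the clamp meaningful: */
    int         max_ios_fixed = max_ios_source;

    max_ios_fixed = Min(max_ios_fixed, PG_INT16_MAX);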

David

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
David Rowley
Date:
On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:
> I've attached a patch with a few typo fixes and what looks like an
> incorrect type for max_ios. It's an int16 and I think it needs to be
> an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has anything.

David



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, May 1, 2024 at 2:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:
> > I've attached a patch with a few typo fixes and what looks like an
> > incorrect type for max_ios. It's an int16 and I think it needs to be
> > an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> > anything when max_ios is int16.
>
> No feedback, so I'll just push this in a few hours unless anyone has anything.

Patch looks correct, thanks.  Please do.  (Sorry, running a bit behind
on email ATM...  I also have a few more typos around here from an
off-list email from Mr Lakhin, will get to that soon...)



Re: Streaming I/O, vectored I/O (WIP)

From
Thomas Munro
Date:
On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Done.  I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already!  I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv()
and pwritev().  That's interesting because it means that there will
soon be no live Unixoid operating systems left without them, and the
fallback code in src/include/port/pg_iovec.h will, in practice, be
only for Windows.  I wondered if that might have implications for how
we code or comment stuff like that, but it still seems to make sense
as we have it.

(I don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)



Re: Streaming I/O, vectored I/O (WIP)

From
Nazir Bilal Yavuz
Date:
Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1] was only partially applied; the top comment was not updated. The
attached patch just updates it.

[1] https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Streaming I/O, vectored I/O (WIP)

From
Bruce Momjian
Date:
On Wed, Jul 10, 2024 at 07:21:59PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
> [1] was only partially applied; the top comment was not updated. The
> attached patch just updates it.
> 
> [1] https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

Thanks, patch applied to master.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.