Re: Don't synchronously wait for already-in-progress IO in read stream - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Don't synchronously wait for already-in-progress IO in read stream
Date
Msg-id u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2@hph6fvdiiria
Whole thread
In response to Re: Don't synchronously wait for already-in-progress IO in read stream  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Don't synchronously wait for already-in-progress IO in read stream
List pgsql-hackers
Hi,

> Attached v5 adds some comments to the tests, fixes a few nits in the
> actual code, and adds a commit to fix what I think is an existing
> off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.


> Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 00bc609529a..0723d4f3dd8 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
>           * must have started out as a miss in PinBufferForBlock(). The other
>           * backend will track this as a 'read'.
>           */
> -        TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> +        TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
>                                            operation->smgr->smgr_rlocator.locator.spcOid,
>                                            operation->smgr->smgr_rlocator.locator.dbOid,
>                                            operation->smgr->smgr_rlocator.locator.relNumber,
> --

Ah, the issue is that we already incremented nblocks_done, right?  Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...



> Subject: [PATCH v5 4/5] Make buffer hit helper
>
> Already two places count buffer hits, requiring quite a few lines of
> code since we do accounting in so many places. Future commits will add
> more locations, so refactor into a helper.
>
> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
> ---
>  src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
>  1 file changed, 56 insertions(+), 55 deletions(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 0723d4f3dd8..399004c2e44 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
>                                        bool *foundPtr, IOContext io_context);
>  static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
>  static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
> +
> +static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
> +                                                      Relation rel, char persistence, SMgrRelation smgr,
> +                                                      ForkNumber forknum, BlockNumber blocknum);
>  static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
>  static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
>                                  IOObject io_object, IOContext io_context);
> @@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
>                    bool *foundPtr)
>  {
>      BufferDesc *bufHdr;
> -    IOContext    io_context;
> -    IOObject    io_object;
>
>      Assert(blockNum != P_NEW);
>
> @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
>              persistence == RELPERSISTENCE_PERMANENT ||
>              persistence == RELPERSISTENCE_UNLOGGED));
>
> -    if (persistence == RELPERSISTENCE_TEMP)
> -    {
> -        io_context = IOCONTEXT_NORMAL;
> -        io_object = IOOBJECT_TEMP_RELATION;
> -    }
> -    else
> -    {
> -        io_context = IOContextForStrategy(strategy);
> -        io_object = IOOBJECT_RELATION;
> -    }
> -

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I don't think that necessarily has to conflict with the goal of this patch -
most of the the deduplicated stuff isn't io_context, so the helper will be
beneficial even if have to pull out the io_context/io_object determination to
the callsites.


>      TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
>                                         smgr->smgr_rlocator.locator.spcOid,
>                                         smgr->smgr_rlocator.locator.dbOid,
> @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
>                                         smgr->smgr_rlocator.backend);
>
>      if (persistence == RELPERSISTENCE_TEMP)
> -    {
>          bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
> -        if (*foundPtr)
> -            pgBufferUsage.local_blks_hit++;
> -    }
>      else
> -    {
>          bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
> -                             strategy, foundPtr, io_context);
> -        if (*foundPtr)
> -            pgBufferUsage.shared_blks_hit++;
> -    }
> +                             strategy, foundPtr, IOContextForStrategy(strategy));
> +
>      if (rel)
>      {
>          /*

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).



> +/*
> + * We track various stats related to buffer hits. Because this is done in a
> + * few separate places, this helper exists for convenience.
> + */
> +static pg_attribute_always_inline void
> +CountBufferHit(BufferAccessStrategy strategy,
> +               Relation rel, char persistence, SMgrRelation smgr,
> +               ForkNumber forknum, BlockNumber blocknum)
> +{
> +    IOContext    io_context;
> +    IOObject    io_object;
> +
> +    if (persistence == RELPERSISTENCE_TEMP)
> +    {
> +        io_context = IOCONTEXT_NORMAL;
> +        io_object = IOOBJECT_TEMP_RELATION;
> +    }
> +    else
> +    {
> +        io_context = IOContextForStrategy(strategy);
> +        io_object = IOOBJECT_RELATION;
> +    }
> +
> +    TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
> +                                      blocknum,
> +                                      smgr->smgr_rlocator.locator.spcOid,
> +                                      smgr->smgr_rlocator.locator.dbOid,
> +                                      smgr->smgr_rlocator.locator.relNumber,
> +                                      smgr->smgr_rlocator.backend,
> +                                      true);
> +
> +    if (persistence == RELPERSISTENCE_TEMP)
> +        pgBufferUsage.local_blks_hit += 1;
> +    else
> +        pgBufferUsage.shared_blks_hit += 1;
> +
> +    if (rel)
> +        pgstat_count_buffer_hit(rel);
> +
> +    pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
> +
> +    if (VacuumCostActive)
> +        VacuumCostBalance += VacuumCostPageHit;
> +}

I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.



> From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 23 Jan 2026 14:00:31 -0500
> Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When a backend attempts to start a read on a buffer and finds that I/O
> is already in progress, it previously waited for that I/O to complete
> before initiating reads for any other buffers. Although the backend must
> still wait for the I/O to finish when later acquiring the buffer, it
> should not need to wait at read start time. Other buffers may be
> available for I/O, and in some workloads this waiting significantly
> reduces concurrency.
>
> For example, index scans may repeatedly request the same heap block. If
> the backend waits each time it encounters an in-progress read, the
> access pattern effectively degenerates into synchronous I/O. By
> introducing the concept of foreign I/O operations, a backend can record
> the buffer’s wait reference and defer waiting until WaitReadBuffers()
> when it actually acquires the buffer.
>
> In rare cases, a backend may still need to wait when starting a read if
> it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
> but before the buffer descriptor’s wait reference has been set. Such
> windows should be brief and uncommon.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv

> +/*
> + * In AsyncReadBuffers(), when preparing a buffer for reading and setting
> + * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
> + * already contain the desired block. AsyncReadBuffers() must distinguish
> + * between these cases (and the case where it should initiate I/O) so it can
> + * mark an in-progress buffer as foreign I/O rather than waiting on it.
> + */
> +typedef enum PrepareReadBuffer_Status
> +{
> +    READ_BUFFER_ALREADY_DONE,
> +    READ_BUFFER_IN_PROGRESS,
> +    READ_BUFFER_READY_FOR_IO,
> +} PrepareReadBuffer_Status;

I don't personally like mixing underscore and camel case naming within one
name.

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS


>  /*
>   * We track various stats related to buffer hits. Because this is done in a
>   * few separate places, this helper exists for convenience.
> @@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>               * b) reports some time as waiting, even if we never waited
>               *
>               * we first check if we already know the IO is complete.
> +             *
> +             * Note that operation->io_return is uninitialized for foreign IO,
> +             * so we cannot count that wait time.
>               */

I'm confused - your comment says we can't count wait time with a foreign IO,
but then oes on to count foreign IO time?  The lack of io_return just means we
can't do  the cheaper pre-check for PGAIO_RS_UNKNOWN, no?


> -            if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
> +            if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
>                  !pgaio_wref_check_done(&operation->io_wref))
>              {
>                  instr_time    io_start = pgstat_prepare_io_time(track_io_timing);
> @@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
>                  Assert(pgaio_wref_check_done(&operation->io_wref));
>              }
>
> -            /*
> -             * We now are sure the IO completed. Check the results. This
> -             * includes reporting on errors if there were any.
> -             */
> -            ProcessReadBuffersResult(operation);
> +            if (unlikely(operation->foreign_io))
> +            {
> +                Buffer        buffer = operation->buffers[operation->nblocks_done];
> +                BufferDesc *desc = BufferIsLocal(buffer) ?
> +                    GetLocalBufferDescriptor(-buffer - 1) :
> +                    GetBufferDescriptor(buffer - 1);
> +                uint32        buf_state = pg_atomic_read_u64(&desc->state);
> +
> +                if (buf_state & BM_VALID)
> +                {
> +                    operation->nblocks_done += 1;
> +                    Assert(operation->nblocks_done <= operation->nblocks);
> +
> +                    CountBufferHit(operation->strategy,
> +                                   operation->rel, operation->persistence,
> +                                   operation->smgr, operation->forknum,
> +                                   operation->blocknum + operation->nblocks_done - 1);

Probably worth including a comment explaining why we count this as a hit. IIRC
earlier versions had such a comment.


> +/*
> + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> + * avoid an external function call.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> +                            Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.



> +{
> +    BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
> +    uint64        buf_state = pg_atomic_read_u64(&desc->state);
> +
> +    /* Already valid, no work to do */
> +    if (buf_state & BM_VALID)
> +    {
> +        pgaio_wref_clear(&operation->io_wref);
> +        return READ_BUFFER_ALREADY_DONE;
> +    }

Is this reachable for local buffers?


> +    pgaio_submit_staged();
> +
> +    if (pgaio_wref_valid(&desc->io_wref))
> +    {
> +        operation->io_wref = desc->io_wref;
> +        operation->foreign_io = true;
> +        return READ_BUFFER_IN_PROGRESS;
> +    }
> +
> +    /*
> +     * While it is possible for a buffer to have been prepared for IO but not
> +     * yet had its wait reference set, there's no way for us to know that for
> +     * temporary buffers. Thus, we'll prepare for own IO on this buffer.
> +     */
> +    return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?


> +/*
> + * Try to start IO on the first buffer in a new run of blocks. If AIO is in
> + * progress, be it in this backend or another backend, we just associate the
> + * wait reference with the operation and wait in WaitReadBuffers(). This turns
> + * out to be important for performance in two workloads:
> + *
> + * 1) A read stream that has to read the same block multiple times within the
> + *    readahead distance. This can happen e.g. for the table accesses of an
> + *    index scan.
> + *
> + * 2) Concurrent scans by multiple backends on the same relation.
> + *
> + * If we were to synchronously wait for the in-progress IO, we'd not be able
> + * to keep enough I/O in flight.
> + *
> + * If we do find there is ongoing I/O for the buffer, we set up a 1-block
> + * ReadBuffersOperation that WaitReadBuffers then can wait on.
> + *
> + * It's possible that another backend has started IO on the buffer but not yet
> + * set its wait reference. In this case, we have no choice but to wait for
> + * either the wait reference to be valid or the IO to be done.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> +                       Buffer buffer)
> +{

I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"?  Or ...


> +    uint64        buf_state;
> +    BufferDesc *desc;
> +
> +    if (BufferIsLocal(buffer))
> +        return PrepareNewLocalReadBufferIO(operation, buffer);
> +
> +    ResourceOwnerEnlarge(CurrentResourceOwner);
> +    desc = GetBufferDescriptor(buffer - 1);
> +
> +    for (;;)
> +    {
> +        buf_state = LockBufHdr(desc);

Perhaps worth adding an
   Assert(buf_state & BM_TAG_VALID)?


> +        /* Already valid, no work to do */
> +        if (buf_state & BM_VALID)
> +        {
> +            UnlockBufHdr(desc);
> +            pgaio_wref_clear(&operation->io_wref);

Perhaps we should clear &operation->io_wref once at the start? Because right
now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS &&
&desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a
wref set.  It seems either we should just touch &operation->io_wref if
  BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref)
or we should reliably do it.



> +            return READ_BUFFER_ALREADY_DONE;
> +        }
> +
> +        if (buf_state & BM_IO_IN_PROGRESS)
> +        {
> +            /* Join existing read */
> +            if (pgaio_wref_valid(&desc->io_wref))
> +            {
> +                operation->io_wref = desc->io_wref;
> +                operation->foreign_io = true;
> +                UnlockBufHdr(desc);
> +                return READ_BUFFER_IN_PROGRESS;
> +            }
> +
> +            /*
> +             * If the wait ref is not valid but the IO is in progress, someone
> +             * else started IO but hasn't set the wait ref yet. We have no
> +             * choice but to wait until the IO completes.
> +             */
> +            UnlockBufHdr(desc);
> +            pgaio_submit_staged();
> +            WaitIO(desc);
> +            continue;

Before this commit there was an explanation for this submit:

-    /*
-     * If this backend currently has staged IO, we need to submit the pending
-     * IO before waiting for the right to issue IO, to avoid the potential for
-     * deadlocks (and, more commonly, unnecessary delays for other backends).
-     */

Seems that vanished?



> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only be
> + * used for additional adjacent buffers following the head buffer in a new IO.
> + *
> + * Returns true if the buffer was successfully prepared for IO and false if it
> + * is rejected and the read IO should not include this buffer.
> + */
> +static bool
> +PrepareAdditionalReadBuffer(Buffer buffer)

I think it'd be good to mention that this may never wait for IO or such to
avoid deadlocks.



> +    /* Check if we can start IO on the first to-be-read buffer */
> +    if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
> +        READ_BUFFER_READY_FOR_IO)
> +    {

I don't love this < bit. For one there's no mention in
PrepareReadBuffer_Status mentioning that the numerical order is important. Any
reason to not just test != READ_BUFFER_READY_FOR_IO?

The assignment inside the if also looks somewhat awkward. For while() loops
there's often not really a better way to write it, but here you could just as
well do the status assignment in a line before.


> +        pgaio_io_release(ioh);
> +        *nblocks_progress = 1;
> +        if (status == READ_BUFFER_ALREADY_DONE)
> +        {
> +            /*
> +             * Someone else has already completed this block, we're done.
> +             *
> +             * When IO is necessary, ->nblocks_done is updated in
> +             * ProcessReadBuffersResult(), but that is not called if no IO is
> +             * necessary. Thus update here.
> +             */
> +            operation->nblocks_done += 1;
> +            Assert(operation->nblocks_done <= operation->nblocks);
> +
> +            /*
> +             * Report and track this as a 'hit' for this backend, even though
> +             * it must have started out as a miss in PinBufferForBlock(). The
> +             * other backend will track this as a 'read'.
> +             */
> +            CountBufferHit(operation->strategy,
> +                           operation->rel, operation->persistence,
> +                           operation->smgr, operation->forknum,
> +                           operation->blocknum + operation->nblocks_done - 1);
> +            return false;
> +        }
> +
> +        /* The IO is already in-progress */
> +        Assert(status == READ_BUFFER_IN_PROGRESS);
> +        CheckReadBuffersOperation(operation, false);

I was about to suggest that there should be a CheckReadBuffersOperation() for
both returns here, but there already are CheckReadBuffersOperation after calls
to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be
removed.



>      /*
> -     * Check if we can start IO on the first to-be-read buffer.
> -     *
> -     * If an I/O is already in progress in another backend, we want to wait
> -     * for the outcome: either done, or something went wrong and we will
> -     * retry.
> +     * How many neighboring-on-disk blocks can we scatter-read into other
> +     * buffers at the same time?  In this case we don't wait if we see an I/O
> +     * already in progress.  We already set BM_IO_IN_PROGRESS for the head
> +     * block, so we should get on with that I/O as soon as possible.
>       */
> -    if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> +    for (int i = nblocks_done + 1; i < operation->nblocks; i++)
>      {
> -        /*
> -         * Someone else has already completed this block, we're done.
> -         *
> -         * When IO is necessary, ->nblocks_done is updated in
> -         * ProcessReadBuffersResult(), but that is not called if no IO is
> -         * necessary. Thus update here.
> -         */
> -        operation->nblocks_done += 1;
> -        *nblocks_progress = 1;
> -
> -        pgaio_io_release(ioh);
> -        pgaio_wref_clear(&operation->io_wref);
> -        did_start_io = false;
> +        if (!PrepareAdditionalReadBuffer(buffers[i]))
> +            break;
> +        /* Must be consecutive block numbers. */
> +        Assert(BufferGetBlockNumber(buffers[i - 1]) ==
> +               BufferGetBlockNumber(buffers[i]) - 1);

Seems this assert could stand to be before the PrepareAdditionalReadBuffer(),
as it better hold true before we try to BM_IO_IN_PROGRESS?

I realize this is old, but since you're whacking this around...



> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 4017896f951..f85a9acc6ac 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -147,6 +147,8 @@ struct ReadBuffersOperation
>      int            flags;
>      int16        nblocks;
>      int16        nblocks_done;
> +    /* true if waiting on another backend's IO */
> +    bool        foreign_io;
>      PgAioWaitRef io_wref;
>      PgAioReturn io_return;
>  };

This adds an alignment-padding hole between nblocks_done and io_wref.  Read
stream can allocate quite a few of these, so it's probably worth avoiding?

There's a padding hole between persistence and forknum, but that seems a bit
ugly to utilize. A bit less ugly if we swapped forknum and persistence.

Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and
there are just four flag bits right now).

But perhaps it's also not worth worrying about right now.


[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Assertion failure in hash_kill_items()
Next
From: Nathan Bossart
Date:
Subject: Re: table AM option passing