Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250324002939.5c.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
commit 247ce06b wrote:
> +            pgaio_io_reopen(ioh);
> +
> +            /*
> +             * To be able to exercise the reopen-fails path, allow injection
> +             * points to trigger a failure at this point.
> +             */
> +            pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> +
> +            error_errno = 0;
> +            error_ioh = NULL;
> +
> +            /*
> +             * We don't expect this to ever fail with ERROR or FATAL, no need
> +             * to keep error_ioh set to the IO.
> +             * pgaio_io_perform_synchronously() contains a critical section to
> +             * ensure we don't accidentally fail.
> +             */
> +            pgaio_io_perform_synchronously(ioh);

A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
to reproduce a problem, because pgaio_io_call_inj() does nothing in many
builds, and pgaio_io_perform_synchronously() starts by entering a critical
section.

On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> Attached v2.11

> Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd

> +int
> +FileStartReadV(PgAioHandle *ioh, File file,
> +               int iovcnt, off_t offset,
> +               uint32 wait_event_info)
> +{
> +    int            returnCode;
> +    Vfd           *vfdP;
> +
> +    Assert(FileIsValid(file));
> +
> +    DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> +               file, VfdCache[file].fileName,
> +               (int64) offset,
> +               iovcnt));
> +
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    vfdP = &VfdCache[file];
> +
> +    pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);

FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
functions so named.  The "start" verb sounds to me like unconditional
PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
you see it?

> +/*
> + * AIO error reporting callback for mdstartreadv().
> + *
> + * Errors are encoded as follows:
> + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0

I recommend replacing "errno != 0" with either "that errno" or "errno ==
error_data".


> Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design

Ready for commit apart from some trivia:

> +if (ioret.result.status == PGAIO_RS_ERROR)
> +    pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);

I think ioret and aio_ret are supposed to be the same object.  If that's
right, change one of the names.  Likewise elsewhere in this file.

> +The central API piece for postgres' AIO abstraction are AIO handles. To
> +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> +then "defined", i.e. associate an IO operation with the handle.

s/"defined"/"define" it/ or similar

> +The "solution" to this the ability to associate multiple completion callbacks

s/this the/this is the/


> Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well

> @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
>          Assert(refcount > 0);
>          if (refcount != 1)
>              return false;
> +
> +        /*
> +         * Check that the AIO subsystem doesn't have a pin. Likely not
> +         * possible today, but better safe than sorry.
> +         */
> +        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> +        buf_state = pg_atomic_read_u32(&bufHdr->state);
> +        refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> +        Assert(refcount > 0);
> +        if (refcount != 1)
> +            return false;
> +

LockBufferForCleanup() should get code like this
ConditionalLockBufferForCleanup() code, either now or when "not possible
today" ends.  Currently, it just assumes all local buffers are
cleanup-lockable:

    /* Nobody else to wait for */
    if (BufferIsLocal(buffer))
        return;

> @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
>  
>      buf_state = pg_atomic_read_u32(&bufHdr->state);
>  
> -    if (check_unreferenced && LocalRefCount[bufid] != 0)
> +    /*
> +     * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> +     * itself, as the latter is used to represent a pin by the AIO subsystem.
> +     * This can happen if AIO is initiated and then the query errors out.
> +     */
> +    if (check_unreferenced &&
> +        (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
>          elog(ERROR, "block %u of %s is still referenced (local %u)",

I didn't write a test to prove it, but I'm suspecting we'll reach the above
ERROR with this sequence:

  CREATE TEMP TABLE foo ...;
  [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing]
  DROP TABLE foo;

DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
the particular rel) before InvalidateLocalBuffer().  Or use something like the
logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
corresponding bufmgr code.  I think that bufmgr ERROR is unreachable, since
only a private refcnt triggers that bufmgr ERROR.  Is there something
preventing the localbuf error from being a problem?  (This wouldn't require
changes to the current patch; responsibility would fall in a bufmgr AIO
patch.)


> Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support

(Still reviewing this and later patches, but incidental observations follow.)

> +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> +                          bool failed, bool is_temp)
> +{
...
> +    PgAioResult result;
...
> +    result.status = PGAIO_RS_OK;
...
> +    return result;

gcc 14.2.0 -Werror gives me:

  bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized]

Zeroing the unset fields silenced it:

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
     char       *bufdata = BufferGetBlock(buffer);
-    PgAioResult result;
+    PgAioResult result = { .status = PGAIO_RS_OK };
     uint32        set_flag_bits;
@@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
 
-    result.status = PGAIO_RS_OK;
-
     /* check for garbage data */


> Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO

> @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
>  static void
>  read_stream_look_ahead(ReadStream *stream)
>  {
> +    /*
> +     * Allow amortizing the cost of submitting IO over multiple IOs. This
> +     * requires that we don't do any operations that could lead to a deadlock
> +     * with staged-but-unsubmitted IO.
> +     */
> +    pgaio_enter_batchmode();

We call read_stream_get_block() while in batchmode, so the stream callback
needs to be ready for that.  A complicated case is
collect_corrupt_items_read_stream_next_block(), which may do its own buffer
I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
recipe for corner cases reaching ERROR "starting batch while batch already in
progress".  Are there mitigating factors?


> Subject: [PATCH v2.11 17/27] aio: Add test_aio module

> +    # verify that page verification errors are detected even as part of a
> +    # shortened multi-block read (tbl_corr, block 1 is tbl_corred)

Is "tbl_corred" a typo of something?

> --- /dev/null
> +++ b/src/test/modules/test_aio/test_aio.c
> @@ -0,0 +1,657 @@
> +/*-------------------------------------------------------------------------
> + *
> + * delay_execution.c
> + *        Test module to allow delay between parsing and execution of a query.
> + *
> + * The delay is implemented by taking and immediately releasing a specified
> + * advisory lock.  If another process has previously taken that lock, the
> + * current process will be blocked until the lock is released; otherwise,
> + * there's no effect.  This allows an isolationtester script to reliably
> + * test behaviors where some specified action happens in another backend
> + * between parsing and execution of any desired query.
> + *
> + * Copyright (c) 2020-2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + *      src/test/modules/delay_execution/delay_execution.c

Header comment is surviving from copy-paste of delay_execution.c.

> +     * Tor tests we don't want the resowner release preventing us from

s/Tor/For/



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Proposal - Allow extensions to set a Plan Identifier
Next
From: Richard Guo
Date:
Subject: Re: Fix infinite loop from setting scram_iterations to INT_MAX