Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250325155808.f7.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Tue, Mar 25, 2025 at 11:26:14AM -0400, Andres Freund wrote:
> On 2025-03-25 06:33:21 -0700, Noah Misch wrote:
> > On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> > > On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > > > (We may be due for a test mode that does smgrreleaseall() at every
> > > > CHECK_FOR_INTERRUPTS()?)
> > >
> > > I suspect we are. I'm a bit afraid of even trying...
> > >
> > > ...
> > >
> > > It's extremely slow - but at least the main regression as well as the aio tests pass!
> >
> > One less thing!
> 
> Unfortunately I'm now doubting the thoroughness of my check - while I made
> every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we
> trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if
> log_min_messages <= DEBUGN...
> 
> I'll try that in a bit.

While having nagging thoughts that we might be releasing FDs before io_uring
gets them into kernel custody, I tried this hack to maximize FD turnover:

static void
ReleaseLruFiles(void)
{
#if 0
    while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds)
    {
        if (!ReleaseLruFile())
            break;
    }
#else
    while (ReleaseLruFile())
        ;
#endif
}

"make check" with default settings (io_method=worker) passes, but
io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two
runs.  s/#if 0/#if 1/ (restore normal FD turnover) removes the failures.
Here's the richer of the two diffs:

diff -U3 src/test/regress/expected/sanity_check.out src/test/regress/results/sanity_check.out
--- src/test/regress/expected/sanity_check.out  2024-10-24 12:43:25.741817594 -0700
+++ src/test/regress/results/sanity_check.out   2025-03-25 08:27:44.875151566 -0700
@@ -1,4 +1,7 @@
 VACUUM;
+ERROR:  index "pg_enum_oid_index" contains corrupted page at block 2
+HINT:  Please REINDEX it.
+CONTEXT:  while vacuuming index "pg_enum_oid_index" of relation "pg_catalog.pg_enum"
 --
 -- Sanity check: every system catalog that has OIDs should have
 -- a unique index on OID.  This ensures that the OIDs will be unique,
diff -U3 src/test/regress/expected/oidjoins.out src/test/regress/results/oidjoins.out
--- src/test/regress/expected/oidjoins.out      2023-07-06 19:58:07.686364439 -0700
+++ src/test/regress/results/oidjoins.out       2025-03-25 08:28:02.584335458 -0700
@@ -233,6 +233,8 @@
 NOTICE:  checking pg_policy {polrelid} => pg_class {oid}
 NOTICE:  checking pg_policy {polroles} => pg_authid {oid}
 NOTICE:  checking pg_default_acl {defaclrole} => pg_authid {oid}
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,5)",0)
+WARNING:  FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,7)",402654464)
 NOTICE:  checking pg_default_acl {defaclnamespace} => pg_namespace {oid}
 NOTICE:  checking pg_init_privs {classoid} => pg_class {oid}
 NOTICE:  checking pg_seclabel {classoid} => pg_class {oid}

> > > Because the end state varies, depending on the number of previously staged
> > > IOs, the IO method and whether batchmode is enabled, I think it's better if
> > > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> > > *not* aligned with an internal state name.  It will just mislead readers to
> > > think that there's a deterministic mapping when that does not exist.
> >
> > That's fair.  Could we provide the mapping in a comment, something like the
> > following?
> 
> Yes!
> 
> I wonder if it should also be duplicated or referenced elsewhere, although I
> am not sure where precisely.

I considered the README.md also, but adding that wasn't an obvious win.

> > --- a/src/include/storage/aio_internal.h
> > +++ b/src/include/storage/aio_internal.h
> > @@ -34,5 +34,10 @@
> >   * linearly through all states.
> >   *
> > - * State changes should all go through pgaio_io_update_state().
> > + * State changes should all go through pgaio_io_update_state().  Its callers
> > + * use these naming conventions:
> > + *
> > + * - A "start" function (e.g. FileStartReadV()) moves an IO from
> > + *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
> > + *   PGAIO_HS_COMPLETED_LOCAL.
> >   */
> >  typedef enum PgAioHandleState
> 
> One detail I'm not sure about: The above change is correct, but perhaps a bit
> misleading, because we can actually go "back" to IDLE. Not sure how to best
> phrase that though.

Not sure either.  Maybe the above could change to "to PGAIO_HS_STAGED or any
subsequent state" and the comment at PGAIO_HS_STAGED could say like "Once in
this state, concurrent activity could move the IO all the way to
PGAIO_HS_COMPLETED_LOCAL and recycle it back to IDLE."

> > > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> > > naming that I just stopped seeing.
> 
> I assume you're on board with renaming _io_prep* to _io_start_*?

Yes.

> > > I'll try to think more about this, perhaps I can make myself see your POV
> > > more.

> > CheckBufferIsPinnedOnce(Buffer buffer)
> > {
> >     if (BufferIsLocal(buffer))
> >     {
> >         if (LocalRefCount[-buffer - 1] != 1)
> >             elog(ERROR, "incorrect local pin count: %d",
> >                  LocalRefCount[-buffer - 1]);
> >     }
> >     else
> >     {
> >         if (GetPrivateRefCount(buffer) != 1)
> >             elog(ERROR, "incorrect local pin count: %d",
> >                  GetPrivateRefCount(buffer));
> >     }
> > }
> 
> Pretty random orthogonal thought, that I was reminded of by the above code
> snippet:
> 
> It sure seems we should at some point get rid of LocalRefCount[] and just use
> the GetPrivateRefCount() infrastructure for both shared and local buffers.  I
> don't think the GetPrivateRefCount() infrastructure cares about
> local/non-local, leaving a few asserts aside.  If we do that, and start to use
> BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of
> differences between shared and local buffers would be a lot smaller.

That sounds promising.

> > > > Like the comment, I expect it's academic today.  I expect it will stay
> > > > academic.  Anything that does a cleanup will start by reading the buffer,
> > > > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > > > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > > > or replacing them with a comment (like the present paragraph)?
> > >
> > > I think we'll need an expanded version of what I suggest once we have writes -
> > > but as you say, it shouldn't matter as long as we only have reads. So I think
> > > moving the relevant changes, with adjusted caveats, to the bufmgr: write
> > > change makes sense.
> >
> > Moving those changes works for me.  I'm not currently seeing the need under
> > writes, but that may get clearer upon reaching those patches.
> 
> FWIW, I don't think it's currently worth looking at the write side in detail,

Got it.  (I meant I didn't see a first-principles need, not that I had deduced
lack of need from a specific writes implementation.)

> > > Do you think it's worth mentioning the above workaround? I'm mildly inclined
> > > that not.
> >
> > Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
> > exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
> > one)".  It's also fine as-is, though.
> 
> I updated it to:
> 
>  * b) start another batch (without first exiting batchmode and re-entering
>  *    before returning)

That's good.



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: AIO v2.5
Next
From: Álvaro Herrera
Date:
Subject: simplifying grammar for ALTER CONSTRAINT .. SET [NO] INHERIT