Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250325133321.54.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote:
> On 2025-03-24 17:45:37 -0700, Noah Misch wrote:
> > (We may be due for a test mode that does smgrreleaseall() at every
> > CHECK_FOR_INTERRUPTS()?)
> 
> I suspect we are. I'm a bit afraid of even trying...
> 
> ...
> 
> It's extremely slow - but at least the main regression as well as the aio tests pass!

One less thing!

> > > I however don't particularly like the *start* or *prep* names, I've gone back
> > > and forth on those a couple times. I could see "begin" work uniformly across
> > > those.
> > 
> > For ease of new readers understanding things, I think it helps for the
> > functions that advance PgAioHandleState to have names that use words from
> > PgAioHandleState.  It's one less mapping to get into the reader's head.
> 
> Unfortunately for me it's kind of the opposite in this case, see below.
> 
> 
> > "Begin", "Start" and "prep" are all outside that taxonomy, making the reader
> > learn how to map them to the taxonomy.  What reward does the reader get at the
> > end of that exercise?  I'm not seeing one, but please do tell me what I'm
> > missing here.
> 
> Because the end state varies, depending on the number of previously staged
> IOs, the IO method and whether batchmode is enabled, I think it's better if
> the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is
> *not* aligned with an internal state name.  It will just mislead readers to
> think that there's a deterministic mapping when that does not exist.

That's fair.  Could we provide the mapping in a comment, something like the
following?

--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -34,5 +34,10 @@
  * linearly through all states.
  *
- * State changes should all go through pgaio_io_update_state().
+ * State changes should all go through pgaio_io_update_state().  Its callers
+ * use these naming conventions:
+ *
+ * - A "start" function (e.g. FileStartReadV()) moves an IO from
+ *   PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most
+ *   PGAIO_HS_COMPLETED_LOCAL.
  */
 typedef enum PgAioHandleState

> That's not an excuse for pgaio_io_prep* though, that's a pointlessly different
> naming that I just stopped seeing.
> 
> I'll try to think more about this, perhaps I can make myself see your POV
> more.

> > As the patch stands, LockBufferForCleanup() can succeed when
> > ConditionalLockBufferForCleanup() would have returned false.
> 
> That's already true today, right? In master ConditionalLockBufferForCleanup()
> for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't.

I'm finding a LocalRefCount check under LockBufferForCleanup:

LockBufferForCleanup(Buffer buffer)
{
...
    CheckBufferIsPinnedOnce(buffer);

CheckBufferIsPinnedOnce(Buffer buffer)
{
    if (BufferIsLocal(buffer))
    {
        if (LocalRefCount[-buffer - 1] != 1)
            elog(ERROR, "incorrect local pin count: %d",
                 LocalRefCount[-buffer - 1]);
    }
    else
    {
        if (GetPrivateRefCount(buffer) != 1)
            elog(ERROR, "incorrect local pin count: %d",
                 GetPrivateRefCount(buffer));
    }
}

> > Like the comment, I expect it's academic today.  I expect it will stay
> > academic.  Anything that does a cleanup will start by reading the buffer,
> > which will resolve any refcnt the AIO subsystems holds for a read.  If there's
> > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on
> > that.  How about just removing the ConditionalLockBufferForCleanup() changes
> > or replacing them with a comment (like the present paragraph)?
> 
> I think we'll need an expanded version of what I suggest once we have writes -
> but as you say, it shouldn't matter as long as we only have reads. So I think
> moving the relevant changes, with adjusted caveats, to the bufmgr: write
> change makes sense.

Moving those changes works for me.  I'm not currently seeing the need under
writes, but that may get clearer upon reaching those patches.

> > > > /* ---
> > > >  * Opt-in to using AIO batchmode.
> > > >  *
> > > >  * Submitting IO in larger batches can be more efficient than doing so
> > > >  * one-by-one, particularly for many small reads. It does, however, require
> > > >  * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
> > > >  * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
> > > >  * a) block without first calling pgaio_submit_staged(), unless a
> > > >  *    to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
> > > >  *    never acquired in a nested fashion
> > > >  * b) directly or indirectly start another batch pgaio_enter_batchmode()
> > 
> > I think a callback could still do:
> > 
> >   pgaio_exit_batchmode()
> >   ... arbitrary code that might reach pgaio_enter_batchmode() ...
> >   pgaio_enter_batchmode()
> 
> Yea - but I somehow doubt there are many cases where it makes sense to
> deep-queue IOs within the callback. The cases I can think of are things like
> ensuring the right VM buffer is in s_b.  But if it turns out to be necessary,
> what you seuggest would be an out.

I don't foresee a callback specifically wanting to batch, but callbacks might
call into other infrastructure that can elect to batch.  The exit+reenter
pattern would be better than adding no-batch options to other infrastructure.

> Do you think it's worth mentioning the above workaround? I'm mildly inclined
> that not.

Perhaps not in that detail, but perhaps we can rephrase (b) to not imply
exit+reenter is banned.  Maybe "(b) start another batch (without first exiting
one)".  It's also fine as-is, though.

> If it turns out to be actually useful to do nested batching, we can change it
> so that nested batching *is* allowed, that'd not be hard.

Good point.

> I'm ok with all of these. In order of preference:
> 
> 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK
> 2) READ_STREAM_BATCHMODE_AWARE
> 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE

Same for me.



pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: Improve CRC32C performance on SSE4.2
Next
From: Peter Eisentraut
Date:
Subject: Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints