Thread: Re: AIO v2.2
Hi,

On 2025-01-07 17:09:58 +0200, Heikki Linnakangas wrote:
> On 01/01/2025 06:03, Andres Freund wrote:
> > Hi,
> >
> > Attached is a new version of the AIO patchset.
>
> I haven't gone through it all yet, but some comments below.

Thanks!

> > The biggest changes are:
> >
> > - The README has been extended with an overview of the API. I think it gives a
> >   good overview of how the API fits together. It'd be very good to get
> >   feedback from folks that aren't as familiar with AIO, I can't really see
> >   what's easy/hard anymore.
>
> Thanks, the README is super helpful! I was overwhelmed by all the new
> concepts before, now it all makes much more sense.
>
> Now that it's all laid out more clearly, I see how many different concepts
> and states there really are:
>
> - For a single IO, there is an "IO handle", "IO references", and an "IO
>   return". You first allocate an IO handle (PgAioHandle), and then you get a
>   reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
>
> - An IO handle has eight different states (PgAioHandleState).
>
> I'm sure all those concepts exist for a reason. But still I wonder: can we
> simplify?

Probably, but it's not exactly obvious to me where.

The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress", i.e.
guarantee that pgaio_io_get() will return something without blocking forever.

> pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
> pgaio_io_acquire() or similar. "get" also feels very innocent, even though
> it may wait for previous IO to finish. Especially when pgaio_io_get_ref()
> actually is innocent.

WFM.

> > typedef enum PgAioHandleState
> > {
> > 	/* not in use */
> > 	AHS_IDLE = 0,
> >
> > 	/* returned by pgaio_io_get() */
> > 	AHS_HANDED_OUT,
> >
> > 	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> > 	AHS_DEFINED,
> >
> > 	/* subject's prepare() callback has been called */
> > 	AHS_PREPARED,
> >
> > 	/* IO has been submitted and is being executed */
> > 	AHS_IN_FLIGHT,
> >
> > 	/* IO finished, but result has not yet been processed */
> > 	AHS_REAPED,
> >
> > 	/* IO completed, shared completion has been called */
> > 	AHS_COMPLETED_SHARED,
> >
> > 	/* IO completed, local completion has been called */
> > 	AHS_COMPLETED_LOCAL,
> > } PgAioHandleState;
>
> Do we need to distinguish between DEFINED and PREPARED?

I found it rather confusing if it's not possible to tell whether some action
(like the prepare callback) has already happened or not. It's useful to be
able to look at an IO in a backtrace or such and see exactly what state it is
in.

In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.

> At quick glance, those states are treated the same. (The comment refers to
> pgaio_io_start_*() functions, but there's no such thing)

They're called pgaio_io_prep_{readv,writev} now, updated the comment.

> I didn't quite understand the point of the prepare callbacks. For example,
> when AsyncReadBuffers() calls smgrstartreadv(), the
> shared_buffer_readv_prepare() callback will be called.
> Why doesn't AsyncReadBuffers() do the "prepare" work itself directly; why
> does it need to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subsystem needs to take its own pins
etc. Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.

> I assume it's somehow related to error handling, but I didn't quite get
> it. Perhaps an "abort" callback that'd be called on error, instead of a
> "prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need to claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).

> There are some synonyms used in the code: I think "in-flight" and
> "submitted" mean the same thing.

Fair. I guess in my mind the process of moving an IO into flight is
"submitting", and the state of having been submitted but not yet having
completed is being in flight. But that's probably not useful.

> And "prepared" and "staged". I'd suggest picking just one term for each
> concept.

Agreed.

> I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states. Does a
> single IO go through both states, or are they mutually exclusive? At quick
> glance, I don't actually see any code that would set the COMPLETED_LOCAL
> state; is it dead code?

It's dead code right now. I've made it dead and undead a couple of times :/.
Unfortunately I think I need to revive it to make some corner cases with
temporary tables work (AIO for a temp table is executed via io_uring, another
backend waits for *another* IO executed via that io_uring instance and reaps
the completion -> we can't update the local buffer state in the shared
completion callback).

> REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
> but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?

> > - There's no obvious way to tell "internal" functions operating on an IO handle
> >   apart from functions that are expected to be called by the issuer of an IO.
> >
> >   One way to deal with this would be to introduce a distinct "issuer IO
> >   reference" type. I think that might be a good idea, it would also make it
> >   clearer that a good number of the functions can only be called by the
> >   issuer, before the IO is submitted.
> >
> >   This would also make it easier to order functions more sensibly in aio.c, as
> >   all the issuer functions would be together.
> >
> >   The functions on AIO handles that everyone can call already have a distinct
> >   type (PgAioHandleRef vs PgAioHandle*).
>
> Hmm, yeah I think you might be onto something here. I'll give it a try.
> Could pgaio_io_get() return a PgAioHandleRef directly, so that the issuer
> would never see a raw PgAioHandle?

Don't think that would be helpful - that way there'd be no difference at all
anymore between what functions any backend can call and what the issuer can
do.

> Finally, attached are a couple of typos and other trivial suggestions.

Integrating... Thanks!

Andres
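Roughly, the issuer-side flow that the above concepts add up to is the
following. This is only a sketch: the function names are the ones used in this
thread, but the signatures are approximations rather than the patchset's real
API, and pgaio_io_set_return()/pgaio_io_ref_wait() are guessed names.

    PgAioHandle    *ioh;    /* issuer-only handle, recycled after completion */
    PgAioHandleRef  ior;    /* reference that any backend can wait on */
    PgAioReturn     ioret;  /* issuer-owned storage for the IO's result */

    ioh = pgaio_io_get(...);            /* may block until a handle is free */
    ior = pgaio_io_get_ref(ioh);        /* cheap, never blocks */
    pgaio_io_set_return(ioh, &ioret);   /* guessed name: attach result storage */

    /* define the IO; the subject's prepare callback runs as part of this */
    smgrstartreadv(ioh, reln, forknum, blocknum, buffers, nblocks);

    /*
     * From here on the handle may already be submitted and later reused for a
     * different IO - only the reference and the return struct may be touched.
     */
    pgaio_io_ref_wait(&ior);            /* guessed name: wait for completion */
    /* afterwards, ioret holds the IO's result / error information */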
On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> The difference between a handle and a reference is useful right now, to have
> some separation between the functions that can be called by anyone (taking a
> PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
> solved by having a PgAioHandleIssuerRef ref or something.

To me, those names don't convey that. I would perhaps call the thing that
supports issuer-only operations a "PgAio" and the thing other people can use a
"PgAioHandle". Or "PgAioRequest" and "PgAioHandle" or something like that.
With PgAioHandleRef, IMHO you've got two words that both imply a layer of
indirection -- "handle" and "ref" -- which doesn't seem quite as nice, because
then the other thing -- "PgAioHandle" -- still sort of implies one layer of
indirection and the whole thing seems a bit less clear. (I say all of this
having looked at nothing, so feel free to ignore me if that doesn't sound
coherent.)

> > REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
> > but it's actually vice versa.
>
> What would you call having gotten "completion notifications" from the kernel,
> but not having processed them?

The Linux kernel calls those zombie processes, so we could call it a ZOMBIE
state, but that seems like it might be a bit of inside baseball. I do agree
with Heikki that REAPED sounds later than COMPLETED, because you reap zombie
processes by collecting their exit status. Maybe you could have AHS_COMPLETE
or AHS_IO_COMPLETE for the state where the I/O is done but there's still
completion-related work to be done, and then the other state could be AHS_DONE
or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-01-07 22:09:56 +0200, Heikki Linnakangas wrote:
> On 07/01/2025 18:11, Andres Freund wrote:
> > > I didn't quite understand the point of the prepare callbacks. For example,
> > > when AsyncReadBuffers() calls smgrstartreadv(), the
> > > shared_buffer_readv_prepare() callback will be called. Why doesn't
> > > AsyncReadBuffers() do the "prepare" work itself directly; why does it need
> > > to be in a callback?
> >
> > One big part of it is "ownership" - while the IO isn't completely "assembled",
> > we can release all buffer pins etc in case of an error. But if the error
> > happens just after the IO was staged, we can't - the buffer is still
> > referenced by the IO. For that the AIO subsystem needs to take its own pins
> > etc. Initially the prepare callback didn't exist, the code in
> > AsyncReadBuffers() was a lot more complicated before it.
> >
> > > I assume it's somehow related to error handling, but I didn't quite get
> > > it. Perhaps an "abort" callback that'd be called on error, instead of a
> > > "prepare" callback, would be better?
> >
> > I don't think an error callback would be helpful - the whole thing is that we
> > basically need to claim ownership of all IO related resources IFF the IO is
> > staged. Not before (because then the IO not getting staged would mean we have
> > a resource leak), not after (because we might error out and thus not keep
> > e.g. buffers pinned).
>
> Hmm. The comments say that when you call smgrstartreadv(), the IO handle may
> no longer be modified, as the IO may be executed immediately. What if we
> changed that so that it never submits the IO, only adds the necessary
> callbacks to it?
>
> In that world, when smgrstartreadv() returns, the necessary details and
> completion callbacks have been set in the IO handle, but the caller can
> still do more preparation before the IO is submitted. The caller must ensure
> that it gets submitted, however, so no erroring out in that state.
>
> Currently the call stack looks like this:
>
> AsyncReadBuffers()
> -> smgrstartreadv()
> -> mdstartreadv()
> -> FileStartReadV()
> -> pgaio_io_prep_readv()
> -> shared_buffer_readv_prepare() (callback)
> <- (return)
> <- (return)
> <- (return)
> <- (return)
> <- (return)
>
> I'm thinking that the prepare work is done "on the way up" instead:
>
> AsyncReadBuffers()
> -> smgrstartreadv()
> -> mdstartreadv()
> -> FileStartReadV()
> -> pgaio_io_prep_readv()
> <- (return)
> <- (return)
> <- (return)
> -> shared_buffer_readv_prepare()
> <- (return)
>
> Attached is a patch to demonstrate concretely what I mean.

I think this would be somewhat limiting. Right now it's indeed just bufmgr.c
that needs to do a preparation (or "moving of ownership") step - but I don't
think it's necessarily going to stay that way.

Consider e.g. a hypothetical threaded future in which we have refcounted file
descriptors. While AIO is ongoing, the AIO subsystem would need to ensure that
the FD refcount is increased, otherwise you'd obviously run into trouble if
the issuing backend errored out and released its own reference as part of
resowner release.

I don't think the approach you suggest above would scale well for such a
situation - shared_buffer_readv_prepare() would again need to call to
smgr->md->fd. Whereas with the current approach md.c (or fd.c?) could just
define its own prepare callback that increased the refcount at the right
moment.
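Spelled out, such an fd.c/md.c prepare callback might look something like this
(entirely hypothetical, of course - we don't have refcounted FDs today, and
all of the names below are made up):

    /* hypothetical fd.c prepare callback in a refcounted-FD world */
    static void
    file_readv_prepare(PgAioHandle *ioh)
    {
        File        file = pgaio_io_get_file(ioh);  /* made-up accessor */

        /*
         * Transfer ownership to the AIO subsystem: it now holds its own
         * reference to the file, so the issuing backend erroring out and
         * dropping its reference during resowner cleanup can't close the
         * file underneath the in-flight IO.
         */
        FileAddReference(file);                     /* made-up function */
    }

The matching shared completion callback would drop that reference again,
without bufmgr.c having to know anything about it.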
There are a few other scenarios I can think of:

- If somebody were - no idea what made me think of that - to write an smgr
  implementation where storage is accessed over the network, one might need
  to keep network buffers and sockets alive for the duration of the IO.

- It'd be rather useful to have support for asynchronously extending a
  relation; that often requires filesystem journal IO and thus is slow. If
  you're bulk loading, or the extension lock is contended, it'd be great if we
  could start the next relation extension *before* it's needed and has to
  happen synchronously.

  To avoid deadlocks, such an asynchronous extension would need to allow the
  lock to be released by any other backend, just like it's needed for the
  content locks when asynchronously writing. Which in turn would require
  transferring ownership of the relevant buffers *and* the extension lock.

You could mash this together, but it seems like a separate callback would make
it more composable.

Does that make any sense to you?

> This adds a new pgaio_io_stage() step to the issuer, and the issuer needs to
> call the prepare functions explicitly, instead of having them as callbacks.
> Nominally that's more steps, but IMHO it's better to be explicit. The same
> actions were happening previously too, they were just hidden in the callback.
> I updated the README to show that too.
>
> I'm not wedded to this, but it feels a little better to me.

Right now the current approach seems to make more sense to me, but I'll think
about it more. I might also have missed something with my theorizing above.

Greetings,

Andres Freund
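One way to picture the composability argument above (purely illustrative - the
callback registration below is invented, it is not how the patchset actually
wires prepare callbacks up):

    /* each layer attaches the prepare step for the resources it owns */
    pgaio_io_add_prepare_cb(ioh, shared_buffer_readv_prepare); /* buffer pins */
    pgaio_io_add_prepare_cb(ioh, file_readv_prepare);          /* file reference */
    pgaio_io_add_prepare_cb(ioh, relation_extend_prepare);     /* extension lock */

    /*
     * When the IO is staged, the callbacks run in order, and ownership of
     * pins, file references and locks moves to the AIO subsystem at exactly
     * that point - neither before nor after.
     */
    pgaio_io_stage(ioh);

With a single prepare step performed by the issuer, every new resource type
would instead have to be folded into that one function.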
Hi,

On 2025-01-07 14:59:58 -0500, Robert Haas wrote:
> On Tue, Jan 7, 2025 at 11:11 AM Andres Freund <andres@anarazel.de> wrote:
> > The difference between a handle and a reference is useful right now, to have
> > some separation between the functions that can be called by anyone (taking a
> > PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
> > solved by having a PgAioHandleIssuerRef ref or something.
>
> To me, those names don't convey that.

I'm certainly not wedded to these names - I went back and forth between
different names a fair bit, because I wasn't quite happy. I am however certain
that the current names are better than what they used to be (PgAioInProgress
and, because that's long, a bunch of PgAioIP* names) :)

To make sure we're talking about the same things, I am thinking of the
following "entities" needing names:

1) Shared memory representation of an IO, for the AIO subsystem internally

   Currently: PgAioHandle

   Because shared memory is limited, we need to reuse this entity. This reuse
   needs to be possible "immediately" after completion, to avoid a bunch of
   nasty scenarios. To distinguish a reused PgAioHandle from its "prior"
   incarnation, each PgAioHandle has a 64-bit generation counter.

   In addition to being referenceable via pointer, it's also possible to
   assign a 32-bit integer to each PgAioHandle, as there is a fixed number of
   them.

2) A way for the issuer of an IO to reference 1), to attach information to the
   IO

   Currently: PgAioHandle *

   As long as the issuer hasn't yet staged the IO, it can't be reused.
   Therefore it's OK to just point to the PgAioHandle. One disadvantage of
   just using a PgAioHandle * is that it's harder to distinguish
   subsystem-internal functions that accept PgAioHandle * from "public"
   functions that accept the "issuer reference".

3) A way for any backend to wait for a specific IO to complete

   Currently: PgAioHandleRef

   This references 1) using a 32-bit ID and the 64-bit generation. This is
   used to allow any backend to wait for a specific IO to complete, e.g. by
   including it in the BufferDesc so that WaitIO() can wait for it. Because it
   includes the generation, it's trivial to detect whether the PgAioHandle was
   reused.

> I would perhaps call the thing that supports issuer-only operations a
> "PgAio" and the thing other people can use a "PgAioHandle". Or
> "PgAioRequest" and "PgAioHandle" or something like that. With
> PgAioHandleRef, IMHO you've got two words that both imply a layer of
> indirection -- "handle" and "ref" -- which doesn't seem quite as nice,
> because then the other thing -- "PgAioHandle" still sort of implies one
> layer of indirection and the whole thing seems a bit less clear.

It's indirections all the way down. The PG representation of "one IO" in the
end is just an indirection for a kernel operation :)

I would like to split 1) and 2) above:

1) PgAio{Handle,Request,} (a large struct) - used internally by the AIO
   subsystem, "pointed to" by the following

2) PgAioIssuerRef (an ID or pointer) - used by the issuer to incrementally
   define the IO

3) PgAioWaitRef (an ID and generation) - used to wait for a specific IO to
   complete, not affected by reuse of the PgAioHandle

> > > REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
> > > but it's actually vice versa.
> >
> > What would you call having gotten "completion notifications" from the kernel,
> > but not having processed them?
> The Linux kernel calls those zombie processes, so we could call it a ZOMBIE
> state, but that seems like it might be a bit of inside baseball.

ZOMBIE feels even later than REAPED to me :)

> I do agree with Heikki that REAPED sounds later than COMPLETED, because you
> reap zombie processes by collecting their exit status. Maybe you could have
> AHS_COMPLETE or AHS_IO_COMPLETE for the state where the I/O is done but
> there's still completion-related work to be done, and then the other state
> could be AHS_DONE or AHS_FINISHED or AHS_FINAL or AHS_REAPED or something.

How about

AHS_COMPLETE_KERNEL or AHS_COMPLETE_RAW - raw syscall completed
AHS_COMPLETE_SHARED_CB - shared callback completed
AHS_COMPLETE_LOCAL_CB - local callback completed

?

Greetings,

Andres Freund
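For reference, the PgAioWaitRef proposed above (a 32-bit ID plus a 64-bit
generation) boils down to something like the following sketch; the field names
are approximations, not necessarily the patchset's:

    typedef struct PgAioWaitRef
    {
        uint32      aio_index;      /* which PgAioHandle, out of a fixed-size array */
        uint64      generation;     /* which incarnation of that handle */
    } PgAioWaitRef;

    /*
     * Waiting with such a reference is safe against handle reuse: if the
     * generation stored in the shared PgAioHandle no longer matches the
     * reference, the IO of interest has already completed and the handle has
     * been recycled for a different IO, so there is nothing left to wait for.
     */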