Home > mailing lists
Re: AIO v2.2 - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: AIO v2.2
Date	January 7 19:11:45
Msg-id	6xccmmosuto25zpuzlqderggcpekmi7va7cs52wzjmbvtvp3ic@gfmoe2lvdfoy Whole thread Raw
Responses	Re: AIO v2.2
List	pgsql-hackers
Tree view
Hi,

On 2025-01-07 17:09:58 +0200, Heikki Linnakangas wrote:
> On 01/01/2025 06:03, Andres Freund wrote:
> > Hi,
> > 
> > Attached is a new version of the AIO patchset.
> 
> I haven't gone through it all yet, but some comments below.

Thanks!


> > The biggest changes are:
> > 
> > - The README has been extended with an overview of the API. I think it gives a
> >    good overview of how the API fits together.  I'd be very good to get
> >    feedback from folks that aren't as familiar with AIO, I can't really see
> >    what's easy/hard anymore.
> 
> Thanks, the README is super helpful! I was overwhelmed by all the new
> concepts before, now it all makes much more sense.
> 
> Now that it's all laid out more clearly, I see how many different concepts
> and states there really are:
> 
> - For a single IO, there is an "IO handle", "IO references", and an "IO
> return". You first allocate an IO handle (PgAioHandle), and then you get a
> reference (PgAioHandleRef) and an "IO return" (PgAioReturn) struct for it.
> 
> - An IO handle has eight different states (PgAioHandleState).
> 
> I'm sure all those concepts exist for a reason. But still I wonder: can we
> simplify?

Probably, but it's not exactly obvious to me where.

The difference between a handle and a reference is useful right now, to have
some separation between the functions that can be called by anyone (taking a
PgAioHandleRef) and only by the issuer (PgAioHandle). That might better be
solved by having a PgAioHandleIssuerRef ref or something.

Having PgAioReturn be separate from the AIO handle turns out to be rather
crucial, otherwise it's very hard to guarantee "forward progress",
i.e. guarantee that pgaio_io_get() will return something without blocking
forever.



> pgaio_io_get() and pgaio_io_release() are a bit asymmetric, I'd suggest
> pgaio_io_acquire() or similar. "get" also feels very innocent, even though
> it may wait for previous IO to finish. Especially when pgaio_io_get_ref()
> actually is innocent.

WFM.


> > typedef enum PgAioHandleState
> > {
> >     /* not in use */
> >     AHS_IDLE = 0,
> > 
> >     /* returned by pgaio_io_get() */
> >     AHS_HANDED_OUT,
> > 
> >     /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> >     AHS_DEFINED,
> > 
> >     /* subject's prepare() callback has been called */
> >     AHS_PREPARED,
> > 
> >     /* IO has been submitted and is being executed */
> >     AHS_IN_FLIGHT,
> > 
> >     /* IO finished, but result has not yet been processed */
> >     AHS_REAPED,
> > 
> >     /* IO completed, shared completion has been called */
> >     AHS_COMPLETED_SHARED,
> > 
> >     /* IO completed, local completion has been called */
> >     AHS_COMPLETED_LOCAL,
> > } PgAioHandleState;
> 
> Do we need to distinguish between DEFINED and PREPARED?

I found it to be rather confusing if it's not possible to tell if some action
(like the prepare callback) has already happened, or not.  It's useful to be
able look at an IO in a backtrace or such and see exactly in what state it is
in.

In v1 I had several of the above states managed as separate boolean variables
- that turned out to be a huge mess, it's a lot easier to understand if
there's a single strictly monotonically increasing state.


> At quick glance, those states are treated the same. (The comment refers to
> pgaio_io_start_*() functions, but there's no such thing)

They're called pgaio_io_prep_{readv,writev} now, updated the comment.


> I didn't quite understand the point of the prepare callbacks. For example,
> when AsyncReadBuffers() calls smgrstartreadv(), the
> shared_buffer_readv_prepare() callback will be called. Why doesn't
> AsyncReadBuffers() do the "prepare" work itself directly; why does it need
> to be in a callback?

One big part of it is "ownership" - while the IO isn't completely "assembled",
we can release all buffer pins etc in case of an error. But if the error
happens just after the IO was staged, we can't - the buffer is still
referenced by the IO. For that the AIO subystem needs to take its own pins
etc.  Initially the prepare callback didn't exist, the code in
AsyncReadBuffers() was a lot more complicated before it.


> I assume it's somehow related to error handling, but I didn't quite get
> it. Perhaps an "abort" callback that'd be called on error, instead of a
> "prepare" callback, would be better?

I don't think an error callback would be helpful - the whole thing is that we
basically need claim ownership of all IO related resources IFF the IO is
staged. Not before (because then the IO not getting staged would mean we have
a resource leak), not after (because we might error out and thus not keep
e.g. buffers pinned).


> There are some synonyms used in the code: I think "in-flight" and
> "submitted" mean the same thing.

Fair.  I guess in my mind the process of moving an IO into flight is
"submitting" and the state of not having been submitted but not yet having
completed is being in flight. But that's probably not useful.


> And "prepared" and "staged". I'd suggest picking just one term for each
> concept.

Agreed.


> I didn't understand the COMPLETED_SHARED and COMPLETED_LOCAL states. does a
> single IO go through both states, or are the mutually exclusive? At quick
> glance, I don't actually see any code that would set the COMPLETED_LOCAL
> state; is it dead code?

It's dead code right now. I've made it dead and undead a couple times
:/. Unfortunately I think I need to revive it to make some corner cases with
temporary tables work (AIO for temp table is executed via IO uring, another
backend waits for *another* IO executed via that IO uring instance and reaps
the completion -> we can't update the local buffer state in the shared
completion callback).


> REAPED feels like a bad name. It sounds like a later stage than COMPLETED,
> but it's actually vice versa.

What would you call having gotten "completion notifications" from the kernel,
but not having processed them?



> > - There's no obvious way to tell "internal" function operating on an IO handle
> >    apart from functions that are expected to be called by the issuer of an IO.
> > 
> >    One way to deal with this would be to introduce a distinct "issuer IO
> >    reference" type.  I think that might be a good idea, it would also make it
> >    clearer that a good number of the functions can only be called by the
> >    issuer, before the IO is submitted.
> > 
> >    This would also make it easier to order functions more sensibly in aio.c, as
> >    all the issuer functions would be together.
> > 
> >    The functions on AIO handles that everyone can call already have a distinct
> >    type (PgAioHandleRef vs PgAioHandle*).
> 
> Hmm, yeah I think you might be onto something here.

I'll give it a try.


> Could pgaio_io_get() return an PgAioHandleRef directly, so that the issuer
> would never see a raw PgAioHandle ?

Don't think that would be helpful - that way there'd be no difference at all
anymore between what functions any backend can call and what the issuer can
do.


>
> Finally, attached are a couple of typos and other trivial suggestions.

Integrating...


Thanks!

Andres
pgsql-hackers by date:
From: Heikki Linnakangas
Date: 07 January, 19:08:51
Subject: Re: AIO v2.2
From: Fabrízio de Royes Mello
Date: 07 January, 19:25:31
Subject: Re: Small patch to use pqMsg_Query instead of `Q`
Re: AIO v2.2 - Mailing list pgsql-hackers

Previous

Next