Re: AIO v2.2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.2
Date
Msg-id 5wsjqnuhs42awifjlbepcfxea442id5zhj44bwi6mf7rdrimtp@342nf7wirc5d
In response to Re: AIO v2.2  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2025-01-07 22:09:56 +0200, Heikki Linnakangas wrote:
> On 07/01/2025 18:11, Andres Freund wrote:
> > > I didn't quite understand the point of the prepare callbacks. For example,
> > > when AsyncReadBuffers() calls smgrstartreadv(), the
> > > shared_buffer_readv_prepare() callback will be called. Why doesn't
> > > AsyncReadBuffers() do the "prepare" work itself directly; why does it need
> > > to be in a callback?
> >
> > One big part of it is "ownership" - while the IO isn't completely "assembled",
> > we can release all buffer pins etc in case of an error. But if the error
> > happens just after the IO was staged, we can't - the buffer is still
> > referenced by the IO. For that the AIO subsystem needs to take its own pins
> > etc.  Initially the prepare callback didn't exist; the code in
> > AsyncReadBuffers() was a lot more complicated before it was introduced.
> >
> >
> > > I assume it's somehow related to error handling, but I didn't quite get
> > > it. Perhaps an "abort" callback that'd be called on error, instead of a
> > > "prepare" callback, would be better?
> >
> > I don't think an error callback would be helpful - the whole point is that we
> > basically need to claim ownership of all IO-related resources IFF the IO is
> > staged. Not before (because then the IO not getting staged would mean we have
> > a resource leak), not after (because we might error out and thus not keep
> > e.g. buffers pinned).
>
> Hmm. The comments say that when you call smgrstartreadv(), the IO handle may
> no longer be modified, as the IO may be executed immediately. What if we
> changed that so that it never submits the IO, only adds the necessary
> callbacks to it?

> In that world, when smgrstartreadv() returns, the necessary details and
> completion callbacks have been set in the IO handle, but the caller can
> still do more preparation before the IO is submitted. The caller must ensure
> that it gets submitted, however, so no erroring out in that state.
>
> Currently the call stack looks like this:
>
> AsyncReadBuffers()
> -> smgrstartreadv()
>   -> mdstartreadv()
>     -> FileStartReadV()
>       -> pgaio_io_prep_readv()
>         -> shared_buffer_readv_prepare() (callback)
>         <- (return)
>       <- (return)
>     <- (return)
>   <- (return)
> <- (return)
>
> I'm thinking that the prepare work is done "on the way up" instead:
>
> AsyncReadBuffers()
> -> smgrstartreadv()
>   -> mdstartreadv()
>     -> FileStartReadV()
>       -> pgaio_io_prep_readv()
>       <- (return)
>     <- (return)
>   <- (return)
> -> shared_buffer_readv_prepare()
> <- (return)
>
> Attached is a patch to demonstrate concretely what I mean.

I think this would be somewhat limiting. Right now it's indeed just bufmgr.c
that needs to do a preparation (or "moving of ownership") step - but I don't
think it's necessarily going to stay that way.
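To make that ownership rule concrete, here's a toy C model of the pin transfer (every name here is invented; this is not the actual AIO API): the IO takes its own buffer pin only inside a prepare step that runs iff the IO is actually staged, so an error before staging leaks nothing, and an error after staging still leaves the buffer pinned on behalf of the ongoing IO.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ownership rule; names are hypothetical, not the
 * actual PostgreSQL AIO API. */
typedef struct ToyBuffer { int pincount; } ToyBuffer;
typedef struct ToyIO { ToyBuffer *buf; bool staged; } ToyIO;

static inline void toy_pin(ToyBuffer *b)   { b->pincount++; }
static inline void toy_unpin(ToyBuffer *b) { b->pincount--; }

/* prepare callback: called iff the IO is actually staged */
static inline void toy_prepare(ToyIO *io)
{
    toy_pin(io->buf);          /* this pin is owned by the IO, not the issuer */
    io->staged = true;
}

/* completion: the AIO side drops its own pin */
static inline void toy_complete(ToyIO *io)
{
    toy_unpin(io->buf);
    io->staged = false;
}
```

If the issuer errors out before staging, it releases its own pin and the pincount drops to zero; if it errors out after staging, the IO's pin keeps the buffer alive until completion.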

Consider e.g. a hypothetical threaded future in which we have refcounted file
descriptors. While AIO is ongoing, the AIO subsystem would need to ensure that
the FD refcount is increased, otherwise you'd obviously run into trouble if
the issuing backend errored out and released its own reference as part of
resowner release.

I don't think the approach you suggest above would scale well for such a
situation - shared_buffer_readv_prepare() would again need to call down
through smgr->md->fd.  Whereas with the current approach md.c (or fd.c?) could just
define its own prepare callback that increased the refcount at the right
moment.
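A toy sketch of that hypothetical refcounted-FD world (all names made up, nothing here is existing PostgreSQL code): the fd layer registers its own prepare callback on the handle, so staging bumps the FD refcount without bufmgr.c knowing anything about FD lifetimes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of refcounted FDs kept alive by AIO; names are invented. */
typedef struct ToyFd { int refcount; } ToyFd;

typedef struct ToyFdIO ToyFdIO;
typedef void (*toy_fd_prepare_cb)(ToyFdIO *io);

struct ToyFdIO {
    ToyFd *fd;
    toy_fd_prepare_cb prepare; /* set by the fd layer, not by bufmgr */
};

/* fd-layer callback: the IO takes its own FD reference at staging time,
 * so resowner release in the issuing backend can't close the fd */
static void toy_fd_prepare(ToyFdIO *io)
{
    io->fd->refcount++;
}

/* fd-layer start-IO path: records which fd and which prepare callback */
static void toy_fd_start_io(ToyFdIO *io, ToyFd *fd)
{
    io->fd = fd;
    io->prepare = toy_fd_prepare;
}

/* AIO core staging: runs whatever prepare callback was registered */
static void toy_fd_stage(ToyFdIO *io)
{
    if (io->prepare != NULL)
        io->prepare(io);
}
```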


There are a few other scenarios I can think of:

- If somebody were - no idea what made me think of that - to write an smgr
  implementation where storage is accessed over the network, one might need to
  keep network buffers and sockets alive for the duration of the IO.

- It'd be rather useful to have support for asynchronously extending a
  relation; that often requires filesystem journal IO and thus is slow. If
  you're bulk loading, or the extension lock is contended, it'd be great if we
  could start the next relation extension *before* it's needed, so the
  extension doesn't have to happen synchronously.  To avoid deadlocks, such an
  asynchronous extension would need to be able to release the lock in any
  other backend, just like it's needed for the content locks when
  asynchronously writing.  Which in turn would require transferring ownership
  of the relevant buffers *and* the extension lock.  You could mash this
  together, but it seems like a separate callback would make it more
  composable.
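A toy illustration of why separate callbacks compose (again, invented names, purely a sketch): each layer registers its own transfer step on the handle, and staging runs them all, so transferring buffer pins and transferring the extension lock remain independent pieces that don't need to know about each other.

```c
#include <assert.h>

/* Toy sketch of composable per-layer prepare callbacks; hypothetical. */
#define TOY_MAX_CBS 4

typedef struct ToyExtIO ToyExtIO;
typedef void (*toy_ext_cb)(ToyExtIO *io);

struct ToyExtIO {
    toy_ext_cb cbs[TOY_MAX_CBS];
    int  ncbs;
    int  pins_owned;   /* buffer pins transferred to the IO */
    int  locks_owned;  /* extension locks transferred to the IO */
};

static void toy_ext_register(ToyExtIO *io, toy_ext_cb cb)
{
    assert(io->ncbs < TOY_MAX_CBS);
    io->cbs[io->ncbs++] = cb;
}

/* bufmgr-level callback: transfer pins on the new buffers */
static void toy_transfer_pins(ToyExtIO *io) { io->pins_owned++; }

/* lock-level callback: transfer the relation extension lock */
static void toy_transfer_lock(ToyExtIO *io) { io->locks_owned++; }

/* staging runs every registered callback, in registration order */
static void toy_ext_stage(ToyExtIO *io)
{
    for (int i = 0; i < io->ncbs; i++)
        io->cbs[i](io);
}
```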


Does that make any sense to you?


> This adds a new pgaio_io_stage() step to the issuer, and the issuer needs to
> call the prepare functions explicitly, instead of having them as callbacks.
> Nominally that's more steps, but IMHO it's better to be explicit. The same
> actions were happening previously too, it was just hidden in the callback. I
> updated the README to show that too.
>
> I'm not wedded to this, but it feels a little better to me.

Right now the current approach seems to make more sense to me, but I'll think
about it more. I might also have missed something with my theorizing above.


Greetings,

Andres Freund


