Re: AIO v2.2 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO v2.2 |
Date | |
Msg-id | 5wsjqnuhs42awifjlbepcfxea442id5zhj44bwi6mf7rdrimtp@342nf7wirc5d Whole thread Raw |
In response to | Re: AIO v2.2 (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hi, On 2025-01-07 22:09:56 +0200, Heikki Linnakangas wrote: > On 07/01/2025 18:11, Andres Freund wrote: > > > I didn't quite understand the point of the prepare callbacks. For example, > > > when AsyncReadBuffers() calls smgrstartreadv(), the > > > shared_buffer_readv_prepare() callback will be called. Why doesn't > > > AsyncReadBuffers() do the "prepare" work itself directly; why does it need > > > to be in a callback? > > > > One big part of it is "ownership" - while the IO isn't completely "assembled", > > we can release all buffer pins etc in case of an error. But if the error > > happens just after the IO was staged, we can't - the buffer is still > > referenced by the IO. For that the AIO subystem needs to take its own pins > > etc. Initially the prepare callback didn't exist, the code in > > AsyncReadBuffers() was a lot more complicated before it. > > > > > > > I assume it's somehow related to error handling, but I didn't quite get > > > it. Perhaps an "abort" callback that'd be called on error, instead of a > > > "prepare" callback, would be better? > > > > I don't think an error callback would be helpful - the whole thing is that we > > basically need claim ownership of all IO related resources IFF the IO is > > staged. Not before (because then the IO not getting staged would mean we have > > a resource leak), not after (because we might error out and thus not keep > > e.g. buffers pinned). > > Hmm. The comments say that when you call smgrstartreadv(), the IO handle may > no longer be modified, as the IO may be executed immediately. What if we > changed that so that it never submits the IO, only adds the necessary > callbacks to it? > In that world, when smgrstartreadv() returns, the necessary details and > completion callbacks have been set in the IO handle, but the caller can > still do more preparation before the IO is submitted. The caller must ensure > that it gets submitted, however, so no erroring out in that state. > > Currently the call stack looks like this: > > AsyncReadBuffers() > -> smgrstartreadv() > -> mdstartreadv() > -> FileStartReadV() > -> pgaio_io_prep_readv() > -> shared_buffer_readv_prepare() (callback) > <- (return) > <- (return) > <- (return) > <- (return) > <- (return) > > I'm thinking that the prepare work is done "on the way up" instead: > > AsyncReadBuffers() > -> smgrstartreadv() > -> mdstartreadv() > -> FileStartReadV() > -> pgaio_io_prep_readv() > <- (return) > <- (return) > <- (return) > -> shared_buffer_readv_prepare() > <- (return) > > Attached is a patch to demonstrate concretely what I mean. I think this would be somewhat limiting. Right now it's indeed just bufmgr.c that needs to do a preparation (or "moving of ownership") step - but I don't think it's necessarily going to stay that way. Consider e.g. a hypothetical threaded future in which we have refcounted file descriptors. While AIO is ongoing, the AIO subsystem would need to ensure that the FD refcount is increased, otherwise you'd obviously run into trouble if the issuing backend errored out and released its own reference as part of resowner release. I don't think the approach you suggest above would scale well for such a situation - shared_buffer_readv_prepare() would again need to call to smgr->md->fd. Whereas with the current approach md.c (or fd.c?) could just define its own prepare callback that increased the refcount at the right moment. There's a few other scenarios I can think of: - If somebody were - no idea what made me think of that - to write an smgr implementation where storage is accessed over the network, one might need to keep network buffers and sockets alive for the duration of the IO. - It'd be rather useful to have support for asynchronously extending a relation, that often requires filesystem journal IO and thus is slow. If you're bulk loading, or the extension lock is contented, it'd be great if we could start the next relation extension *before* it's needed and the extension has to happen synchronously. To avoid deadlocks, such an asynchronous extension would need to be able to release the lock in any other backend, just like it's needed for the content locks when asynchronously writing. Which in turn would require transferring ownership of the relevant buffers *and* the extension lock. You could mash this together, but it seems like a separate callback woul make it more composable. Does that make any sense to you? > This adds a new pgaio_io_stage() step to the issuer, and the issuer needs to > call the prepare functions explicitly, instead of having them as callbacks. > Nominally that's more steps, but IMHO it's better to be explicit. The same > actions were happening previously too, it was just hidden in the callback. I > updated the README to show that too. > > I'm not wedded to this, but it feels a little better to me. Right now the current approach seems to make more sense to me, but I'll think about it more. I might also have missed something with my theorizing above. Greetings, Andres Freund
pgsql-hackers by date: