Re: AIO v2.3 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO v2.3 |
Date | |
Msg-id | 42wltxewfbl7uj6rehcaciythpkm2k2lt7zqfpmj76mztn42yc@hauc6p2qv6lh Whole thread Raw |
In response to | Re: AIO v2.3 (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
Hi, On 2025-02-11 11:48:38 +1300, Thomas Munro wrote: > On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres@anarazel.de> wrote: > > - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or > > such. I don't love that name though. > > Problem statement: You want to be able to batch I/O submission, ie > make a single call to ioring_enter() (and other mechanisms) to start > several I/Os, but the code that submits is inside StartReadBuffers() > and the code that knows how many I/Os it wants to start now is at a > higher level, read_stream.c and in future elsewhere. So you invented > this flag to tell StartReadBuffers() not to call > pgaio_submit_staged(), because you promise to do it later, via this > staging list. Additionally, there is a kind of programming rule here > that you *must* submit I/Os that you stage, you aren't allowed to (for > example) stage I/Os and then sleep, so it has to be a fairly tight > piece of code. > > Would the API be better like this?: When you want to create a batch > of I/Os submitted together, you wrap the work in pgaio_begin_batch() > and pgaio_submit_batch(), eg the loop in read_stream_lookahead(). > Then bufmgr wouldn't need this flag: when it (or anything else) calls > smgrstartreadv(), if there is not currently an explicit batch then it > would be submitted immediately, and otherwise it would only be staged. > This way, batch construction (or whatever word you prefer for batch) > is in a clearly and explicitly demarcated stretch of code in one > lexical scope (though its effect is dynamically scoped just like the > staging list itself because we don't want to pass explicit I/O > contexts through the layers), but code that doesn't call those and > reaches AsyncReadBuffer() or whatever gets an implicit batch of size > one and that's also OK. Not sure what semantics nesting would have > but I doubt it matters much. I'm a bit unexcited about the work to redesign this, but I also admit that you have a point :) Linux calls a similar concept "plugging" the queue. I think I like "batch" better, but only marginally. > > Things that need to be fixed / are fixed in this: > > - max pinned buffers should be limited by io_combine_limit, not * 4 > > - overflow distance > > - pins need to be limited in more places > > I have patches for these and a few more things and will post in a > separate thread shortly because they can be understood without > reference to this AIO stuff and that'll hopefully be more digestible. Yay! > + /* IO finished, but result has not yet been processed */ > + PGAIO_HS_COMPLETED_IO, > + > + /* IO completed, shared completion has been called */ > + PGAIO_HS_COMPLETED_SHARED, > + > + /* IO completed, local completion has been called */ > + PGAIO_HS_COMPLETED_LOCAL, > > (Repeating something I mentioned in off-list bikeshedding) I wondered > if it might be clearer to use the terminology "terminated" for the > work that PostgreSQL has to do after an I/O completes, instead of > overloading/subdividing the term "completed". We already "terminate" > an I/O when smgr I/O completes in pre-existing bufmgr terminology, and > this feels like a sort of generalisation of that notion. I have a mild laziness preference for complete over terminate, but not more. If others agree with Thomas, I'm ok with changing it. > In this AIO world, some work is done by the backend that receives the > completion notification from the kernel, and some is done by the backend > that submitted the I/O in the first place, a division that doesn't exist > with simple synchronous system calls. I wonder if it would be clearer to > use terms based on those two roles, rather than "shared" and "local", > leading to something like: > > PGAIO_HS_COMPLETED, > PGAIO_HS_TERMINATED_BY_COMPLETER, > PGAIO_HS_TERMINATED_BY_SUBMITTER, I don't love those, because the SHARED/LOCAL does imply more clearly what you have access to. I.e. executing things in a shared completion callback for IO on a temporary buffer doesn't make sense, you won't have access to the local buffer table. Greetings, Andres Freund
pgsql-hackers by date: