Thread: Re: AIO v2.3
On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:

+/* caller will issue more io, don't submit */
+#define READ_BUFFERS_MORE_MORE_MORE (1 << 3)

> - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
> such. I don't love that name though.

Problem statement: You want to be able to batch I/O submission, ie
make a single call to ioring_enter() (and other mechanisms) to start
several I/Os, but the code that submits is inside StartReadBuffers()
and the code that knows how many I/Os it wants to start now is at a
higher level, read_stream.c and in future elsewhere. So you invented
this flag to tell StartReadBuffers() not to call
pgaio_submit_staged(), because you promise to do it later, via this
staging list. Additionally, there is a kind of programming rule here
that you *must* submit I/Os that you stage; you aren't allowed to (for
example) stage I/Os and then sleep, so it has to be a fairly tight
piece of code.

Would the API be better like this?: When you want to create a batch
of I/Os submitted together, you wrap the work in pgaio_begin_batch()
and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
Then bufmgr wouldn't need this flag: when it (or anything else) calls
smgrstartreadv(), if there is not currently an explicit batch then it
would be submitted immediately, and otherwise it would only be staged.
This way, batch construction (or whatever word you prefer for batch)
is in a clearly and explicitly demarcated stretch of code in one
lexical scope (though its effect is dynamically scoped just like the
staging list itself, because we don't want to pass explicit I/O
contexts through the layers), but code that doesn't call those and
reaches AsyncReadBuffer() or whatever gets an implicit batch of size
one and that's also OK. Not sure what semantics nesting would have,
but I doubt it matters much.
> Things that need to be fixed / are fixed in this:
> - max pinned buffers should be limited by io_combine_limit, not * 4
> - overflow distance
> - pins need to be limited in more places

I have patches for these and a few more things and will post in a
separate thread shortly because they can be understood without
reference to this AIO stuff and that'll hopefully be more digestible.

+ /*
+  * In some rare-ish cases one operation causes multiple reads (e.g. if a
+  * buffer was concurrently read by another backend). It'd be much better
+  * if we ensured that each ReadBuffersOperation covered only one IO - but
+  * that's not entirely trivial, due to having pinned victim buffers before
+  * starting IOs.
+  *
+  * TODO: Change the API of StartReadBuffers() to ensure we only ever need
+  * one IO.

Likewise.

+ /* IO finished, but result has not yet been processed */
+ PGAIO_HS_COMPLETED_IO,
+
+ /* IO completed, shared completion has been called */
+ PGAIO_HS_COMPLETED_SHARED,
+
+ /* IO completed, local completion has been called */
+ PGAIO_HS_COMPLETED_LOCAL,

(Repeating something I mentioned in off-list bikeshedding) I wondered
if it might be clearer to use the terminology "terminated" for the
work that PostgreSQL has to do after an I/O completes, instead of
overloading/subdividing the term "completed". We already "terminate"
an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
this feels like a sort of generalisation of that notion.

In this AIO world, some work is done by the backend that receives the
completion notification from the kernel, and some is done by the backend
that submitted the I/O in the first place, a division that doesn't exist
with simple synchronous system calls. I wonder if it would be clearer to
use terms based on those two roles, rather than "shared" and "local",
leading to something like:

PGAIO_HS_COMPLETED,
PGAIO_HS_TERMINATED_BY_COMPLETER,
PGAIO_HS_TERMINATED_BY_SUBMITTER,
Hi,

On 2025-02-11 11:48:38 +1300, Thomas Munro wrote:
> On Thu, Jan 23, 2025 at 5:29 PM Andres Freund <andres@anarazel.de> wrote:
> > - Heikki doesn't love pgaio_submit_staged(), suggested pgaio_kick_staged() or
> > such. I don't love that name though.
>
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to ioring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere. So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list. Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
>
> Would the API be better like this?: When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK. Not sure what semantics nesting would have
> but I doubt it matters much.

I'm a bit unexcited about the work to redesign this, but I also admit
that you have a point :)

Linux calls a similar concept "plugging" the queue. I think I like
"batch" better, but only marginally.

> > Things that need to be fixed / are fixed in this:
> > - max pinned buffers should be limited by io_combine_limit, not * 4
> > - overflow distance
> > - pins need to be limited in more places
>
> I have patches for these and a few more things and will post in a
> separate thread shortly because they can be understood without
> reference to this AIO stuff and that'll hopefully be more digestible.

Yay!

> + /* IO finished, but result has not yet been processed */
> + PGAIO_HS_COMPLETED_IO,
> +
> + /* IO completed, shared completion has been called */
> + PGAIO_HS_COMPLETED_SHARED,
> +
> + /* IO completed, local completion has been called */
> + PGAIO_HS_COMPLETED_LOCAL,
>
> (Repeating something I mentioned in off-list bikeshedding) I wondered
> if it might be clearer to use the terminology "terminated" for the
> work that PostgreSQL has to do after an I/O completes, instead of
> overloading/subdividing the term "completed". We already "terminate"
> an I/O when smgr I/O completes in pre-existing bufmgr terminology, and
> this feels like a sort of generalisation of that notion.

I have a mild laziness preference for "complete" over "terminate", but
not more. If others agree with Thomas, I'm OK with changing it.

> In this AIO world, some work is done by the backend that receives the
> completion notification from the kernel, and some is done by the backend
> that submitted the I/O in the first place, a division that doesn't exist
> with simple synchronous system calls. I wonder if it would be clearer to
> use terms based on those two roles, rather than "shared" and "local",
> leading to something like:
>
> PGAIO_HS_COMPLETED,
> PGAIO_HS_TERMINATED_BY_COMPLETER,
> PGAIO_HS_TERMINATED_BY_SUBMITTER,

I don't love those, because SHARED/LOCAL does imply more clearly what
you have access to. I.e. executing things in a shared completion
callback for IO on a temporary buffer doesn't make sense; you won't
have access to the local buffer table.

Greetings,

Andres Freund
On Mon, Feb 10, 2025 at 2:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ...
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to ioring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere. So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list. Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
>
> Would the API be better like this?: When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK. Not sure what semantics nesting would have
> but I doubt it matters much.

I like this idea. If we want to submit a batch, then just submit a batch.

James
On Tue, Feb 11, 2025 at 12:11 PM James Hunter <james.hunter.pg@gmail.com> wrote:
> I like this idea. If we want to submit a batch, then just submit a batch.

Sounds good to me, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-02-11 11:48:38 +1300, Thomas Munro wrote:
> Would the API be better like this?: When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().

One annoying detail is that an API like this would, afaict, need
resowner support or something along those lines (e.g. xact callbacks
plus code in each aux process' sigsetjmp() block). Otherwise I don't
know how we would ensure that the "batch-is-in-progress" flag/counter
would get reset.

Alternatively we could make pgaio_batch_begin() basically start a
critical section, but that doesn't seem like a good idea, because too
much that needs to happen around buffered IO isn't compatible with
critical sections.

Does anybody see a need for batches to be nested? I'm inclined to
think that that would be indicative of bugs and should therefore
error/assert out.

One way we could avoid the need for a mechanism to reset
batch-in-progress would be to make batch submission controlled by a
flag on the IO. Something like

    pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)

IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
submitted using something like the existing

    pgaio_submit_staged();

(although renaming it to something with batch in the name might be
appropriate)

That way there's no explicit "we are in a batch" state that needs to be
reset in case of errors.

Greetings,

Andres Freund
On Tue, Feb 11, 2025 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> Alternatively we could make pgaio_batch_begin() basically start a critical
> section, but that doesn't seem like a good idea, because too much that needs
> to happen around buffered IO isn't compatible with critical sections.

A critical section sounds like a bad plan.

> Does anybody see a need for batches to be nested? I'm inclined to think that
> that would be indicative of bugs and should therefore error/assert out.

I can imagine somebody wanting to do it, but I think we can just say
no. I mean, it's no different from WAL record construction. There's no
theoretical reason you couldn't want to concurrently construct
multiple WAL records and then submit them one after another, but if
you want to do that, you have to do your own bookkeeping. It seems
fine to apply the same principle here.

> One way we could avoid the need for a mechanism to reset-batch-in-progress
> would be to make batch submission controlled by a flag on the IO. Something
> like
>     pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)
>
> IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
> submitted using something like the existing
>     pgaio_submit_staged();
> (although renaming it to something with batch in the name might be
> appropriate)
>
> That way there's no explicit "we are in a batch" state that needs to be reset
> in case of errors.

I'll defer to Thomas or others on whether this is better or worse,
because I don't know. It means that the individual I/Os have to know
that they are in a batch, which isn't necessary with the begin/end
batch interface. But if we're expecting that to happen in a pretty
confined amount of code -- similar to WAL record construction -- then
that might not be a problem anyway.

I think if you don't do this, I'd do (sub)xact callbacks rather than a
resowner integration, unless you decide you want to support multiple
concurrent batches. You don't really need or want to tie it to a
resowner unless there are multiple objects, each of which can have its
own resources.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Feb 11, 2025 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> ...
> Alternatively we could make pgaio_batch_begin() basically start a critical
> section, but that doesn't seem like a good idea, because too much that needs
> to happen around buffered IO isn't compatible with critical sections.
>
> Does anybody see a need for batches to be nested? I'm inclined to think that
> that would be indicative of bugs and should therefore error/assert out.

Fwiw, in a similar situation in the past, I just blocked, waiting for
the in-flight batch to complete, before sending the next batch. So I
had something like:

void
begin_batch(...)
{
    if (batch_in_progress())
        complete_batch(...);

    /* ok, start the batch now */
}

In my case, batches were nested because different access methods
(e.g., Index) can call/trigger other access methods (Heap), and both
access methods might want to issue batch reads. However, the "inner"
access method might not be aware of the "outer" access method. For
simplicity, then, I just completed the outer batch.

Note that this is not optimal for performance (because a nested batch
ends up stalling the outer batch), but it does keep the code simple...

James
Hi,

On 2025-02-12 13:00:22 -0500, Robert Haas wrote:
> On Tue, Feb 11, 2025 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:
> > One way we could avoid the need for a mechanism to reset-batch-in-progress
> > would be to make batch submission controlled by a flag on the IO. Something
> > like
> >     pgaio_io_set_flag(ioh, PGAIO_HF_BATCH_SUBMIT)
> >
> > IFF PGAIO_HF_BATCH_SUBMIT is set, the IOs would need to be explicitly
> > submitted using something like the existing
> >     pgaio_submit_staged();
> > (although renaming it to something with batch in the name might be
> > appropriate)
> >
> > That way there's no explicit "we are in a batch" state that needs to be reset
> > in case of errors.
>
> I'll defer to Thomas or others on whether this is better or worse,
> because I don't know. It means that the individual I/Os have to know
> that they are in a batch, which isn't necessary with the begin/end
> batch interface. But if we're expecting that to happen in a pretty
> confined amount of code -- similar to WAL record construction -- then
> that might not be a problem anyway.
>
> I think if you don't do this, I'd do (sub)xact callbacks rather than a
> resowner integration, unless you decide you want to support multiple
> concurrent batches. You don't really need or want to tie it to a
> resowner unless there are multiple objects each of which can have its
> own resources.

I have working code for that. There unfortunately is an annoying
problem: it afaict is not really possible to trigger a
WARNING/ERROR/Assert in xact callbacks at the end of commands in an
explicit transaction. Consider something like:

BEGIN;
SELECT start_but_not_end_aio_batch();

We would like to flag that the SELECT query started an AIO batch but
didn't end it. But at the moment there afaict is no proper way to do
that, because xact.c won't actually do much at the end of a command in
a transaction block:

        /*
         * This is the case when we have finished executing a command
         * someplace within a transaction block.  We increment the command
         * counter and return.
         */
    case TBLOCK_INPROGRESS:
    case TBLOCK_IMPLICIT_INPROGRESS:
    case TBLOCK_SUBINPROGRESS:
        CommandCounterIncrement();
        break;

If one instead integrates with resowners, that kind of thing works,
because exec_simple_query() calls PortalDrop(), which in turn calls
ResourceOwnerRelease().

And we don't reliably warn at a later time. While xact.c integration
triggers a warning for:

BEGIN;
SELECT start_but_not_end_aio_batch();
COMMIT;

as we'd still have the batch open at COMMIT, it wouldn't trigger for:

BEGIN;
SELECT start_but_not_end_aio_batch();
ROLLBACK;

as AbortTransaction() might be called in an error path and therefore
can't assume that it's a problem for a batch to still be open.

I guess I could just put something alongside that
CommandCounterIncrement() call, but that doesn't seem right. I guess
putting it alongside the ResourceOwnerRelease() in PortalDrop() is a
bit less bad? But still doesn't seem great.

Just using resowners doesn't seem right either; it's not really free to
register something with resowners, and for read-intensive IO we can
start a *lot* of batches, so doing unnecessary work isn't great.

Greetings,

Andres Freund
On Mon, Feb 17, 2025 at 4:01 PM Andres Freund <andres@anarazel.de> wrote:
> If one instead integrates with resowners, that kind of thing works, because
> exec_simple_query() calls PortalDrop(), which in turn calls
> ResourceOwnerRelease().

Hmm, so maybe that's a reason to do it via resowner.c, then. The fact
that it's a singleton object is a bit annoying, but you could make it
not a singleton, and then either pass the relevant one to the interface
functions, or store the current one in a global variable similar to
CurrentMemoryContext.

> I guess I could just put something alongside that CommandCounterIncrement()
> call, but that doesn't seem right. I guess putting it alongside the
> ResourceOwnerRelease() in PortalDrop() is a bit less bad? But still doesn't
> seem great.

The thing that's weird about that is that it isn't really logically
linked to the portal. It feels like it more properly belongs in
StartTransactionCommand() / CommitTransactionCommand().

> Just using resowners doesn't seem right either, it's not really free to
> register something with resowners, and for read intensive IO we can start a
> *lot* of batches, so doing unnecessary work isn't great.

You don't necessarily have to register a new object for every batch, do
you? You could just register one and keep reusing it for the lifetime
of the query.

-- 
Robert Haas
EDB: http://www.enterprisedb.com