Re: AIO v2.5 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 20250325133321.54.nmisch@google.com Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote: > On 2025-03-24 17:45:37 -0700, Noah Misch wrote: > > (We may be due for a test mode that does smgrreleaseall() at every > > CHECK_FOR_INTERRUPTS()?) > > I suspect we are. I'm a bit afraid of even trying... > > ... > > It's extremely slow - but at least the main regression as well as the aio tests pass! One less thing! > > > I however don't particularly like the *start* or *prep* names, I've gone back > > > and forth on those a couple times. I could see "begin" work uniformly across > > > those. > > > > For ease of new readers understanding things, I think it helps for the > > functions that advance PgAioHandleState to have names that use words from > > PgAioHandleState. It's one less mapping to get into the reader's head. > > Unfortunately for me it's kind of the opposite in this case, see below. > > > > "Begin", "Start" and "prep" are all outside that taxonomy, making the reader > > learn how to map them to the taxonomy. What reward does the reader get at the > > end of that exercise? I'm not seeing one, but please do tell me what I'm > > missing here. > > Because the end state varies, depending on the number of previously staged > IOs, the IO method and whether batchmode is enabled, I think it's better if > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is > *not* aligned with an internal state name. It will just mislead readers to > think that there's a deterministic mapping when that does not exist. That's fair. Could we provide the mapping in a comment, something like the following? --- a/src/include/storage/aio_internal.h +++ b/src/include/storage/aio_internal.h @@ -34,5 +34,10 @@ * linearly through all states. * - * State changes should all go through pgaio_io_update_state(). + * State changes should all go through pgaio_io_update_state(). Its callers + * use these naming conventions: + * + * - A "start" function (e.g. FileStartReadV()) moves an IO from + * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most + * PGAIO_HS_COMPLETED_LOCAL. */ typedef enum PgAioHandleState > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different > naming that I just stopped seeing. > > I'll try to think more about this, perhaps I can make myself see your POV > more. > > As the patch stands, LockBufferForCleanup() can succeed when > > ConditionalLockBufferForCleanup() would have returned false. > > That's already true today, right? In master ConditionalLockBufferForCleanup() > for temp buffers checks LocalRefCount, whereas LockBufferForCleanup() doesn't. I'm finding a LocalRefCount check under LockBufferForCleanup: LockBufferForCleanup(Buffer buffer) { ... CheckBufferIsPinnedOnce(buffer); CheckBufferIsPinnedOnce(Buffer buffer) { if (BufferIsLocal(buffer)) { if (LocalRefCount[-buffer - 1] != 1) elog(ERROR, "incorrect local pin count: %d", LocalRefCount[-buffer - 1]); } else { if (GetPrivateRefCount(buffer) != 1) elog(ERROR, "incorrect local pin count: %d", GetPrivateRefCount(buffer)); } } > > Like the comment, I expect it's academic today. I expect it will stay > > academic. Anything that does a cleanup will start by reading the buffer, > > which will resolve any refcnt the AIO subsystems holds for a read. If there's > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on > > that. How about just removing the ConditionalLockBufferForCleanup() changes > > or replacing them with a comment (like the present paragraph)? > > I think we'll need an expanded version of what I suggest once we have writes - > but as you say, it shouldn't matter as long as we only have reads. So I think > moving the relevant changes, with adjusted caveats, to the bufmgr: write > change makes sense. Moving those changes works for me. I'm not currently seeing the need under writes, but that may get clearer upon reaching those patches. > > > > /* --- > > > > * Opt-in to using AIO batchmode. > > > > * > > > > * Submitting IO in larger batches can be more efficient than doing so > > > > * one-by-one, particularly for many small reads. It does, however, require > > > > * the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO > > > > * batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not: > > > > * a) block without first calling pgaio_submit_staged(), unless a > > > > * to-be-waited-on lock cannot be part of a deadlock, e.g. because it is > > > > * never acquired in a nested fashion > > > > * b) directly or indirectly start another batch pgaio_enter_batchmode() > > > > I think a callback could still do: > > > > pgaio_exit_batchmode() > > ... arbitrary code that might reach pgaio_enter_batchmode() ... > > pgaio_enter_batchmode() > > Yea - but I somehow doubt there are many cases where it makes sense to > deep-queue IOs within the callback. The cases I can think of are things like > ensuring the right VM buffer is in s_b. But if it turns out to be necessary, > what you seuggest would be an out. I don't foresee a callback specifically wanting to batch, but callbacks might call into other infrastructure that can elect to batch. The exit+reenter pattern would be better than adding no-batch options to other infrastructure. > Do you think it's worth mentioning the above workaround? I'm mildly inclined > that not. Perhaps not in that detail, but perhaps we can rephrase (b) to not imply exit+reenter is banned. Maybe "(b) start another batch (without first exiting one)". It's also fine as-is, though. > If it turns out to be actually useful to do nested batching, we can change it > so that nested batching *is* allowed, that'd not be hard. Good point. > I'm ok with all of these. In order of preference: > > 1) READ_STREAM_USE_BATCHING or READ_STREAM_BATCH_OK > 2) READ_STREAM_BATCHMODE_AWARE > 3) READ_STREAM_CALLBACK_BATCHMODE_AWARE Same for me.
pgsql-hackers by date: