Re: AIO v2.5 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 20250325155808.f7.nmisch@google.com Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
On Tue, Mar 25, 2025 at 11:26:14AM -0400, Andres Freund wrote: > On 2025-03-25 06:33:21 -0700, Noah Misch wrote: > > On Mon, Mar 24, 2025 at 10:30:27PM -0400, Andres Freund wrote: > > > On 2025-03-24 17:45:37 -0700, Noah Misch wrote: > > > > (We may be due for a test mode that does smgrreleaseall() at every > > > > CHECK_FOR_INTERRUPTS()?) > > > > > > I suspect we are. I'm a bit afraid of even trying... > > > > > > ... > > > > > > It's extremely slow - but at least the main regression as well as the aio tests pass! > > > > One less thing! > > Unfortunately I'm now doubting the thoroughness of my check - while I made > every CFI() execute smgrreleaseall(), I didn't trigger CFI() in cases where we > trigger it conditionally. E.g. elog(DEBUGN, ...) only executes a CFI if > log_min_messages <= DEBUGN... > > I'll try that in a bit. While having nagging thoughts that we might be releasing FDs before io_uring gets them into kernel custody, I tried this hack to maximize FD turnover: static void ReleaseLruFiles(void) { #if 0 while (nfile + numAllocatedDescs + numExternalFDs >= max_safe_fds) { if (!ReleaseLruFile()) break; } #else while (ReleaseLruFile()) ; #endif } "make check" with default settings (io_method=worker) passes, but io_method=io_uring in the TEMP_CONFIG file got different diffs in each of two runs. s/#if 0/#if 1/ (restore normal FD turnover) removes the failures. Here's the richer of the two diffs: diff -U3 src/test/regress/expected/sanity_check.out src/test/regress/results/sanity_check.out --- src/test/regress/expected/sanity_check.out 2024-10-24 12:43:25.741817594 -0700 +++ src/test/regress/results/sanity_check.out 2025-03-25 08:27:44.875151566 -0700 @@ -1,4 +1,7 @@ VACUUM; +ERROR: index "pg_enum_oid_index" contains corrupted page at block 2 +HINT: Please REINDEX it. +CONTEXT: while vacuuming index "pg_enum_oid_index" of relation "pg_catalog.pg_enum" -- -- Sanity check: every system catalog that has OIDs should have -- a unique index on OID. This ensures that the OIDs will be unique, diff -U3 src/test/regress/expected/oidjoins.out src/test/regress/results/oidjoins.out --- src/test/regress/expected/oidjoins.out 2023-07-06 19:58:07.686364439 -0700 +++ src/test/regress/results/oidjoins.out 2025-03-25 08:28:02.584335458 -0700 @@ -233,6 +233,8 @@ NOTICE: checking pg_policy {polrelid} => pg_class {oid} NOTICE: checking pg_policy {polroles} => pg_authid {oid} NOTICE: checking pg_default_acl {defaclrole} => pg_authid {oid} +WARNING: FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,5)",0) +WARNING: FK VIOLATION IN pg_default_acl({defaclrole}): ("(1,7)",402654464) NOTICE: checking pg_default_acl {defaclnamespace} => pg_namespace {oid} NOTICE: checking pg_init_privs {classoid} => pg_class {oid} NOTICE: checking pg_seclabel {classoid} => pg_class {oid} > > > Because the end state varies, depending on the number of previously staged > > > IOs, the IO method and whether batchmode is enabled, I think it's better if > > > the "function naming pattern" (i.e. FileStartReadv, smgrstartreadv etc) is > > > *not* aligned with an internal state name. It will just mislead readers to > > > think that there's a deterministic mapping when that does not exist. > > > > That's fair. Could we provide the mapping in a comment, something like the > > following? > > Yes! > > I wonder if it should also be duplicated or referenced elsewhere, although I > am not sure where precisely. I considered the README.md also, but adding that wasn't an obvious win. > > --- a/src/include/storage/aio_internal.h > > +++ b/src/include/storage/aio_internal.h > > @@ -34,5 +34,10 @@ > > * linearly through all states. > > * > > - * State changes should all go through pgaio_io_update_state(). > > + * State changes should all go through pgaio_io_update_state(). Its callers > > + * use these naming conventions: > > + * > > + * - A "start" function (e.g. FileStartReadV()) moves an IO from > > + * PGAIO_HS_HANDED_OUT to at least PGAIO_HS_STAGED and at most > > + * PGAIO_HS_COMPLETED_LOCAL. > > */ > > typedef enum PgAioHandleState > > One detail I'm not sure about: The above change is correct, but perhaps a bit > misleading, because we can actually go "back" to IDLE. Not sure how to best > phrase that though. Not sure either. Maybe the above could change to "to PGAIO_HS_STAGED or any subsequent state" and the comment at PGAIO_HS_STAGED could say like "Once in this state, concurrent activity could move the IO all the way to PGAIO_HS_COMPLETED_LOCAL and recycle it back to IDLE." > > > That's not an excuse for pgaio_io_prep* though, that's a pointlessly different > > > naming that I just stopped seeing. > > I assume you're on board with renaming _io_prep* to _io_start_*? Yes. > > > I'll try to think more about this, perhaps I can make myself see your POV > > > more. > > CheckBufferIsPinnedOnce(Buffer buffer) > > { > > if (BufferIsLocal(buffer)) > > { > > if (LocalRefCount[-buffer - 1] != 1) > > elog(ERROR, "incorrect local pin count: %d", > > LocalRefCount[-buffer - 1]); > > } > > else > > { > > if (GetPrivateRefCount(buffer) != 1) > > elog(ERROR, "incorrect local pin count: %d", > > GetPrivateRefCount(buffer)); > > } > > } > > Pretty random orthogonal thought, that I was reminded of by the above code > snippet: > > It sure seems we should at some point get rid of LocalRefCount[] and just use > the GetPrivateRefCount() infrastructure for both shared and local buffers. I > don't think the GetPrivateRefCount() infrastructure cares about > local/non-local, leaving a few asserts aside. If we do that, and start to use > BM_IO_IN_PROGRESS, combined with ResourceOwnerRememberBufferIO(), the set of > differences between shared and local buffers would be a lot smaller. That sounds promising. > > > > Like the comment, I expect it's academic today. I expect it will stay > > > > academic. Anything that does a cleanup will start by reading the buffer, > > > > which will resolve any refcnt the AIO subsystems holds for a read. If there's > > > > an AIO write, the LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE) will block on > > > > that. How about just removing the ConditionalLockBufferForCleanup() changes > > > > or replacing them with a comment (like the present paragraph)? > > > > > > I think we'll need an expanded version of what I suggest once we have writes - > > > but as you say, it shouldn't matter as long as we only have reads. So I think > > > moving the relevant changes, with adjusted caveats, to the bufmgr: write > > > change makes sense. > > > > Moving those changes works for me. I'm not currently seeing the need under > > writes, but that may get clearer upon reaching those patches. > > FWIW, I don't think it's currently worth looking at the write side in detail, Got it. (I meant I didn't see a first-principles need, not that I had deduced lack of need from a specific writes implementation.) > > > Do you think it's worth mentioning the above workaround? I'm mildly inclined > > > that not. > > > > Perhaps not in that detail, but perhaps we can rephrase (b) to not imply > > exit+reenter is banned. Maybe "(b) start another batch (without first exiting > > one)". It's also fine as-is, though. > > I updated it to: > > * b) start another batch (without first exiting batchmode and re-entering > * before returning) That's good.
pgsql-hackers by date: