Re: AIO v2.5 - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: AIO v2.5
Date:
Msg-id: q2jujruxxksaz7k5m7wckkfz6qvmhiqy2wjwqcgqi3qaz25scb@n7hcd6qf47je
In response to: Re: AIO v2.5 (Noah Misch <noah@leadboat.com>)
Responses: Re: AIO v2.5
List: pgsql-hackers
Hi,

On 2025-03-23 17:29:39 -0700, Noah Misch wrote:
> commit 247ce06b wrote:
> > +		pgaio_io_reopen(ioh);
> > +
> > +		/*
> > +		 * To be able to exercise the reopen-fails path, allow injection
> > +		 * points to trigger a failure at this point.
> > +		 */
> > +		pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
> > +
> > +		error_errno = 0;
> > +		error_ioh = NULL;
> > +
> > +		/*
> > +		 * We don't expect this to ever fail with ERROR or FATAL, no need
> > +		 * to keep error_ioh set to the IO.
> > +		 * pgaio_io_perform_synchronously() contains a critical section to
> > +		 * ensure we don't accidentally fail.
> > +		 */
> > +		pgaio_io_perform_synchronously(ioh);
>
> A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee
> smgr_aio_reopen() stores.  Hence, I think smgrfd() should assert that
> interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a
> HOLD_INTERRUPTS() should surround the above region of code.  It's likely hard
> to reproduce a problem, because pgaio_io_call_inj() does nothing in many
> builds, and pgaio_io_perform_synchronously() starts by entering a critical
> section.

Hm, I guess you're right - it would be pretty bonkers for the injection point
to process interrupts, but it's much better to clarify the code to make that
not an option.

Once doing that, it seemed a similar assertion in pgaio_io_before_prep()
would be appropriate as well.
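The assert-instead-of-hold pattern discussed above can be sketched in isolation. This is a hypothetical miniature, not the real miscadmin.h/fd.c code: the macro definitions and function names here are stand-ins, and only the shape (caller holds interrupts across the whole region, callee merely asserts) mirrors the proposal.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for PostgreSQL's interrupt-holdoff machinery;
 * the real macros live in miscadmin.h and differ in detail.
 */
static int InterruptHoldoffCount = 0;

#define HOLD_INTERRUPTS()	(InterruptHoldoffCount++)
#define RESUME_INTERRUPTS() (assert(InterruptHoldoffCount > 0), InterruptHoldoffCount--)

/*
 * smgrfd()-style callee: instead of doing its own HOLD_INTERRUPTS(), it
 * asserts that the caller already holds interrupts, so a stray
 * CHECK_FOR_INTERRUPTS() cannot close the FD underneath it.
 */
static int
smgrfd_like(int fd)
{
	assert(InterruptHoldoffCount > 0);	/* caller must hold interrupts */
	return fd;
}

/* Caller wraps the whole reopen..perform region in a single holdoff. */
static int
worker_io_region(int fd)
{
	int			result;

	HOLD_INTERRUPTS();
	result = smgrfd_like(fd);	/* stands in for reopen + synchronous perform */
	RESUME_INTERRUPTS();
	return result;
}
```

Moving the holdoff to the caller makes the invariant checkable: any future code path that reaches the callee without the holdoff trips the assert in a cassert build, rather than silently racing with interrupt processing.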
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > Attached v2.11
>
> > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd
>
> > +int
> > +FileStartReadV(PgAioHandle *ioh, File file,
> > +			   int iovcnt, off_t offset,
> > +			   uint32 wait_event_info)
> > +{
> > +	int			returnCode;
> > +	Vfd		   *vfdP;
> > +
> > +	Assert(FileIsValid(file));
> > +
> > +	DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d",
> > +			   file, VfdCache[file].fileName,
> > +			   (int64) offset,
> > +			   iovcnt));
> > +
> > +	returnCode = FileAccess(file);
> > +	if (returnCode < 0)
> > +		return returnCode;
> > +
> > +	vfdP = &VfdCache[file];
> > +
> > +	pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset);
>
> FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED
> w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode.  I didn't expect that from
> functions so named.  The "start" verb sounds to me like unconditional
> PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED.  I like
> the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at
> PGAIO_HS_STAGED succinctly covers what to expect.  Hence, I recommend names
> FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread.  How do
> you see it?

I have a surprisingly strong negative reaction to that proposed naming. To me
the staging is a distinct step that happens *after* the IO is fully defined.
Making all the layered calls that lead up to that step carry that name would
IMO be a bad idea.

I however don't particularly like the *start* or *prep* names either, I've
gone back and forth on those a couple of times. I could see "begin" working
uniformly across those layers.

> > +/*
> > + * AIO error reporting callback for mdstartreadv().
> > + *
> > + * Errors are encoded as follows:
> > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0
>
> I recommend replacing "errno != 0" with either "that errno" or "errno ==
> error_data".

Done.
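The "errno == error_data" convention agreed on above can be illustrated with a toy encoder. This is a hypothetical miniature: the real PgAioResult is a packed bitfield in PostgreSQL's aio headers, and `MiniAioResult`/`mini_encode_result` are made-up names.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical miniature of the error encoding: on failure, error_data
 * carries the errno the IO failed with ("errno == error_data"); on
 * success both fields are zero.
 */
typedef struct MiniAioResult
{
	int			status;		/* 0 = ok, 1 = error */
	int			error_data;	/* errno of the failed IO, or 0 */
} MiniAioResult;

static MiniAioResult
mini_encode_result(ssize_t rc, int saved_errno)
{
	MiniAioResult result = {0};

	if (rc < 0)
	{
		result.status = 1;
		result.error_data = saved_errno;	/* errno == error_data */
	}
	return result;
}
```

The point of the wording fix is exactly this invariant: a reader of the completion callback can rely on error_data being *the* errno, not merely evidence that some nonzero errno occurred.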
> > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design
>
> Ready for commit apart from some trivia:

Great.

> > +if (ioret.result.status == PGAIO_RS_ERROR)
> > +	pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR);
>
> I think ioret and aio_ret are supposed to be the same object.  If that's
> right, change one of the names.  Likewise elsewhere in this file.

You're right.

> > +The central API piece for postgres' AIO abstraction are AIO handles. To
> > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
> > +then "defined", i.e. associate an IO operation with the handle.
>
> s/"defined"/"define" it/ or similar
>
> > +The "solution" to this the ability to associate multiple completion callbacks
>
> s/this the/this is the/

Applied.

> > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well
>
> > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer)
> >  		Assert(refcount > 0);
> >  		if (refcount != 1)
> >  			return false;
> > +
> > +		/*
> > +		 * Check that the AIO subsystem doesn't have a pin. Likely not
> > +		 * possible today, but better safe than sorry.
> > +		 */
> > +		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
> > +		buf_state = pg_atomic_read_u32(&bufHdr->state);
> > +		refcount = BUF_STATE_GET_REFCOUNT(buf_state);
> > +		Assert(refcount > 0);
> > +		if (refcount != 1)
> > +			return false;
> > +
>
> LockBufferForCleanup() should get code like this
> ConditionalLockBufferForCleanup() code, either now or when "not possible
> today" ends.  Currently, it just assumes all local buffers are
> cleanup-lockable:
>
> 	/* Nobody else to wait for */
> 	if (BufferIsLocal(buffer))
> 		return;

Kinda yes, kinda no? LockBufferForCleanup() assumes, even for shared buffers,
that the current backend can't be doing anything that conflicts with acquiring
a buffer pin - note that it doesn't check the backend-local pincount for
shared buffers either.
LockBufferForCleanup() kind of has to make that assumption, because there's no
way to wait for yourself to release another pin - obviously waiting in
LockBufferForCleanup() would prevent that from ever happening.

It's somewhat disheartening that the comments for LockBufferForCleanup() don't
mention that the caller somehow needs to be sure not to be called with other
pins on the relation. Nor does LockBufferForCleanup() have any asserts
checking how many backend-local pins exist.

Leaving documentation / asserts aside, I think this is largely a safe
assumption given current callers. With one exception, it's all vacuum or
recovery related code - as vacuum can't run in a transaction, we can't
conflict with another pin by the same backend.

The one exception is heap_surgery.c - it doesn't quite seem safe, the
surrounding query (or another query with a cursor) could have a pin on the
target block. The most obvious fix would be to use CheckTableNotInUse(), but
that might break some reasonable uses. Or maybe it should just not use a
cleanup lock, it's not obvious to me why it uses one. But tbh, I don't care
too much, given what heap_surgery is.

> > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced)
> >
> >  	buf_state = pg_atomic_read_u32(&bufHdr->state);
> >
> > -	if (check_unreferenced && LocalRefCount[bufid] != 0)
> > +	/*
> > +	 * We need to test not just LocalRefCount[bufid] but also the BufferDesc
> > +	 * itself, as the latter is used to represent a pin by the AIO subsystem.
> > +	 * This can happen if AIO is initiated and then the query errors out.
> > +	 */
> > +	if (check_unreferenced &&
> > +		(LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0))
> >  		elog(ERROR, "block %u of %s is still referenced (local %u)",
>
> I didn't write a test to prove it, but I'm suspecting we'll reach the above
> ERROR with this sequence:
>
>   CREATE TEMP TABLE foo ...;
>   [some command that starts reading a block of foo into local buffers, then
>    ERROR with IO ongoing]
>   DROP TABLE foo;

That seems plausible. I'll try to write a test after this email.

> DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true).  I
> think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for
> the particular rel) before InvalidateLocalBuffer().  Or use something like the
> logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in
> corresponding bufmgr code.

Just waiting for the IO in InvalidateBuffer() does seem like the best bet to
me. It's going to be pretty rarely reached; waiting for all concurrent IO
seems unnecessarily heavyweight. I don't think it matters much today, but once
we do things like asynchronously writing back buffers or WAL, the situation
will be different.

I think this points to the comment above the WaitIO() in InvalidateBuffer()
needing a bit of adapting - an in-progress read can trigger the WaitIO as
well. Something like:

	/*
	 * We assume the reason for it to be pinned is that either we were
	 * asynchronously reading the page in before erroring out or someone else
	 * is flushing the page out.  Wait for the IO to finish.  (This could be
	 * an infinite loop if the refcount is messed up... it would be nice to
	 * time out after awhile, but there seems no way to be sure how many loops
	 * may be needed.  Note that if the other guy has pinned the buffer but
	 * not yet done StartBufferIO, WaitIO will fall through and we'll
	 * effectively be busy-looping here.)
	 */

> > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > +						  bool failed, bool is_temp)
> > +{
> ...
> > +	PgAioResult result;
> ...
> > +	result.status = PGAIO_RS_OK;
> ...
> > +	return result;
>
> gcc 14.2.0 -Werror gives me:
>
>   bufmgr.c:7297:16: error: 'result' may be used uninitialized [-Werror=maybe-uninitialized]

Gngngng. Since when is it a bug for some fields of a struct to be
uninitialized, as long as they're not used?

Interestingly I don't see that warning, despite also using gcc 14.2.0. I'll
just move to your solution, but it seems odd.

> > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO
>
> > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream)
> >  static void
> >  read_stream_look_ahead(ReadStream *stream)
> >  {
> > +	/*
> > +	 * Allow amortizing the cost of submitting IO over multiple IOs. This
> > +	 * requires that we don't do any operations that could lead to a deadlock
> > +	 * with staged-but-unsubmitted IO.
> > +	 */
> > +	pgaio_enter_batchmode();
>
> We call read_stream_get_block() while in batchmode, so the stream callback
> needs to be ready for that.  A complicated case is
> collect_corrupt_items_read_stream_next_block(), which may do its own buffer
> I/O to read in a vmbuffer for VM_ALL_FROZEN().  That's feeling to me like a
> recipe for corner cases reaching ERROR "starting batch while batch already in
> progress".  Are there mitigating factors?

Ugh, yes, you're right. heap_vac_scan_next_block() is also affected.

I don't think "starting batch while batch already in progress" is the real
issue though - it seems easy enough to avoid starting another batch inside,
partially because the current cases seem unlikely to need to do batchable IO
inside. What worries me more is that code might block while there's
unsubmitted IO - which seems entirely plausible.

I can see a few approaches:

1) Declare that all read stream callbacks have to be careful and cope with
   batch mode

   I'm not sure how viable that is; not starting batches seems ok, but
   ensuring that the code doesn't block is a different story.

2) Have read stream users opt in to batching

   Presumably via a flag like READ_STREAM_USE_BATCHING. That'd be easy enough
   to implement and to add to the callsites where that's fine.

3) Teach read stream to "look ahead" far enough to determine all the blocks
   that could be issued in a batch outside of batchmode

   I think that's probably not a great idea, it'd lead us to looking further
   ahead than we really need to, which could increase "unfairness" in
   e.g. parallel sequential scan.

4) Just defer using batch mode for now

   It's a nice win with io_uring for random IO, e.g. from bitmap heap scans,
   but there's no need to immediately solve this.

I think regardless of what we go for, it's worth splitting "aio: Basic
read_stream adjustments for real AIO" into the actually basic parts
(i.e. introducing sync_mode) and the not actually so basic parts
(i.e. batching).

I suspect that 2) would be the best approach. Only the read stream user knows
what it needs to do in the callback.

> > Subject: [PATCH v2.11 17/27] aio: Add test_aio module
>
> > +	# verify that page verification errors are detected even as part of a
> > +	# shortened multi-block read (tbl_corr, block 1 is tbl_corred)
>
> Is "tbl_corred" a typo of something?

I think that was a search&replace of the table name gone wrong. It was just
supposed to be "corrupted".

> > + *
> > + * IDENTIFICATION
> > + *	  src/test/modules/delay_execution/delay_execution.c
>
> Header comment is surviving from copy-paste of delay_execution.c.

Oh, how I hate these pointless comments. Fixed.

Greetings,

Andres Freund
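[Editorial note: the opt-in approach, 2) above, could look roughly like the following. This is a hypothetical sketch, not the committed read_stream.c: the flag value, the `ReadStream` layout, and the stubbed pgaio_* functions are all invented here for illustration; only the flag name READ_STREAM_USE_BATCHING comes from the thread.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the real value in read_stream.h may differ. */
#define READ_STREAM_USE_BATCHING	(1 << 6)

typedef struct ReadStream
{
	int			flags;
	/* ... block numbers, buffers, per-buffer data, callback, etc. ... */
} ReadStream;

/* Stub batch-mode state, standing in for the AIO subsystem's tracking. */
static bool in_batchmode = false;

static void
pgaio_enter_batchmode(void)
{
	assert(!in_batchmode);		/* "starting batch while batch already in progress" */
	in_batchmode = true;
}

static void
pgaio_exit_batchmode(void)
{
	assert(in_batchmode);
	in_batchmode = false;
}

static void
read_stream_look_ahead(ReadStream *stream)
{
	bool		batched = (stream->flags & READ_STREAM_USE_BATCHING) != 0;

	/*
	 * Only batch if the stream's creator promised that its next-block
	 * callback neither starts another batch nor blocks while there is
	 * staged-but-unsubmitted IO.
	 */
	if (batched)
		pgaio_enter_batchmode();

	/* ... build and stage IOs, calling the user's callback ... */

	if (batched)
		pgaio_exit_batchmode();
}
```

This keeps the deadlock-avoidance contract where the knowledge lives: a callback like collect_corrupt_items_read_stream_next_block() simply doesn't pass the flag, while plain sequential-scan style callbacks can opt in.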