Re: AIO v2.5 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 20250324002939.5c.nmisch@google.com Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
commit 247ce06b wrote: > + pgaio_io_reopen(ioh); > + > + /* > + * To be able to exercise the reopen-fails path, allow injection > + * points to trigger a failure at this point. > + */ > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN"); > + > + error_errno = 0; > + error_ioh = NULL; > + > + /* > + * We don't expect this to ever fail with ERROR or FATAL, no need > + * to keep error_ioh set to the IO. > + * pgaio_io_perform_synchronously() contains a critical section to > + * ensure we don't accidentally fail. > + */ > + pgaio_io_perform_synchronously(ioh); A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a HOLD_INTERRUPTS() should surround the above region of code. It's likely hard to reproduce a problem, because pgaio_io_call_inj() does nothing in many builds, and pgaio_io_perform_synchronously() starts by entering a critical section. On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > Attached v2.11 > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd > +int > +FileStartReadV(PgAioHandle *ioh, File file, > + int iovcnt, off_t offset, > + uint32 wait_event_info) > +{ > + int returnCode; > + Vfd *vfdP; > + > + Assert(FileIsValid(file)); > + > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d", > + file, VfdCache[file].fileName, > + (int64) offset, > + iovcnt)); > + > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > + vfdP = &VfdCache[file]; > + > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset); FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from functions so named. The "start" verb sounds to me like unconditional PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do you see it? > +/* > + * AIO error reporting callback for mdstartreadv(). > + * > + * Errors are encoded as follows: > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0 I recommend replacing "errno != 0" with either "that errno" or "errno == error_data". > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design Ready for commit apart from some trivia: > +if (ioret.result.status == PGAIO_RS_ERROR) > + pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR); I think ioret and aio_ret are supposed to be the same object. If that's right, change one of the names. Likewise elsewhere in this file. > +The central API piece for postgres' AIO abstraction are AIO handles. To > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and > +then "defined", i.e. associate an IO operation with the handle. s/"defined"/"define" it/ or similar > +The "solution" to this the ability to associate multiple completion callbacks s/this the/this is the/ > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer) > Assert(refcount > 0); > if (refcount != 1) > return false; > + > + /* > + * Check that the AIO subsystem doesn't have a pin. Likely not > + * possible today, but better safe than sorry. > + */ > + bufHdr = GetLocalBufferDescriptor(-buffer - 1); > + buf_state = pg_atomic_read_u32(&bufHdr->state); > + refcount = BUF_STATE_GET_REFCOUNT(buf_state); > + Assert(refcount > 0); > + if (refcount != 1) > + return false; > + LockBufferForCleanup() should get code like this ConditionalLockBufferForCleanup() code, either now or when "not possible today" ends. Currently, it just assumes all local buffers are cleanup-lockable: /* Nobody else to wait for */ if (BufferIsLocal(buffer)) return; > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) > > buf_state = pg_atomic_read_u32(&bufHdr->state); > > - if (check_unreferenced && LocalRefCount[bufid] != 0) > + /* > + * We need to test not just LocalRefCount[bufid] but also the BufferDesc > + * itself, as the latter is used to represent a pin by the AIO subsystem. > + * This can happen if AIO is initiated and then the query errors out. > + */ > + if (check_unreferenced && > + (LocalRefCount[bufid] != 0 || BUF_STATE_GET_REFCOUNT(buf_state) != 0)) > elog(ERROR, "block %u of %s is still referenced (local %u)", I didn't write a test to prove it, but I'm suspecting we'll reach the above ERROR with this sequence: CREATE TEMP TABLE foo ...; [some command that starts reading a block of foo into local buffers, then ERROR with IO ongoing] DROP TABLE foo; DropRelationAllLocalBuffers() calls InvalidateLocalBuffer(bufHdr, true). I think we'd need to do like pgaio_shutdown() and finish all IOs (or all IOs for the particular rel) before InvalidateLocalBuffer(). Or use something like the logic near elog(ERROR, "buffer is pinned in InvalidateBuffer") in corresponding bufmgr code. I think that bufmgr ERROR is unreachable, since only a private refcnt triggers that bufmgr ERROR. Is there something preventing the localbuf error from being a problem? (This wouldn't require changes to the current patch; responsibility would fall in a bufmgr AIO patch.) > Subject: [PATCH v2.11 09/27] bufmgr: Implement AIO read support (Still reviewing this and later patches, but incidental observations follow.) > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, > + bool failed, bool is_temp) > +{ ... > + PgAioResult result; ... > + result.status = PGAIO_RS_OK; ... > + return result; gcc 14.2.0 -Werror gives me: bufmgr.c:7297:16: error: ‘result’ may be used uninitialized [-Werror=maybe-uninitialized] Zeroing the unset fields silenced it: --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -7221,3 +7221,3 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, char *bufdata = BufferGetBlock(buffer); - PgAioResult result; + PgAioResult result = { .status = PGAIO_RS_OK }; uint32 set_flag_bits; @@ -7238,4 +7238,2 @@ buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags, - result.status = PGAIO_RS_OK; - /* check for garbage data */ > Subject: [PATCH v2.11 13/27] aio: Basic read_stream adjustments for real AIO > @@ -416,6 +418,13 @@ read_stream_start_pending_read(ReadStream *stream) > static void > read_stream_look_ahead(ReadStream *stream) > { > + /* > + * Allow amortizing the cost of submitting IO over multiple IOs. This > + * requires that we don't do any operations that could lead to a deadlock > + * with staged-but-unsubmitted IO. > + */ > + pgaio_enter_batchmode(); We call read_stream_get_block() while in batchmode, so the stream callback needs to be ready for that. A complicated case is collect_corrupt_items_read_stream_next_block(), which may do its own buffer I/O to read in a vmbuffer for VM_ALL_FROZEN(). That's feeling to me like a recipe for corner cases reaching ERROR "starting batch while batch already in progress". Are there mitigating factors? > Subject: [PATCH v2.11 17/27] aio: Add test_aio module > + # verify that page verification errors are detected even as part of a > + # shortened multi-block read (tbl_corr, block 1 is tbl_corred) Is "tbl_corred" a typo of something? > --- /dev/null > +++ b/src/test/modules/test_aio/test_aio.c > @@ -0,0 +1,657 @@ > +/*------------------------------------------------------------------------- > + * > + * delay_execution.c > + * Test module to allow delay between parsing and execution of a query. > + * > + * The delay is implemented by taking and immediately releasing a specified > + * advisory lock. If another process has previously taken that lock, the > + * current process will be blocked until the lock is released; otherwise, > + * there's no effect. This allows an isolationtester script to reliably > + * test behaviors where some specified action happens in another backend > + * between parsing and execution of any desired query. > + * > + * Copyright (c) 2020-2025, PostgreSQL Global Development Group > + * > + * IDENTIFICATION > + * src/test/modules/delay_execution/delay_execution.c Header comment is surviving from copy-paste of delay_execution.c. > + * Tor tests we don't want the resowner release preventing us from s/Tor/For/
pgsql-hackers by date: