Re: AIO v2.2 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: AIO v2.2 |
Date | |
Msg-id | 20250107191059.24.nmisch@google.com Whole thread Raw |
In response to | Re: AIO v2.2 (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Mon, Jan 06, 2025 at 04:40:26PM -0500, Andres Freund wrote: > On 2025-01-06 10:52:20 -0800, Noah Misch wrote: > > On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote: > - We have pretty no testing for IO errors. Yes, that's remained a gap. I've wondered how much to address this via targeted tests of specific sites vs. fuzzing, iterative fault injection, or some other approach closer to brute force. > > I'd be most interested in the > > cases that would be undetected deadlocks under a naive design. An example > > appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com > > That's a good one, yea. > > I think I'll try to translate the regression tests I wrote into an isolation > test, I hope that'll make it a bit easier to cover more cases. > > And then we'll need more injection points, I'm afraid :(. Sounds good. > > * - method_*.c - different ways of executing AIO (e.g. worker process) > > * - aio_io.c - method-independent code for specific IO ops (e.g. readv) > > * - aio_subject.c - callbacks at IO operation lifecycle events > > * - aio_init.c - per-fork and per-startup-process initialization > > I don't particularly like "per-startup-process", because "global > initialization" really is separate (and precedes) from startup processes > startup. Maybe "per-server and per-backend initialization"? That works for me. I wrote "per-startup-process" because it can happen more than once in a postmaster that reaches "all server processes terminated; reinitializing". That said, there's little risk of "per-server" giving folks a materially wrong idea. > > * - aio.c - all other topics > > * - read_stream.c - helper for reading buffered relation data > > Did the order you listed the files have a system to it? If so, what is it? The rough idea was to avoid forward references: * - method_*.c - different ways of executing AIO (e.g. worker process) makes sense without other background * - aio_io.c - method-independent code for specific IO ops (e.g. readv) refers to methods, so listed after methods * - aio_subject.c - callbacks at IO operation lifecycle events refers to IO ops, so listed after aio_io.c * - aio_init.c - per-fork and per-startup-process initialization no surprise that this code will exist somewhere, so list it lower to deemphasize it * - aio.c - all other topics default route, hence last * - read_stream.c - helper for reading buffered relation data could just as easily come first, not last could be under a distinct heading like "Recommended abstractions:" > > I'd benefit from seeing things in this order: > > > > - "why" > > - condensed usage example like manpage SYNOPSIS, comments and decls removed > > - PgAioHandleState and discussion of valid transitions > > Hm - why have PgAioHandleState and its states before the usage example? Seems > like it'd be harder to understand that way. I usually look at the data structures before the code that manipulates them. (Similarly, I look at the map before the directions.) I wouldn't mind it appearing after the usage example, since order preferences do vary. > > - usage example as it is, with full comments > > - the rest > > > > ## Synopsis > > > > ioh = pgaio_io_get(CurrentResourceOwner, &ioret); > > pgaio_io_get_ref(ioh, &ior); > > pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ); > > pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1); > > smgrstartreadv(ioh, operation->smgr, forknum, blkno, > > BufferGetBlock(buffer), 1); > > pgaio_submit_staged(); > > pgaio_io_ref_wait(&ior); > > if (ioret.result.status == ARS_ERROR) > > pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR); > > Happy to add this, but I'm not entirely sure if that's really that useful to > have without commentary? The synopsis in manpages is helpful because it > provides the signature of various functions, but this wouldn't... I'm not sure either. Let's drop that idea. > > > +### IO can be started in critical sections > > ... > > > +The need to be able to execute IO in critical sections has substantial design > > > +implication on the AIO subsystem. Mainly because completing IOs (see prior > > > +section) needs to be possible within a critical section, even if the > > > +to-be-completed IO itself was not issued in a critical section. Consider > > > +e.g. the case of a backend first starting a number of writes from shared > > > +buffers and then starting to flush the WAL. Because only a limited amount of > > > +IO can be in-progress at the same time, initiating the IO for flushing the WAL > > > +may require to first finish executing IO executed earlier. > > > > The last line's two appearances of the word "execute" read awkwardly to me, > > and it's an opportunity to use PgAioHandleState terms. Consider writing the > > last line like "may first advance an existing IO from AHS_PREPARED to > > AHS_COMPLETED_SHARED". > > It is indeed awkward. I don't love referencing the state-constants here > though, somehow that feels like a reference-cycle ;). What about this: > > > ... Consider > > e.g. the case of a backend first starting a number of writes from shared > > buffers and then starting to flush the WAL. Because only a limited amount of > > IO can be in-progress at the same time, initiating IO for flushing the WAL may > > require to first complete IO that was started earlier. That's non-awkward. I like specific state names here since "complete" could mean AHS_COMPLETED_SHARED or AHS_COMPLETED_LOCAL, and it matters here. If the state names changed so AHS_COMPLETED_LOCAL dropped the word "complete", that too would solve it. > > > +### AIO Callbacks > > ... > > > +In addition to completion, AIO callbacks also are called to "prepare" an > > > +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for > > > +IO to/from shared buffers, which is required to handle the case where the > > > +issuing backend errors out and releases its own pins. > > > > Reading this, it's not obvious to me how to reconcile "finishing an IO could > > require pin acquisition" with "finishing an IO could happen in a critical > > section". Pinning a buffer in a critical section sounds bad. I vaguely > > recall understanding how it was okay as of my September review, but I've > > already forgotten. Can this text have a sentence making that explicit? > > Ah, yes, that's easy to misunderstand. The answer basically is that we don't > newly pin a buffer, we just increment the reference count by 1. That should > never fail. > > How about: > > In addition to completion, AIO callbacks also are called to "prepare" an > > IO. This is, e.g., used to increase buffer reference counts to account for the > > AIO subsystem referencing the buffer, which is required to handle the case > > where the issuing backend errors out and releases its own pins while the IO is > > still ongoing. Perfect. > > > +### AIO Subjects > > > + > > > +In addition to the completion callbacks describe above, each AIO Handle has > > > +exactly one "subject". Each subject has some space inside an AIO Handle with > > > +information specific to the subject and can provide callbacks to allow to > > > +reopen the underlying file (required for worker mode) and to describe the IO > > > +operation (used for debug logging and error messages). > > > > Can this say roughly how to decide when to add a new subject? > > Hm, there obviously is some fuzziness. I was trying to get to some of that by > mentioning that the subject needs to know how to [re-]open a file and describe > the target of the IO in terms that make sense to the user. > > E.g. smgr seemed to make sense as a subject as the smgr layer knows how to > open a file by delegating that to the layer below and the layer above just > knows about smgr, not md.c (or other potential smgr implementations). > > The reason to keep this separate from the callbacks was that smgr IO going > through shared buffers, bypassing shared buffers and different smgr > implemenentations all could share the same subject implementation, even if > callbacks would differ between these use cases. > > > How about: > > > I.e., if two different uses of AIO can describe the identity of the file being > > operated on the same way, it likely makes sense to use the same > > subject. E.g. different smgr implementations can describe IO with > > RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In > > contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr > > and it would not make sense to use the same subject for smgr and WAL. Sounds good to include. > > Can this have a sentence on how this fits in bounded shmem, given the lack of > > guarantees about a backend's responsiveness? In other words, what makes it > > okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED > > to AHS_COMPLETED_LOCAL? > > I agree this should be explained somewhere - but not sure this is the best > place. > > The reason it's ok is that each backend has a limited number of AIO handles > and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b) > wait for the oldest IO to finish. Reading it again today, that topic may already have adequate coverage.
pgsql-hackers by date: