Re: AIO v2.5 - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 7co5jtx2jkvpg5pde7f5wyub6w6zptfw4sbuj4n5j4iedj5zlg@u3mzxsgopnhl Whole thread Raw |
In response to | Re: AIO v2.5 (Noah Misch <noah@leadboat.com>) |
Responses |
Re: AIO v2.5
|
List | pgsql-hackers |
Hi, On 2025-03-19 18:11:18 -0700, Noah Misch wrote: > On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote: > > On 2025-03-19 14:25:30 -0700, Noah Misch wrote: > > Hm, we retry more frequently that that if there are new connections... Maybe > > just "try again next time"? > > Works for me. > > > > And these individual lines from "git grep BM_IO_IN_PROGRESS": > > > > > > * i.e at most one BM_IO_IN_PROGRESS bit is set per proc. > > > > > > The last especially. > > > > Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's > > 12f3867f5534. I think the comment can just be deleted? > > Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine. > > > > * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the > > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > > * only one process at a time can set the BM_IO_IN_PROGRESS bit. > > > > > For the other three lines and the paragraph, the notion > > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or > > > being the process "doing a read" becomes less significant when one process > > > starts the IO and another completes it. > > > > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe > > Looking again, I agree they're okay. > > > > > * Now it's safe to write buffer to disk. Note that no one else should > > * have been able to write it while we were busy with log flushing because > > * we got the exclusive right to perform I/O by setting the > > * BM_IO_IN_PROGRESS bit. > > That's fine too. Maybe s/perform/stage/ or s/perform/start/. I put these comment changes into their own patch, as it seemed confusing to change them as part of one of the already queued commits. > > > I see this relies on md_readv_complete having converted "result" to blocks. > > > Was there some win from doing that as opposed to doing the division here? > > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier > > > to follow, to me. > > > > It seemed like that would be wrong layering - what if we had an smgr that > > could store data in a compressed format? The raw read would be of a smaller > > size. The smgr API deals in BlockNumbers, only the md.c layer should know > > about bytes. > > I hadn't thought of that. That's a good reason. I thought that was better documented, but alas, it wasn't. How about updating the documentation of smgrstartreadv to the following: /* * smgrstartreadv() -- asynchronous version of smgrreadv() * * This starts an asynchronous readv IO using the IO handle `ioh`. Other than * `ioh` all parameters are the same as smgrreadv(). * * Completion callbacks above smgr will be passed the result as the number of * successfully read blocks if the read [partially] succeeds. This maintains * the abstraction that smgr operates on the level of blocks, rather than * bytes. */ I briefly had a bug in test_aio's injection point that lead to *increasing* the number of bytes successfully read. That triggered an assertion failure in bufmgr.c, but not closer to the problem. Is it worth adding an assert against that to md_readv_complete? Can't quite decide. Greetings, Andres Freund
pgsql-hackers by date: