Re: AIO v2.5 - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: AIO v2.5
Msg-id: cz6eu7prqdibz35nljpqwyzyp37afncsp4r3fk4lkssiecdjck@vwew2c2vs2ar
In response to: Re: AIO v2.5 (Noah Misch <noah@leadboat.com>)
Responses: Re: AIO v2.5, Re: AIO v2.5
List: pgsql-hackers
Hi,

On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> > - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> >   platforms that lack posix_fadvise. But with AIO we can read ahead without
> >   posix_fadvise().
> >
> >   It'd not really make anything worse than today to not remove the limit, but
> >   it'd be pretty weird to prevent windows etc from benefiting from AIO. Need
> >   to look around and see whether it would require anything other than doc
> >   changes.
>
> Worth changing, but non-blocking.

Thankfully Melanie submitted a patch for that...

> Other than the smgr patch review sent on its own thread, I've not yet reviewed
> any of these patches comprehensively. Given the speed of change, I felt it
> was time to flush comments buffered since 2025-03-11:

Thanks!

> commit 0284401 wrote:
> > aio: Basic subsystem initialization
>
> > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> >  		 */
> >  		LWLockReleaseAll();
> >  		pgstat_report_wait_end();
> > +		pgaio_error_cleanup();
>
> AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
> WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio(). Is that
> proper? They do call pgaio_error_cleanup() as seen here, so the only loss is
> some asserts. (The load-bearing part does get done.)

I don't think it's particularly good that we use the AtEOXact_* functions in
the sigsetjmp blocks, that feels like a weird mixup of infrastructure to me.
So this was intentional.

> commit da72269 wrote:
> > aio: Add core asynchronous I/O infrastructure
>
> > + * This could be in aio_internal.h, as it is not pubicly referenced, but
>
> typo -> publicly

/me has a red face.

> commit 55b454d wrote:
> > aio: Infrastructure for io_method=worker
>
> > +		/* Try to launch one. */
> > +		child = StartChildProcess(B_IO_WORKER);
> > +		if (child != NULL)
> > +		{
> > +			io_worker_children[id] = child;
> > +			++io_worker_count;
> > +		}
> > +		else
> > +			break;				/* XXX try again soon? */
>
> I'd change the comment to something like one of:
>   retry after DetermineSleepTime()
>   next LaunchMissingBackgroundProcesses() will retry in <60s

Hm, we retry more frequently than that if there are new connections... Maybe
just "try again next time"?

> On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > - Decide what to do about the smgr interrupt issue
>
> Replied on that thread. It's essentially ready.

Cool, will reply there in a bit.

> > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
>
> Some comments about BM_IO_IN_PROGRESS may need updates. This paragraph:
>
>  * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
>    buffer to complete (and in releases before 14, it was accompanied by a
>    per-buffer LWLock). The process doing a read or write sets the flag for the
>    duration, and processes that need to wait for it to be cleared sleep on a
>    condition variable.

First draft:

 * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
   buffer to complete (and in releases before 14, it was accompanied by a
   per-buffer LWLock). The process starting a read or write sets the flag. When
   the I/O is completed, be it by the process that initiated the I/O or by
   another process, the flag is removed and the buffer's condition variable is
   signalled. Processes that need to wait for the I/O to complete can wait for
   the asynchronous I/O using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to
   be unset by sleeping on the buffer's condition variable.

> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>
>  * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>
> The last especially.

Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
12f3867f5534.
I think the comment can just be deleted?

>  * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
>  * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>  * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>
> For the other three lines and the paragraph, the notion of a process
> "holding" BM_IO_IN_PROGRESS or being the process to "set" it or being the
> process "doing a read" becomes less significant when one process starts the
> IO and another completes it.

Hm. I think they'd be ok as-is, but we can probably improve them. Maybe

 * Now it's safe to write buffer to disk. Note that no one else should
 * have been able to write it while we were busy with log flushing because
 * we got the exclusive right to perform I/O by setting the
 * BM_IO_IN_PROGRESS bit.

> > +	/* we better have ensured the buffer is present until now */
> > +	Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> I'd delete that comment; to me, the assertion alone is clearer.

Ok.

> > +			ereport(LOG,
> > +					(errcode(ERRCODE_DATA_CORRUPTED),
> > +					 errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> This is changing level s/WARNING/LOG/. That seems orthogonal to the patch's
> goals; is it needed? If so, I recommend splitting it out as a preliminary
> patch, to highlight the behavior change for release notes.

No, it's not needed. I think I looked over the patch at some point and
considered the log-level wrong according to our guidelines and thought I'd
broken it.

> > +	/*
> > +	 * If the entire failed on a lower-level, each buffer needs to be
>
> Missing word, probably fix like:
> s,entire failed on a lower-level,entire I/O failed on a lower level,

Yep.

> > +	 * marked as failed. In case of a partial read, some buffers may be
> > +	 * ok.
> > +	 */
> > +	failed =
> > +		prior_result.status == ARS_ERROR
> > +		|| prior_result.result <= buf_off;
>
> I didn't run an experiment to check the following, but I think this should be
> s/<=/</. Suppose we requested two blocks and read some amount of bytes
> [1*BLCKSZ, 2*BLCKSZ - 1]. md_readv_complete will store result=1. buf_off==0
> should compute failed=false here, but buf_off==1 should compute failed=true.

Huh, you might be right. I thought I wrote a test for this, I wonder why it
didn't catch the problem...

> I see this relies on md_readv_complete having converted "result" to blocks.
> Was there some win from doing that as opposed to doing the division here?
> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> to follow, to me.

It seemed like that would be wrong layering - what if we had an smgr that
could store data in a compressed format? The raw read would be of a smaller
size. The smgr API deals in BlockNumbers, only the md.c layer should know
about bytes.

> > +
> > +		buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> > +											   is_temp);
> > +
> > +		/*
> > +		 * If there wasn't any prior error and the IO for this page failed in
> > +		 * some form, set the whole IO's to the page's result.
>
> s/the IO for this page/page verification/
> s/IO's/IO's result/

Agreed.

Thanks for the review!

Greetings,

Andres Freund