Re: AIO v2.5 - Mailing list pgsql-hackers

From:           Noah Misch
Subject:        Re: AIO v2.5
Msg-id:         20250319212530.80.nmisch@google.com
In response to: Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
List:           pgsql-hackers
On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > - Like you say, "redefine max_files_per_process to be about the number of
> >   files each *backend* will additionally open".  It will become normal that
> >   each backend's actual FD list length is max_files_per_process + MaxBackends
> >   if io_method=io_uring.  Outcome is not unlike
> >   v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
> >   v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
> >   mutate max_files_per_process.  Benchmark results should not change beyond
> >   the inter-major-version noise level unless one sets io_method=io_uring.  I'm
> >   feeling best about this one, but I've not been thinking about it long.
>
> Yea, I think that's something probably worth doing separately from Jelte's
> patch. I do think that it'd be rather helpful to have jelte's patch to
> increase NOFILE in addition though.

Agreed.

> > > > > +static void
> > > > > +maybe_adjust_io_workers(void)
> > > >
> > > > This also restarts workers that exit, so perhaps name it
> > > > start_io_workers_if_missing().
> > >
> > > But it also stops IO workers if necessary?
> >
> > Good point.  Maybe just add a comment like "start or stop IO workers to close
> > the gap between the running count and the configured count intent".
>
> It's now
> /*
>  * Start or stop IO workers, to close the gap between the number of running
>  * workers and the number of configured workers.  Used to respond to change of
>  * the io_workers GUC (by increasing and decreasing the number of workers), as
>  * well as workers terminating in response to errors (by starting
>  * "replacement" workers).
>  */

Excellent.
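(Restating that comment's contract in code, purely for illustration -- a
minimal sketch with hypothetical names, not the patch's identifiers:)

    #include <stdbool.h>

    /* Hypothetical stand-ins; the patch tracks workers in io_worker_children[]. */
    static int	io_worker_count = 0;	/* currently-running IO workers */

    extern bool start_one_io_worker(void);	/* think StartChildProcess(B_IO_WORKER) */
    extern void stop_one_io_worker(void);	/* signal one worker to exit */

    /*
     * Close the gap between the number of running workers and the configured
     * count, in either direction.
     */
    static void
    reconcile_io_workers(int configured)
    {
    	while (io_worker_count < configured)
    	{
    		if (!start_one_io_worker())
    			break;				/* launch failed; a later call retries */
    		io_worker_count++;
    	}
    	while (io_worker_count > configured)
    	{
    		stop_one_io_worker();
    		io_worker_count--;
    	}
    }

The key property is that a failed launch just breaks out; a later call closes
the remaining gap.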
> > > > > +{
> > > > ...
> > > > > +		/* Try to launch one. */
> > > > > +		child = StartChildProcess(B_IO_WORKER);
> > > > > +		if (child != NULL)
> > > > > +		{
> > > > > +			io_worker_children[id] = child;
> > > > > +			++io_worker_count;
> > > > > +		}
> > > > > +		else
> > > > > +			break;				/* XXX try again soon? */
> > > >
> > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > > function, replacing the current mix of callers?  That would be more
> > > > conducive to promptly doing the right thing after launch failure.
> > >
> > > I'm not sure that'd be a good idea - right now IO workers are started before
> > > the startup process, as the startup process might need to perform IO. If we
> > > started it only later in ServerLoop() we'd potentially do a fair bit of work,
> > > including starting checkpointer, bgwriter, bgworkers before we started IO
> > > workers. That shouldn't actively break anything, but it would likely make
> > > things slower.
> >
> > I missed that.  How about keeping the two calls associated with PM_STARTUP but
> > replacing the assign_io_workers() and process_pm_child_exit() calls with one
> > in LaunchMissingBackgroundProcesses()?
>
> I think replacing the call in assign_io_workers() is a good idea, that way we
> don't need assign_io_workers().
>
> Less convinced it's a good idea to do the same for process_pm_child_exit() -
> if IO workers errored out we'll launch backends etc before we get to
> LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
> seems a bit odd.

Works for me.

> I think LaunchMissingBackgroundProcesses() should be split into one that
> starts aux processes and one that starts bgworkers. The one maintaining aux
> processes should be called before we start backends, the latter not.

That makes sense, though I've not thought about it much.

> > > > > +		/*
> > > > > +		 * It's very unlikely, but possible, that reopen fails. E.g. due
> > > > > +		 * to memory allocations failing or file permissions changing or
> > > > > +		 * such. In that case we need to fail the IO.
> > > > > +		 *
> > > > > +		 * There's not really a good errno we can report here.
> > > > > +		 */
> > > > > +		error_errno = ENOENT;
> > > >
> > > > Agreed there's not a good errno, but let's use a fake errno that we're
> > > > mighty unlikely to confuse with an actual case of libc returning that
> > > > errno.  Like one of EBADF or EOWNERDEAD.
> > >
> > > Can we rely on that to be present on all platforms, including windows?
> >
> > I expect EBADF is universal.  EBADF would be fine.
>
> Hm, that's actually an error that could happen for other reasons, and IMO
> would be more confusing than ENOENT. The latter describes the issue to a
> reasonable extent, EBADFD seems like it would be more confusing.
>
> I'm not sure it's worth investing time in this - it really shouldn't happen,
> and we probably have bigger problems than the error code if it does. But if we
> do want to do something, I think I can see a way to report a dedicated error
> message for this.

I agree it's not worth much investment.  Let's leave that one as-is.  We can
always change it further if the not-really-good errno shows up too much.

> > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
> > lists some OSs not having it, the newest of which looks like NetBSD 9.3
> > (2022).  We could use it and add a #define for platforms lacking it.
>
> What would we define it as? I guess we could just pick a high value, but...

Some second-best value, but I withdraw that idea.
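(For the record, the withdrawn idea would have looked roughly like this; the
fallback value is an arbitrary pick, which is exactly the unappealing part:)

    #include <errno.h>

    /* Hypothetical fallback for platforms lacking EOWNERDEAD, e.g. older NetBSD. */
    #ifndef EOWNERDEAD
    #define EOWNERDEAD 9999			/* arbitrary "high value" */
    #endif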
On Wed, Mar 12, 2025 at 07:23:47PM -0400, Andres Freund wrote:
> Attached is v2.7, with the following changes:

> Unresolved:
>
> - Whether to continue starting new workers in process_pm_child_exit()

I'm fine with that continuing.  It's hurting ~nothing.

> - What to name the view (currently pg_aios). I'm inclined to go for
>   pg_io_handles right now.

I like pg_aios mildly better than pg_io_handles, since "handle" sounds
implementation-centric.

On Fri, Mar 14, 2025 at 03:43:15PM -0400, Andres Freund wrote:
> Attached is v2.8 with the following changes:

> - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've
>   been discussing on this thread recently. As that only affects io_uring, it
>   doesn't have to block other patches going in.

As above, I like the "redefine" option.

> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
>   platforms that lack posix_fadvise. But with AIO we can read ahead without
>   posix_fadvise().
>
>   It'd not really make anything worse than today to not remove the limit, but
>   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
>   to look around and see whether it would require anything other than doc
>   changes.

Worth changing, but non-blocking.

On Fri, Mar 14, 2025 at 03:58:43PM -0400, Andres Freund wrote:
> - Should the docs for debug_io_direct be rephrased and if so, how?
>   Perhaps it's worth going from
>
>        <para>
>         Currently this feature reduces performance, and is intended for
>         developer testing only.
>        </para>
>   to
>        <para>
>         Currently this feature reduces performance in many workloads, and is
>         intended for testing only.
>        </para>
>
> I.e. qualify the downside with "many workloads" and widen the audience ever so
> slightly?

Yes, that's good.

Other than the smgr patch review sent on its own thread, I've not yet reviewed
any of these patches comprehensively.  Given the speed of change, I felt it was
time to flush comments buffered since 2025-03-11:

commit 0284401 wrote:
> aio: Basic subsystem initialization

> @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
>  	 */
>  	LWLockReleaseAll();
>  	pgstat_report_wait_end();
> +	pgaio_error_cleanup();

AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio().  Is that
proper?  They do call pgaio_error_cleanup() as seen here, so the only loss is
some asserts.  (The load-bearing part does get done.)

commit da72269 wrote:
> aio: Add core asynchronous I/O infrastructure

> + * This could be in aio_internal.h, as it is not pubicly referenced, but

typo -> publicly

commit 55b454d wrote:
> aio: Infrastructure for io_method=worker

> +		/* Try to launch one. */
> +		child = StartChildProcess(B_IO_WORKER);
> +		if (child != NULL)
> +		{
> +			io_worker_children[id] = child;
> +			++io_worker_count;
> +		}
> +		else
> +			break;				/* XXX try again soon? */

I'd change the comment to something like one of:
  retry after DetermineSleepTime()
  next LaunchMissingBackgroundProcesses() will retry in <60s

On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> - Decide what to do about the smgr interrupt issue

Replied on that thread.  It's essentially ready.

> Questions / Unresolved:
>
> - Write support isn't going to land in 18, but there is a tiny bit of code
>   regarding writes in the code for bufmgr IO. I guess I could move that to a
>   later commit?
>
>   I'm inclined to leave it, the structure of the code only really makes sense
>   knowing that it's going to be shared between reads & writes.

Fine to leave it.

> - pg_aios view name

Covered above.

> Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support

Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:

 * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
   buffer to complete (and in releases before 14, it was accompanied by a
   per-buffer LWLock).  The process doing a read or write sets the flag for the
   duration, and processes that need to wait for it to be cleared sleep on a
   condition variable.

And these individual lines from "git grep BM_IO_IN_PROGRESS":

 *	I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
 * only one process at a time can set the BM_IO_IN_PROGRESS bit.
 * only one process at a time can set the BM_IO_IN_PROGRESS bit.
 * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.

The last especially.  For the other three lines and the paragraph, the notion
of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
being the process "doing a read" becomes less significant when one process
starts the IO and another completes it.
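(For context, the waiting side that paragraph describes has roughly this
shape -- a loose paraphrase of bufmgr's WaitIO(), not the exact code:)

    /* Loose paraphrase of bufmgr's WaitIO(); PostgreSQL types and helpers assumed. */
    static void
    wait_for_io_sketch(BufferDesc *buf)
    {
    	ConditionVariable *cv = BufferDescriptorGetIOCV(buf);

    	for (;;)
    	{
    		uint32		buf_state;

    		/* the flag must be examined under the buffer-header spinlock */
    		buf_state = LockBufHdr(buf);
    		UnlockBufHdr(buf, buf_state);

    		if (!(buf_state & BM_IO_IN_PROGRESS))
    			break;
    		ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
    	}
    	ConditionVariableCancelSleep();
    }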
> +		/* we better have ensured the buffer is present until now */
> +		Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);

I'd delete that comment; to me, the assertion alone is clearer.

> +			ereport(LOG,
> +					(errcode(ERRCODE_DATA_CORRUPTED),
> +					 errmsg("invalid page in block %u of relation %s; zeroing out page",

This is changing level s/WARNING/LOG/.  That seems orthogonal to the patch's
goals; is it needed?  If so, I recommend splitting it out as a preliminary
patch, to highlight the behavior change for release notes.

> +/*
> + * Perform completion handling of a single AIO read. This read may cover
> + * multiple blocks / buffers.
> + *
> + * Shared between shared and local buffers, to reduce code duplication.
> + */
> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
> +					  uint8 cb_data, bool is_temp)
> +{
> +	PgAioResult result = prior_result;
> +	PgAioTargetData *td = pgaio_io_get_target_data(ioh);
> +	uint64	   *io_data;
> +	uint8		handle_data_len;
> +
> +	if (is_temp)
> +	{
> +		Assert(td->smgr.is_temp);
> +		Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
> +	}
> +	else
> +		Assert(!td->smgr.is_temp);
> +
> +	/*
> +	 * Iterate over all the buffers affected by this IO and call appropriate
> +	 * per-buffer completion function for each buffer.
> +	 */
> +	io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
> +	for (uint8 buf_off = 0; buf_off < handle_data_len; buf_off++)
> +	{
> +		Buffer		buf = io_data[buf_off];
> +		PgAioResult buf_result;
> +		bool		failed;
> +
> +		Assert(BufferIsValid(buf));
> +
> +		/*
> +		 * If the entire failed on a lower-level, each buffer needs to be

Missing word, probably fix like:
s,entire failed on a lower-level,entire I/O failed on a lower level,

> +		 * marked as failed. In case of a partial read, some buffers may be
> +		 * ok.
> +		 */
> +		failed =
> +			prior_result.status == ARS_ERROR
> +			|| prior_result.result <= buf_off;

I didn't run an experiment to check the following, but I think this should be
s/<=/</.  Suppose we requested two blocks and read some amount of bytes
[1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
should compute failed=false here, but buf_off==1 should compute failed=true.

I see this relies on md_readv_complete having converted "result" to blocks.
Was there some win from doing that as opposed to doing the division here?
Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
to follow, to me.

> +
> +		buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> +											   is_temp);
> +
> +		/*
> +		 * If there wasn't any prior error and the IO for this page failed in
> +		 * some form, set the whole IO's to the page's result.

s/the IO for this page/page verification/
s/IO's/IO's result/

> +		 */
> +		if (result.status != ARS_ERROR && buf_result.status != ARS_OK)
> +		{
> +			result = buf_result;
> +			pgaio_result_report(result, td, LOG);
> +		}
> +	}
> +
> +	return result;
> +}
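(A toy example of the bytes-vs-blocks accounting discussed above, with
made-up numbers:)

    #include <assert.h>

    #define BLCKSZ 8192				/* default PostgreSQL block size */

    int
    main(void)
    {
    	/* Made-up short read: 2 blocks requested, 1.5 blocks' bytes arrive. */
    	int			bytes_read = 3 * BLCKSZ / 2;
    	int			blocks_read = bytes_read / BLCKSZ;	/* division as suggested */

    	assert(blocks_read == 1);	/* block 0 is complete; block 1 is not */
    	return 0;
    }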