Subject: Re: AIO v2.5
From: Noah Misch
Msg-id: 20250319212530.80.nmisch@google.com
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> On 2025-03-11 20:57:43 -0700, Noah Misch wrote:
> > - Like you say, "redefine max_files_per_process to be about the number of
> >   files each *backend* will additionally open".  It will become normal that
> >   each backend's actual FD list length is max_files_per_process + MaxBackends
> >   if io_method=io_uring.  Outcome is not unlike
> >   v6-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch +
> >   v6-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch but we don't
> >   mutate max_files_per_process.  Benchmark results should not change beyond
> >   the inter-major-version noise level unless one sets io_method=io_uring.  I'm
> >   feeling best about this one, but I've not been thinking about it long.
> 
> Yea, I think that's something probably worth doing separately from Jelte's
> patch.  I do think that it'd be rather helpful to have jelte's patch to
> increase NOFILE in addition though.

Agreed.
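
To make that accounting concrete, a throwaway C sketch (every name here is a
stand-in of mine, not the patch's; the real logic would live in fd.c):

  static int max_files_per_process = 1000; /* GUC: per-backend data-file FDs */
  static int max_backends = 100;           /* stand-in for MaxBackends */
  static int use_io_uring = 1;             /* io_method == io_uring */

  static int
  desired_soft_nofile_limit(void)
  {
      int  want = max_files_per_process;

      /* with io_uring, every backend also sees one ring FD per backend */
      if (use_io_uring)
          want += max_backends;

      return want;
  }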

> > > > > +static void
> > > > > +maybe_adjust_io_workers(void)
> > > >
> > > > This also restarts workers that exit, so perhaps name it
> > > > start_io_workers_if_missing().
> > > 
> > > But it also stops IO workers if necessary?
> > 
> > Good point.  Maybe just add a comment like "start or stop IO workers to close
> > the gap between the running count and the configured count intent".
> 
> It's now
> /*
>  * Start or stop IO workers, to close the gap between the number of running
>  * workers and the number of configured workers.  Used to respond to change of
>  * the io_workers GUC (by increasing and decreasing the number of workers), as
>  * well as workers terminating in response to errors (by starting
>  * "replacement" workers).
>  */

Excellent.

> > > > > +{
> > > > ...
> > > > > +        /* Try to launch one. */
> > > > > +        child = StartChildProcess(B_IO_WORKER);
> > > > > +        if (child != NULL)
> > > > > +        {
> > > > > +            io_worker_children[id] = child;
> > > > > +            ++io_worker_count;
> > > > > +        }
> > > > > +        else
> > > > > +            break;                /* XXX try again soon? */
> > > >
> > > > Can LaunchMissingBackgroundProcesses() become the sole caller of this
> > > > function, replacing the current mix of callers?  That would be more conducive
> > > > to promptly doing the right thing after launch failure.
> > > 
> > > I'm not sure that'd be a good idea - right now IO workers are started before
> > > the startup process, as the startup process might need to perform IO. If we
> > > started it only later in ServerLoop() we'd potentially do a fair bit of work,
> > > including starting checkpointer, bgwriter, bgworkers before we started IO
> > > workers.  That shouldn't actively break anything, but it would likely make
> > > things slower.
> > 
> > I missed that.  How about keeping the two calls associated with PM_STARTUP but
> > replacing the assign_io_workers() and process_pm_child_exit() calls with one
> > in LaunchMissingBackgroundProcesses()?
> 
> I think replacing the call in assign_io_workers() is a good idea, that way we
> don't need assign_io_workers().
> 
> Less convinced it's a good idea to do the same for process_pm_child_exit() -
> if IO workers errored out we'll launch backends etc before we get to
> LaunchMissingBackgroundProcesses(). That's not a fundamental problem, but
> seems a bit odd.

Works for me.

> I think LaunchMissingBackgroundProcesses() should be split into one that
> starts aux processes and one that starts bgworkers. The one maintaining aux
> processes should be called before we start backends, the latter not.

That makes sense, though I've not thought about it much.
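
To pin down the idea, the shape I'd picture for that split (names hypothetical):

  /* maintain aux processes; called before backends are admitted */
  static void MaintainAuxProcesses(void);   /* checkpointer, bgwriter, IO workers, ... */

  /* maintain registered background workers; called only later in ServerLoop() */
  static void MaintainBgWorkers(void);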

> > > > > +            /*
> > > > > +             * It's very unlikely, but possible, that reopen fails. E.g. due
> > > > > +             * to memory allocations failing or file permissions changing or
> > > > > +             * such.  In that case we need to fail the IO.
> > > > > +             *
> > > > > +             * There's not really a good errno we can report here.
> > > > > +             */
> > > > > +            error_errno = ENOENT;
> > > >
> > > > Agreed there's not a good errno, but let's use a fake errno that we're mighty
> > > > unlikely to confuse with an actual case of libc returning that errno.  Like
> > > > one of EBADF or EOWNERDEAD.
> > > 
> > > Can we rely on that to be present on all platforms, including windows?
> > 
> > I expect EBADF is universal.  EBADF would be fine.
> 
> Hm, that's actually an error that could happen for other reasons, and IMO
> would be more confusing than ENOENT. The latter describes the issue to a
> reasonable extent, EBADFD seems like it would be more confusing.
> 
> I'm not sure it's worth investing time in this - it really shouldn't happen,
> and we probably have bigger problems than the error code if it does. But if we
> do want to do something, I think I can see a way to report a dedicated error
> message for this.

I agree it's not worth much investment.  Let's leave that one as-is.  We can
always change it further if the not-really-good errno shows up too much.

> > https://github.com/coreutils/gnulib/blob/master/doc/posix-headers/errno.texi
> > lists some OSs not having it, the newest of which looks like NetBSD 9.3
> > (2022).  We could use it and add a #define for platforms lacking it.
> 
> What would we define it as?  I guess we could just pick a high value, but...

Some second-best value, but I withdraw that idea.

On Wed, Mar 12, 2025 at 07:23:47PM -0400, Andres Freund wrote:
> Attached is v2.7, with the following changes:

> Unresolved:
> 
> - Whether to continue starting new workers in process_pm_child_exit()

I'm fine with that continuing.  It's hurting ~nothing.

> - What to name the view (currently pg_aios). I'm inclined to go for
>   pg_io_handles right now.

I like pg_aios mildly better than pg_io_handles, since "handle" sounds
implementation-centric.

On Fri, Mar 14, 2025 at 03:43:15PM -0400, Andres Freund wrote:
> Attached is v2.8 with the following changes:

> - In parallel: Find a way to deal with the set_max_safe_fds() issue that we've
>   been discussing on this thread recently. As that only affects io_uring, it
>   doesn't have to block other patches going in.

As above, I like the "redefine" option.

> - Right now effective_io_concurrency cannot be set > 0 on Windows and other
>   platforms that lack posix_fadvise. But with AIO we can read ahead without
>   posix_fadvise().
> 
>   It'd not really make anything worse than today to not remove the limit, but
>   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
>   to look around and see whether it would require anything other than doc
>   changes.

Worth changing, but non-blocking.

On Fri, Mar 14, 2025 at 03:58:43PM -0400, Andres Freund wrote:
> - Should the docs for debug_io_direct be rephrased and if so, how?

> Perhaps it's worth going from
> 
>        <para>
>         Currently this feature reduces performance, and is intended for
>         developer testing only.
>        </para>
> to
>        <para>
>         Currently this feature reduces performance in many workloads, and is
>         intended for testing only.
>        </para>
> 
> I.e. qualify the downside with "many workloads" and widen the audience ever so
> slightly?

Yes, that's good.


Other than the smgr patch review sent on its own thread, I've not yet reviewed
any of these patches comprehensively.  Given the speed of change, I felt it
was time to flush comments buffered since 2025-03-11:

commit 0284401 wrote:
>     aio: Basic subsystem initialization

> @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
>           */
>          LWLockReleaseAll();
>          pgstat_report_wait_end();
> +        pgaio_error_cleanup();

AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio().  Is that
proper?  They do call pgaio_error_cleanup() as seen here, so the only loss is
some asserts.  (The load-bearing part does get done.)
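
For reference, the sequence in question, abridged from the sigsetjmp() error
paths of those main loops:

  LWLockReleaseAll();
  pgstat_report_wait_end();
  pgaio_error_cleanup();    /* the load-bearing part: reclaims in-flight AIO */
  AtEOXact_Buffers(false);  /* asserts buffer bookkeeping is clean */
  /* AtEOXact_Aio(false) is absent; it would add only assertions on top of
   * pgaio_error_cleanup() */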

commit da72269 wrote:
>     aio: Add core asynchronous I/O infrastructure

> + * This could be in aio_internal.h, as it is not pubicly referenced, but

typo -> publicly

commit 55b454d wrote:
>     aio: Infrastructure for io_method=worker

> +        /* Try to launch one. */
> +        child = StartChildProcess(B_IO_WORKER);
> +        if (child != NULL)
> +        {
> +            io_worker_children[id] = child;
> +            ++io_worker_count;
> +        }
> +        else
> +            break;                /* XXX try again soon? */

I'd change the comment to something like one of:

  retry after DetermineSleepTime()
  next LaunchMissingBackgroundProcesses() will retry in <60s

On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> - Decide what to do about the smgr interrupt issue

Replied on that thread.  It's essentially ready.

> Questions / Unresolved:
> 
> - Write support isn't going to land in 18, but there is a tiny bit of code
>   regarding writes in the code for bufmgr IO. I guess I could move that to a
>   later commit?
> 
>   I'm inclined to leave it, the structure of the code only really makes
>   sense knowing that it's going to be shared between reads & writes.

Fine to leave it.

> - pg_aios view name

Covered above.

> Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support

Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:

* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock).  The process doing a read or write sets the flag for the
duration, and processes that need to wait for it to be cleared sleep on a
condition variable.

And these individual lines from "git grep BM_IO_IN_PROGRESS":
         * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
     * only one process at a time can set the BM_IO_IN_PROGRESS bit.
     * only one process at a time can set the BM_IO_IN_PROGRESS bit.
 *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.

The last especially.  For the other three lines and the paragraph, the notion
of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
being the process "doing a read" becomes less significant when one process
starts the IO and another completes it.

> +        /* we better have ensured the buffer is present until now */
> +        Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);

I'd delete that comment; to me, the assertion alone is clearer.

> +            ereport(LOG,
> +                    (errcode(ERRCODE_DATA_CORRUPTED),
> +                     errmsg("invalid page in block %u of relation %s; zeroing out page",

This is changing level s/WARNING/LOG/.  That seems orthogonal to the patch's
goals; is it needed?  If so, I recommend splitting it out as a preliminary
patch, to highlight the behavior change for release notes.

> +/*
> + * Perform completion handling of a single AIO read. This read may cover
> + * multiple blocks / buffers.
> + *
> + * Shared between shared and local buffers, to reduce code duplication.
> + */
> +static pg_attribute_always_inline PgAioResult
> +buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
> +                      uint8 cb_data, bool is_temp)
> +{
> +    PgAioResult result = prior_result;
> +    PgAioTargetData *td = pgaio_io_get_target_data(ioh);
> +    uint64       *io_data;
> +    uint8        handle_data_len;
> +
> +    if (is_temp)
> +    {
> +        Assert(td->smgr.is_temp);
> +        Assert(pgaio_io_get_owner(ioh) == MyProcNumber);
> +    }
> +    else
> +        Assert(!td->smgr.is_temp);
> +
> +    /*
> +     * Iterate over all the buffers affected by this IO and call appropriate
> +     * per-buffer completion function for each buffer.
> +     */
> +    io_data = pgaio_io_get_handle_data(ioh, &handle_data_len);
> +    for (uint8 buf_off = 0; buf_off < handle_data_len; buf_off++)
> +    {
> +        Buffer        buf = io_data[buf_off];
> +        PgAioResult buf_result;
> +        bool        failed;
> +
> +        Assert(BufferIsValid(buf));
> +
> +        /*
> +         * If the entire failed on a lower-level, each buffer needs to be

Missing word, probably fix like:
s,entire failed on a lower-level,entire I/O failed on a lower level,

> +         * marked as failed. In case of a partial read, some buffers may be
> +         * ok.
> +         */
> +        failed =
> +            prior_result.status == ARS_ERROR
> +            || prior_result.result <= buf_off;

I didn't run an experiment to check the following, but I think this should be
s/<=/</.  Suppose we requested two blocks and read some amount of bytes [1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
[1*BLCKSZ, 2*BLSCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
should compute failed=false here, but buf_off==1 should compute failed=true.

I see this relies on md_readv_complete having converted "result" to blocks.
Was there some win from doing that as opposed to doing the division here?
Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
to follow, to me.
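
The division alternative, as a throwaway C program (constants and structure
mine, not the patch's), for checking the arithmetic:

  #include <stdio.h>

  #define BLCKSZ 8192

  int
  main(void)
  {
      /* a two-block read that came back short: one full block plus change */
      int raw_result = BLCKSZ + 100;
      int blocks_read = raw_result / BLCKSZ;  /* = 1 */

      for (int buf_off = 0; buf_off < 2; buf_off++)
      {
          int failed = (buf_off >= blocks_read);

          /* prints buf_off=0 failed=0, then buf_off=1 failed=1 */
          printf("buf_off=%d failed=%d\n", buf_off, failed);
      }
      return 0;
  }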

> +
> +        buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> +                                               is_temp);
> +
> +        /*
> +         * If there wasn't any prior error and the IO for this page failed in
> +         * some form, set the whole IO's to the page's result.

s/the IO for this page/page verification/
s/IO's/IO's result/

> +         */
> +        if (result.status != ARS_ERROR && buf_result.status != ARS_OK)
> +        {
> +            result = buf_result;
> +            pgaio_result_report(result, td, LOG);
> +        }
> +    }
> +
> +    return result;
> +}


