Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Date
Msg-id cz6eu7prqdibz35nljpqwyzyp37afncsp4r3fk4lkssiecdjck@vwew2c2vs2ar
In response to Re: AIO v2.5  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
Hi,

On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 01:06:03PM -0400, Andres Freund wrote:
> > - Right now effective_io_concurrency cannot be set > 0 on Windows and other
> >   platforms that lack posix_fadvise. But with AIO we can read ahead without
> >   posix_fadvise().
> >
> >   It'd not really make anything worse than today to not remove the limit, but
> >   it'd be pretty weird to prevent windows etc from benefiting from AIO.  Need
> >   to look around and see whether it would require anything other than doc
> >   changes.
>
> Worth changing, but non-blocking.

Thankfully Melanie submitted a patch for that...


> Other than the smgr patch review sent on its own thread, I've not yet reviewed
> any of these patches comprehensively.  Given the speed of change, I felt it
> was time to flush comments buffered since 2025-03-11:

Thanks!


> commit 0284401 wrote:
> >     aio: Basic subsystem initialization
>
> > @@ -465,6 +466,7 @@ AutoVacLauncherMain(const void *startup_data, size_t startup_data_len)
> >           */
> >          LWLockReleaseAll();
> >          pgstat_report_wait_end();
> > +        pgaio_error_cleanup();
>
> AutoVacLauncherMain(), BackgroundWriterMain(), CheckpointerMain(), and
> WalWriterMain() call AtEOXact_Buffers() but not AtEOXact_Aio().  Is that
> proper?  They do call pgaio_error_cleanup() as seen here, so the only loss is
> some asserts.  (The load-bearing part does get done.)

I don't think it's particularly good that we use the AtEOXact_* functions in
the sigsetjmp blocks; that feels like a weird mixup of infrastructure to
me. So this was intentional.


> commit da72269 wrote:
> >     aio: Add core asynchronous I/O infrastructure
>
> > + * This could be in aio_internal.h, as it is not pubicly referenced, but
>
> typo -> publicly

/me has a red face.


> commit 55b454d wrote:
> >     aio: Infrastructure for io_method=worker
>
> > +        /* Try to launch one. */
> > +        child = StartChildProcess(B_IO_WORKER);
> > +        if (child != NULL)
> > +        {
> > +            io_worker_children[id] = child;
> > +            ++io_worker_count;
> > +        }
> > +        else
> > +            break;                /* XXX try again soon? */
>
> I'd change the comment to something like one of:
>
>   retry after DetermineSleepTime()
>   next LaunchMissingBackgroundProcesses() will retry in <60s

Hm, we retry more frequently than that if there are new connections...  Maybe
just "try again next time"?


> On Tue, Mar 18, 2025 at 04:12:18PM -0400, Andres Freund wrote:
> > - Decide what to do about the smgr interrupt issue
>
> Replied on that thread.  It's essentially ready.

Cool, will reply there in a bit.


> > Subject: [PATCH v2.10 08/28] bufmgr: Implement AIO read support
>
> Some comments about BM_IO_IN_PROGRESS may need updates.  This paragraph:
>
> * The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
> buffer to complete (and in releases before 14, it was accompanied by a
> per-buffer LWLock).  The process doing a read or write sets the flag for the
> duration, and processes that need to wait for it to be cleared sleep on a
> condition variable.

First draft:
* The BM_IO_IN_PROGRESS flag acts as a kind of lock, used to wait for I/O on a
buffer to complete (and in releases before 14, it was accompanied by a
per-buffer LWLock).  The process starting a read or write sets the flag. When
the I/O is completed, be it by the process that initiated the I/O or by another
process, the flag is removed and the buffer's condition variable is signalled.
Processes that need to wait for the I/O to complete can wait for asynchronous
I/O using BufferDesc->io_wref and for BM_IO_IN_PROGRESS to be unset by
sleeping on the buffer's condition variable.


> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>
>  *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>
> The last especially.

Huh - yea.  This isn't a "new" issue; I think I missed this comment in 16's
12f3867f5534.  I think the comment can just be deleted?


>          * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.

> For the other three lines and the paragraph, the notion
> of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> being the process "doing a read" becomes less significant when one process
> starts the IO and another completes it.

Hm. I think they'd be ok as-is, but we can probably improve them. Maybe:

     * Now it's safe to write buffer to disk. Note that no one else should
     * have been able to write it while we were busy with log flushing because
     * we got the exclusive right to perform I/O by setting the
     * BM_IO_IN_PROGRESS bit.



> > +        /* we better have ensured the buffer is present until now */
> > +        Assert(BUF_STATE_GET_REFCOUNT(buf_state) >= 1);
>
> I'd delete that comment; to me, the assertion alone is clearer.

Ok.


> > +            ereport(LOG,
> > +                    (errcode(ERRCODE_DATA_CORRUPTED),
> > +                     errmsg("invalid page in block %u of relation %s; zeroing out page",
>
> This is changing level s/WARNING/LOG/.  That seems orthogonal to the patch's
> goals; is it needed?  If so, I recommend splitting it out as a preliminary
> patch, to highlight the behavior change for release notes.

No, it's not needed. I think I looked over the patch at some point and
considered the log-level wrong according to our guidelines and thought I'd
broken it.


> > +        /*
> > +         * If the entire failed on a lower-level, each buffer needs to be
>
> Missing word, probably fix like:
> s,entire failed on a lower-level,entire I/O failed on a lower level,

Yep.


> > +         * marked as failed. In case of a partial read, some buffers may be
> > +         * ok.
> > +         */
> > +        failed =
> > +            prior_result.status == ARS_ERROR
> > +            || prior_result.result <= buf_off;
>
> I didn't run an experiment to check the following, but I think this should be
> s/<=/</.  Suppose we requested two blocks and read some amount of bytes
> [1*BLCKSZ, 2*BLCKSZ - 1].  md_readv_complete will store result=1.  buf_off==0
> should compute failed=false here, but buf_off==1 should compute failed=true.

Huh, you might be right. I thought I wrote a test for this; I wonder why it
didn't catch the problem...


> I see this relies on md_readv_complete having converted "result" to blocks.
> Was there some win from doing that as opposed to doing the division here?
> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> to follow, to me.

It seemed like that would be wrong layering - what if we had an smgr that
could store data in a compressed format? The raw read would be of a smaller
size. The smgr API deals in BlockNumbers, only the md.c layer should know
about bytes.


> > +
> > +        buf_result = buffer_readv_complete_one(buf_off, buf, cb_data, failed,
> > +                                               is_temp);
> > +
> > +        /*
> > +         * If there wasn't any prior error and the IO for this page failed in
> > +         * some form, set the whole IO's to the page's result.
>
> s/the IO for this page/page verification/
> s/IO's/IO's result/

Agreed.

Thanks for the review!

Greetings,

Andres Freund


