Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Date
Msg-id 7co5jtx2jkvpg5pde7f5wyub6w6zptfw4sbuj4n5j4iedj5zlg@u3mzxsgopnhl
Whole thread Raw
In response to Re: AIO v2.5  (Noah Misch <noah@leadboat.com>)
Responses Re: AIO v2.5
List pgsql-hackers
Hi,

On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
> > On 2025-03-19 14:25:30 -0700, Noah Misch wrote:

> > Hm, we retry more frequently that that if there are new connections...  Maybe
> > just "try again next time"?
> 
> Works for me.
> 

> > > And these individual lines from "git grep BM_IO_IN_PROGRESS":
> > >
> > >  *    i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
> > >
> > > The last especially.
> > 
> > Huh - yea.  This isn't a "new" issue, I think I missed this comment in 16's
> > 12f3867f5534.  I think the comment can just be deleted?
> 
> Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.
> 
> > >          * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
> > >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > >      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
> > 
> > > For the other three lines and the paragraph, the notion
> > > of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
> > > being the process "doing a read" becomes less significant when one process
> > > starts the IO and another completes it.
> > 
> > Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
> 
> Looking again, I agree they're okay.
> 
> > 
> >      * Now it's safe to write buffer to disk. Note that no one else should
> >      * have been able to write it while we were busy with log flushing because
> >      * we got the exclusive right to perform I/O by setting the
> >      * BM_IO_IN_PROGRESS bit.
> 
> That's fine too.  Maybe s/perform/stage/ or s/perform/start/.

I put these comment changes into their own patch, as it seemed confusing to
change them as part of one of the already queued commits.


> > > I see this relies on md_readv_complete having converted "result" to blocks.
> > > Was there some win from doing that as opposed to doing the division here?
> > > Division here ("blocks_read = prior_result.result / BLCKSZ") would feel easier
> > > to follow, to me.
> > 
> > It seemed like that would be wrong layering - what if we had an smgr that
> > could store data in a compressed format? The raw read would be of a smaller
> > size. The smgr API deals in BlockNumbers, only the md.c layer should know
> > about bytes.
> 
> I hadn't thought of that.  That's a good reason.

I thought that was better documented, but alas, it wasn't. How about updating
the documentation of smgrstartreadv to the following:

/*
 * smgrstartreadv() -- asynchronous version of smgrreadv()
 *
 * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
 * `ioh` all parameters are the same as smgrreadv().
 *
 * Completion callbacks above smgr will be passed the result as the number of
 * successfully read blocks if the read [partially] succeeds. This maintains
 * the abstraction that smgr operates on the level of blocks, rather than
 * bytes.
 */


I briefly had a bug in test_aio's injection point that lead to *increasing*
the number of bytes successfully read. That triggered an assertion failure in
bufmgr.c, but not closer to the problem.  Is it worth adding an assert against
that to md_readv_complete? Can't quite decide.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Next
From: "David G. Johnston"
Date:
Subject: Have postgres.bki self-identify