Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250404211802.fc.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Fri, Apr 04, 2025 at 03:16:18PM -0400, Andres Freund wrote:
> On 2025-04-03 12:40:23 -0700, Noah Misch wrote:
> > On Thu, Apr 03, 2025 at 02:19:43PM -0400, Andres Freund wrote:

> > In the general case, we could want client requests as follows:
> > 
> > - If completor==definer and has not dropped pin:
> >   - Make defined before verifying page.  That's all.  It might be cleaner to
> >     do this when first retrieving a return value from io_uring, since this
> >     just makes up for what Valgrind already does for readv().
> 
> Yea, I think it's better to do that in io_uring. It's what I have done in the
> attached.
> 
> 
> > - If completor!=definer or has dropped pin:
> >   - Make NOACCESS in definer when definer cedes its own pin.
> 
> That's the current behaviour for shared buffers, right?

Yes.

> >   - For io_method=worker, make UNDEFINED before starting readv().  It might be
> >     cleanest to do this when the worker first acts as the owner of the AIO
> >     subsystem pin, if that's a clear moment earlier than readv().
> 
> Hm, what do we need this for?

At the time, we likely didn't need it:

- If the worker does its own PinBuffer*()+unpin, we don't need it.  Those
  functions do the Valgrind client requests.
- If the worker relies on the AIO-subsystem-owned pin and does neither regular
  pin nor regular unpin, we don't need it.  Buffers are always "defined".
- If the worker relies on the AIO-subsystem-owned pin to skip PinBuffer*() but
  uses regular unpin code, then the buffer may be NOACCESS.  Then one would
  need this.  But this would be questionable for other reasons.

Your proposed change to set NOACCESS in buffer_readv_complete_one() interacts
with things further, making the UNDEFINED necessary.

> >   - Make DEFINED in completor before verifying page.  It might be cleaner to
> >     do this when the completor first retrieves a return value from io_uring,
> >     since this just makes up for what Valgrind already does for readv().
> 
> I think we can't rely on the marking during retrieving it from io_uring, as
> that might have happened in a different backend for a temp buffer. That'd only
> happen if we got io_uring events for *another* IO that involved a shared rel,
> but it can happen.

Good point.  I think the VALGRIND_MAKE_MEM_DEFINED() in
pgaio_uring_drain_locked() isn't currently needed at all.  If
completor-subxact==definer-subxact, PinBuffer() already did what Valgrind
needs.  Otherwise, buffer_readv_complete_one() does what Valgrind needs.

If that's right, it would still be nice to reach the right
VALGRIND_MAKE_MEM_DEFINED() without involving bufmgr.  That helps future,
non-bufmgr AIO use cases.  It's tricky to pick the right place for that
VALGRIND_MAKE_MEM_DEFINED():

- pgaio_uring_drain_locked() is problematic, I think.  In the localbuf case,
  the iovec base address is relevant only in the ioh-defining process.  In the
  shmem completor!=definer case, this runs only in the completor.

- A complete_local callback solves those problems.  However, if the
  AIO-defining subxact aborted, then we shouldn't set DEFINED at all, since
  the buffer mapping may have changed by the time of complete_local.

- Putting it in the place that would call pgaio_result_report(ERROR) if
  needed, e.g. ProcessReadBuffersResult(), solves the problem of the buffer
  mapping having moved.  ProcessReadBuffersResult() doesn't even need this,
  since PinBuffer() already did it.  Each future AIO use case will have a
  counterpart of ProcessReadBuffersResult() that consumes the result and
  proceeds with tasks that depend on the AIO.  That's the place.

Is that right?  I got this wrong a few times while trying to think through it,
so I'm not too confident in the above.

> > > Not quite sure if we should mark
> > > the entire IOV is efined or just the portion that was actually read - the
> > > latter is additional fiddly code, and it's not clear it's likely to be helpful?
> > 
> > Seems fine to do the simpler way if that saves fiddly code.
> 
> Can't quite decide, it's just at the border of what I consider too
> fiddly... See the change to method_io_uring.c in the attached patch.

It is at the border, as you say, but I'd tend to keep it.


> Subject: [PATCH v1 1/3] localbuf: Add Valgrind buffer access instrumentation

Ready for commit


> Subject: [PATCH v1 2/3] aio: Make AIO compatible with valgrind

See above about pgaio_uring_drain_locked().

> related code until it is pinned bu "user" code again. But it requires some

s/bu/by/

> + * Return the iovecand its length. Currently only expected to be used by

s/iovecand/iovec and/

> @@ -361,13 +405,16 @@ pgaio_uring_drain_locked(PgAioUringContext *context)
>          for (int i = 0; i < ncqes; i++)
>          {
>              struct io_uring_cqe *cqe = cqes[i];
> +            int32        res;
>              PgAioHandle *ioh;
>  
>              ioh = io_uring_cqe_get_data(cqe);
>              errcallback.arg = ioh;
> +            res = cqe->res;
> +
>              io_uring_cqe_seen(&context->io_uring_ring, cqe);
>  
> -            pgaio_io_process_completion(ioh, cqe->res);
> +            pgaio_uring_io_process_completion(ioh, res);

I guess this is a distinct cleanup, done to avoid any suspicion of cqe being
reused asynchronously after io_uring_cqe_seen().  Is that right?


> Subject: [PATCH v1 3/3] aio: Avoid spurious coverity warning

Ready for commit



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Fundamental scheduling bug in parallel restore of partitioned tables
Next
From: Heikki Linnakangas
Date:
Subject: Re: Proposal: Limitations of palloc inside checkpointer