Re: AIO v2.0 - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: AIO v2.0
Msg-id: 20240917180819.e1.nmisch@google.com
In response to: Re: AIO v2.0 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

> > Reattaching descriptors and memory in each child may work, or one could just
> > block io_method=io_uring under EXEC_BACKEND.
> 
> I think the latter option is saner

Works for me.

> > > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > +{
> 
> > > +        if (ret == -EINTR)
> > > +        {
> > > +            elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > > +            continue;
> > > +        }
> >
> > Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
> > retry on EAGAIN.  "man io_uring_enter" has:
> >
> >     EAGAIN The kernel was unable to allocate memory for the request, or
> >     otherwise ran out of resources to handle it. The application should wait
> >     for some completions and try again.
> 
> Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
> in flight for each uring instance. That's different from a use of uring to
> e.g. wait for incoming network data on thousands of sockets, where you could
> have an essentially unbounded number of requests outstanding.
> 
> What would we wait for? What if we were holding a critical lock in that
> moment? Would it be safe to just block for some completions? What if there's
> actually no IO in progress?

I'd try the following.  First, scan for all IOs of all processes at
AHS_DEFINED or later, advancing them to AHS_COMPLETED_SHARED.  That might be
unsafe today, but discovering why it's unsafe will likely inform the design
beyond EAGAIN returns; I don't know of a specific way it's unsafe.  Do just
one pass of that, since there may be newer IOs in progress afterward.  If
submit still gets EAGAIN, sleep briefly and retry.  As in
pgwin32_open_handle(), fail after a fixed number of iterations.  That isn't
great if we hold a critical lock, but it beats PANICking on the first EAGAIN.
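
Roughly, as an untested sketch; pgaio_drain_foreign_ios() is a hypothetical
helper standing in for the one-pass scan, and the retry cap mirrors
pgwin32_open_handle():

#include "postgres.h"

#include <liburing.h>
#include <string.h>

#define PGAIO_SUBMIT_MAX_RETRIES 10 /* arbitrary, like pgwin32_open_handle() */

static int
pgaio_uring_submit_retry(struct io_uring *ring, int num_staged_ios)
{
    int         retries = 0;

    for (;;)
    {
        int         ret = io_uring_submit(ring);

        if (ret >= 0)
            return ret;
        if (ret == -EINTR)
        {
            elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
            continue;
        }
        if (ret != -EAGAIN)
            elog(PANIC, "io_uring submit failed: %s", strerror(-ret));

        /*
         * EAGAIN: the kernel ran out of resources.  Make one pass
         * advancing IOs of all processes from AHS_DEFINED (or later) to
         * AHS_COMPLETED_SHARED, then back off briefly before retrying.
         */
        if (++retries > PGAIO_SUBMIT_MAX_RETRIES)
            elog(PANIC, "repeated EAGAIN from io_uring submit");

        pgaio_drain_foreign_ios();  /* hypothetical one-pass scan */
        pg_usleep(10 * 1000L);      /* 10ms */
    }
}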

> > > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > > +                int iovcnt, off_t offset,
> > > +                uint32 wait_event_info)
> > > +{
> > > ...
> >
> > FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> > name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
> 
> Hm - that doesn't necessarily seem right to me. I don't think the caller
> should assume that the IO will just be prepared and not already completed by
> the time FileStartWriteV() returns - we might actually do the IO
> synchronously.

Yes.  Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO.  Still, I think this shouldn't use the term "Start" while no state
name uses that term.  What else could remove that mismatch?

> > Is there any specific decision you'd like to settle before patch 6 exits
> > WIP?
> 
> Patch 6 specifically? That I really mainly kept separate for review - it

No.  I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

> doesn't seem particularly interesting to commit it earlier than 7, or do you
> think differently?

No, I agree a lone commit of 6 isn't a win.  Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit.  I've not thought through that
grouping much.

> In case you mean 6+7 or 6 to ~11, I can think of the following:
> 
> - I am worried about the need for bounce buffers for writes of checksummed
>   buffers. That quickly ends up being a significant chunk of memory,
>   particularly when using a small shared_buffers with a higher-than-default
>   number of connections. I'm currently hacking up a prototype that'd prevent us
>   from setting hint bits with just a share lock. I'm planning to start a
>   separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends,
i.e. roughly 200MB at MaxBackends=100.  Doing better is nice, but I don't
consider this a blocker.  I recommend dealing with the worry by reducing the
limit initially (128 blocks?); we can always raise it later.

> - The header split doesn't quite seem right yet

I won't have a strong opinion on that one.  The aio.c/aio_io.c split did catch
my attention.  I made a note to check it again once those files get header
comments.

> - I'd like to implement retries in the later patches, to make sure that it
>   doesn't have design implications

Yes, that's a blocker to me.

> - Worker mode needs to be able to automatically adjust the number of running
>   workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker.  (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct.  I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)

> - I think the PgAioHandles need to be slimmed down a bit - there's some design
>   evolution visible that should not end up in the tree.

Okay.

> - I'm not sure that I like name "subject" for the different things AIO is
>   performed for

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of
  https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO,
  in the object-oriented sense]
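
For illustration only (this enum and its members are invented, not taken from
the patch), the "tag" option could read:

/* The tag says what kind of entity an IO is performed for. */
typedef enum PgAioTag
{
    PGAIO_TAG_INVALID = 0,
    PGAIO_TAG_SMGR,             /* IO done on behalf of smgr */
} PgAioTag;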

> - I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
>   of buffer ids that are affected by one IO) could be replaced by repurposing
>   BufferDesc->freeNext or something along those lines. I don't like the amount
>   of memory required for storing those arrays, even if it's not that much
>   compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
>   handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.
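
To make the freeNext idea concrete, an untested sketch; the stand-in struct
and its first_buffer field are invented, and it assumes the buffers stay
pinned and off the freelist for the IO's lifetime:

#include "postgres.h"
#include "storage/buf_internals.h"

/* Stand-in for the patch's PgAioHandle; first_buffer is invented. */
typedef struct PgAioHandleSketch
{
    Buffer      first_buffer;   /* head of the freeNext chain */
} PgAioHandleSketch;

/*
 * Chain the shared buffers belonging to one IO through
 * BufferDesc->freeNext rather than storing a separate buffer-id array.
 */
static void
pgaio_io_chain_buffers(PgAioHandleSketch *ioh,
                       const Buffer *buffers, int nbuffers)
{
    ioh->first_buffer = buffers[0];

    for (int i = 0; i < nbuffers; i++)
    {
        BufferDesc *hdr = GetBufferDescriptor(buffers[i] - 1);

        hdr->freeNext = (i + 1 < nbuffers) ? buffers[i + 1] - 1
                                           : FREENEXT_END_OF_LIST;
    }
}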

> - I'd like to extend the test module to actually test more cases, it's too
>   hard to reach some paths, particularly without [a lot] of users yet. That's
>   not strictly a dependency of the earlier patches - since the initial patches
>   can't actually do much in the way of IO.

Agreed.  Of the code that check-world still leaves uncovered after these
patches, which parts have the most risk?

> - We shouldn't reserve AioHandles etc for io workers - but because different
>   types of aux processes don't use a predetermined ProcNumber, that's not
>   entirely trivial without adding more complexity.  I've actually wondered
>   whether IO workers should be their own "top-level" kind of process, rather
>   than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.  Of the items I'm calling non-blockers, which would you most
regret deferring?

> - Right now the io_uring mode has each backend's io_uring instance visible to
>   each other process. That ends up using a fair number of FDs. That's OK from
>   an efficiency perspective, but I think we'd need to add code to adjust the
>   soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
>   various programs that iterate over all possible FDs, causing significant

Agreed on raising the soft limit.  Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.
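
Something like this sketch could run during startup; the function name and
target value are illustrative:

#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE toward the hard limit, capped there, so
 * per-backend io_uring instances don't exhaust descriptors.  Real code
 * would report getrlimit()/setrlimit() failures.
 */
static void
pgaio_raise_nofile_limit(rlim_t needed)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;
    if (needed > rlim.rlim_max)
        needed = rlim.rlim_max; /* can't exceed the hard limit */
    if (rlim.rlim_cur < needed)
    {
        rlim.rlim_cur = needed;
        (void) setrlimit(RLIMIT_NOFILE, &rlim);
    }
}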

>   slowdowns when the soft limit defaults to something high).  I earlier had a
>   limited number of io_uring instances, but that added a fair amount of
>   overhead because then submitting IO would require a lock.
> 
>   That again doesn't have to be solved as part of the earlier patches but
>   might have some minor design impact.

How far do you see the design impact spreading on that one?

Thanks,
nm


