Thread: Re: AIO v2.0
Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > There's plenty more to do, but I thought this would be a useful checkpoint.
>
> I find patches 1-5 are Ready for Committer.

Cool!

> > +typedef enum PgAioHandleState
>
> This enum clarified a lot for me, so I wish I had read it before anything
> else. I recommend referring to it in README.md.

Makes sense.

> Would you also cover the valid state transitions and which of them any
> backend can do vs. which are specific to the defining backend?

Yea, we should. I earlier had something, but because details were still
changing it was hard to keep up to date.

> > +{
> > +	/* not in use */
> > +	AHS_IDLE = 0,
> > +
> > +	/* returned by pgaio_io_get() */
> > +	AHS_HANDED_OUT,
> > +
> > +	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> > +	AHS_DEFINED,
> > +
> > +	/* subject's prepare() callback has been called */
> > +	AHS_PREPARED,
> > +
> > +	/* IO is being executed */
> > +	AHS_IN_FLIGHT,
>
> Let's align terms between functions and states those functions reach. For
> example, I recommend calling this state AHS_SUBMITTED, because
> pgaio_io_prepare_submit() is the function reaching this state.
> (Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.

> > +void
> > +pgaio_io_ref_wait(PgAioHandleRef *ior)
> > +{
> > +	uint64		ref_generation;
> > +	PgAioHandleState state;
> > +	bool		am_owner;
> > +	PgAioHandle *ioh;
> > +
> > +	ioh = pgaio_io_from_ref(ior, &ref_generation);
> > +
> > +	am_owner = ioh->owner_procno == MyProcNumber;
> > +
> > +
> > +	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> > +		return;
> > +
> > +	if (am_owner)
> > +	{
> > +		if (state == AHS_DEFINED || state == AHS_PREPARED)
> > +		{
> > +			/* XXX: Arguably this should be prevented by callers? */
> > +			pgaio_submit_staged();
>
> Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
> longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.

That, or not having submitted the IO.

One thing I've been thinking about as being potentially helpful
infrastructure is to have something similar to a critical section, except
that it asserts that one is not allowed to block or forget submitting staged
IOs.

> > +void
> > +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> > +{
> > +	Assert(ioh->state == AHS_HANDED_OUT);
> > +	Assert(pgaio_io_has_subject(ioh));
> > +
> > +	ioh->op = op;
> > +	ioh->state = AHS_DEFINED;
> > +	ioh->result = 0;
> > +
> > +	/* allow a new IO to be staged */
> > +	my_aio->handed_out_io = NULL;
> > +
> > +	pgaio_io_prepare_subject(ioh);
> > +
> > +	ioh->state = AHS_PREPARED;
>
> As defense in depth, let's add a critical section from before assigning
> AHS_DEFINED to here. This code already needs to be safe for that (per
> README.md). When running outside a critical section, an ERROR in a subject
> callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
> doubt there's a plausible way to reach that leak today, but future subject
> callbacks could add risk over time.

Makes sense.
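I.e. roughly like this, I assume (untested sketch, just to show where I'd put
the critical section):

	void
	pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
	{
		Assert(ioh->state == AHS_HANDED_OUT);
		Assert(pgaio_io_has_subject(ioh));

		/*
		 * Defense in depth: run the HANDED_OUT -> DEFINED -> PREPARED
		 * transition in a critical section, so an ERROR in a subject
		 * callback escalates to PANIC instead of leaking e.g. a disowned
		 * lwlock or a defined-but-unsubmitted IO.
		 */
		START_CRIT_SECTION();

		ioh->op = op;
		ioh->state = AHS_DEFINED;
		ioh->result = 0;

		/* allow a new IO to be staged */
		my_aio->handed_out_io = NULL;

		pgaio_io_prepare_subject(ioh);

		ioh->state = AHS_PREPARED;

		END_CRIT_SECTION();
	}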
> > +if test "$with_liburing" = yes; then
> > +  PKG_CHECK_MODULES(LIBURING, liburing)
> > +fi
>
> I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.

> With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
> fails early:

Right - that's to be expected.

> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
> 2024-09-15 12:46:08.203 PDT startup[2069423] LOG:  database system was shut down at 2024-09-15 12:46:07 PDT
> 2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL:  the database system is starting up
> 2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG:  database system is ready to accept connections
> 2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC:  failed: -9/Bad file descriptor
> 2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC:  failed: -95/Operation not supported
> 2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC:  unexpected: -95/Operation not supported: No such file or directory
> 2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG:  received fast shutdown request
>
> I expect that's from io_uring_queue_init() stashing in shared memory a file
> descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
> Reattaching descriptors and memory in each child may work, or one could just
> block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful for portability testing.

> > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > +{
> >
> > +		if (ret == -EINTR)
> > +		{
> > +			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > +			continue;
> > +		}
>
> Since io_uring_submit() is a wrapper around io_uring_enter(), this should
> also retry on EAGAIN. "man io_uring_enter" has:
>
>     EAGAIN The kernel was unable to allocate memory for the request, or
>     otherwise ran out of resources to handle it. The application should wait
>     for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to
be in flight for each uring instance. That's different from a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have an essentially unbounded number of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

> > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > +				int iovcnt, off_t offset,
> > +				uint32 wait_event_info)
> > +{
> > ...
>
> FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

> For non-sync IO methods, I gather it's essential that a process other than
> the IO definer be scanning for incomplete IOs and completing them.

Yep - it's something I've been fighting with / redesigning a *lot*.
Earlier the AIO subsystem could transparently retry IOs, but that ends up
being a nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:

1) What if the IO is being completed in a critical section? We can't reopen
   the file in that situation.

   My initial fix for this was to defer retries, but that's problematic too:

2) Acquiring an IO needs to be able to guarantee forward progress. Because
   there's a limited number of IOs, that means we need to be able to complete
   IOs while acquiring an IO. So we can't just keep the IO handle around -
   which in turn means that we'd need to save the state for retrying
   somewhere. Which would require some pre-allocated memory to save that
   state.

Thus I think it's actually better if we delegate retries to the callsites.

I was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...

> Otherwise, deadlocks like this would happen:
>
>   backend1 locks blk1 for non-IO reasons
>   backend2 locks blk2, starts AIO write
>   backend1 waits for lock on blk2 for non-IO reasons
>   backend2 waits for lock on blk1 for non-IO reasons
>
> If that's right, in worker mode, the IO worker resolves that deadlock. What
> resolves it under io_uring? Another process that happens to do
> pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> make that happen systematically.

Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.
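Very roughly like the following, in bufmgr.c (sketch only -
pgaio_io_ref_from_buffer() is a made-up accessor for "does this buffer have
IO in flight", not something that exists in the patch set yet):

	else if (mode == BUFFER_LOCK_EXCLUSIVE)
	{
		for (;;)
		{
			PgAioHandleRef io_ref;

			if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
										 LW_EXCLUSIVE))
				break;

			/*
			 * Couldn't get the content lock immediately. If the buffer has
			 * AIO in flight, wait for (and thereby help complete) that IO
			 * before retrying, so we can't deadlock against a backend that
			 * submitted IO on this buffer and now waits on a lock we hold.
			 */
			if (pgaio_io_ref_from_buffer(buf, &io_ref))
				pgaio_io_ref_wait(&io_ref);
			else
			{
				LWLockAcquire(BufferDescriptorGetContentLock(buf),
							  LW_EXCLUSIVE);
				break;
			}
		}
	}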
> I could share more-tactical observations about patches 6-20, but they're
> probably things you'd change without those observations.

Agreed.

> Is there any specific decision you'd like to settle before patch 6 exits
> WIP?

Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
  buffers. That quickly ends up being a significant chunk of memory,
  particularly when using a small shared_buffers with a higher than default
  number of connections. I'm currently hacking up a prototype that'd prevent
  us from setting hint bits with just a share lock. I'm planning to start a
  separate thread about that.

- The header split doesn't quite seem right yet.

- I'd like to implement retries in the later patches, to make sure that it
  doesn't have design implications.

- Worker mode needs to be able to automatically adjust the number of running
  workers, I think - otherwise it's going to be too hard to tune.

- I think the PgAioHandles need to be slimmed down a bit - there's some
  design evolution visible that should not end up in the tree.

- I'm not sure that I like the name "subject" for the different things AIO is
  performed for.

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
  of buffer ids that are affected by one IO) could be replaced by repurposing
  BufferDesc->freeNext or something along those lines. I don't like the
  amount of memory required for storing those arrays, even if it's not that
  much compared to needing space to store struct iovec[PG_IOV_MAX] for each
  AIO handle.

- I'd like to extend the test module to actually test more cases, it's too
  hard to reach some paths, particularly without [a lot] of users yet. That's
  not strictly a dependency of the earlier patches - since the initial
  patches can't actually do much in the way of IO.

- We shouldn't reserve AioHandles etc. for IO workers - but because different
  types of aux processes don't use a predetermined ProcNumber, that's not
  entirely trivial without adding more complexity. I've actually wondered
  whether IO workers should be their own "top-level" kind of process, rather
  than an aux process. But that seems quite costly.

- Right now the io_uring mode has each backend's io_uring instance visible to
  each other process. That ends up using a fair number of FDs. That's OK from
  an efficiency perspective, but I think we'd need to add code to adjust the
  soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
  various programs that iterate over all possible FDs, causing significant
  slowdowns when the soft limit defaults to something high). I earlier had a
  limited number of io_uring instances, but that added a fair amount of
  overhead because then submitting IO would require a lock.

  That again doesn't have to be solved as part of the earlier patches but
  might have some minor design impact. A rough sketch of the RLIMIT_NOFILE
  adjustment I have in mind is at the end of this mail.

Thanks again,

Andres Freund
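PS: Here's that RLIMIT_NOFILE sketch, done once at postmaster startup
(untested, error handling and GUC interaction omitted; the function name is
just for illustration):

	#include <sys/resource.h>

	static void
	RaiseSoftOpenFileLimit(void)
	{
		struct rlimit rlim;

		if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
			elog(LOG, "getrlimit(RLIMIT_NOFILE) failed: %m");
		else if (rlim.rlim_cur < rlim.rlim_max)
		{
			/*
			 * Raise the soft limit up to the hard limit, so each backend's
			 * io_uring instance FD doesn't run into the typical soft limit
			 * of 1024.
			 */
			rlim.rlim_cur = rlim.rlim_max;
			if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
				elog(LOG, "setrlimit(RLIMIT_NOFILE) failed: %m");
		}
	}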
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > Reattaching descriptors and memory in each child may work, or one could
> > just block io_method=io_uring under EXEC_BACKEND.
>
> I think the latter option is saner

Works for me.

> > > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > +{
> >
> > > +		if (ret == -EINTR)
> > > +		{
> > > +			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > > +			continue;
> > > +		}
> >
> > Since io_uring_submit() is a wrapper around io_uring_enter(), this should
> > also retry on EAGAIN. "man io_uring_enter" has:
> >
> >     EAGAIN The kernel was unable to allocate memory for the request, or
> >     otherwise ran out of resources to handle it. The application should
> >     wait for some completions and try again.
>
> Hm. I'm not sure that makes sense. We only allow a limited number of IOs to
> be in flight for each uring instance. That's different from a use of uring
> to e.g. wait for incoming network data on thousands of sockets, where you
> could have an essentially unbounded number of requests outstanding.
>
> What would we wait for? What if we were holding a critical lock in that
> moment? Would it be safe to just block for some completions? What if there's
> actually no IO in progress?

I'd try the following. First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design
beyond EAGAIN returns. I don't specifically know of a way it's unsafe. Do
just one pass of that; there may be newer IOs in progress afterward. If
submit still gets EAGAIN, sleep a bit and retry. Like we do in
pgwin32_open_handle(), fail after a fixed number of iterations. This isn't
great if we hold a critical lock, but it beats the alternative of PANIC on
the first EAGAIN.
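Roughly like this (sketch only; uring_instance stands in for whatever ring
pointer pgaio_uring_submit() actually uses, and
pgaio_complete_ios_of_other_procs() is a name I'm inventing here for the
cross-process completion pass):

	for (int retries = 0;;)
	{
		int		ret = io_uring_submit(uring_instance);

		if (ret == -EINTR)
		{
			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
			continue;
		}
		if (ret == -EAGAIN && retries++ < 100)
		{
			/*
			 * Out of kernel resources. On the first attempt, try to
			 * dislodge things with one pass advancing other processes'
			 * IOs to AHS_COMPLETED_SHARED; after that, sleep a bit before
			 * retrying, and give up after a fixed number of iterations
			 * (as pgwin32_open_handle() does).
			 */
			if (retries == 1)
				pgaio_complete_ios_of_other_procs();
			else
				pg_usleep(10000L);	/* 10ms */
			continue;
		}
		if (ret < 0)
			elog(PANIC, "io_uring_submit() failed: %s", strerror(-ret));
		break;
	}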
> > > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > > +				int iovcnt, off_t offset,
> > > +				uint32 wait_event_info)
> > > +{
> > > ...
> >
> > FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> > name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
>
> Hm - that doesn't necessarily seem right to me. I don't think the caller
> should assume that the IO will just be prepared and not already completed by
> the time FileStartWriteV() returns - we might actually do the IO
> synchronously.

Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that
defined the IO. Still, I think this shouldn't use the term "Start" while no
state name uses that term. What else could remove that mismatch?

> > Is there any specific decision you'd like to settle before patch 6 exits
> > WIP?
>
> Patch 6 specifically? That I really mainly kept separate for review - it

No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

> doesn't seem particularly interesting to commit it earlier than 7, or do you
> think differently?

No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.

> In case you mean 6+7 or 6 to ~11, I can think of the following:
>
> - I am worried about the need for bounce buffers for writes of checksummed
>   buffers. That quickly ends up being a significant chunk of memory,
>   particularly when using a small shared_buffers with a higher than default
>   number of connections. I'm currently hacking up a prototype that'd prevent
>   us from setting hint bits with just a share lock. I'm planning to start a
>   separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

> - The header split doesn't quite seem right yet.

I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.

> - I'd like to implement retries in the later patches, to make sure that it
>   doesn't have design implications.

Yes, that's a blocker to me.

> - Worker mode needs to be able to automatically adjust the number of running
>   workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)

> - I think the PgAioHandles need to be slimmed down a bit - there's some
>   design evolution visible that should not end up in the tree.

Okay.

> - I'm not sure that I like the name "subject" for the different things AIO
>   is performed for.

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of
  https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO,
  in the object-oriented sense]

> - I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
>   of buffer ids that are affected by one IO) could be replaced by
>   repurposing BufferDesc->freeNext or something along those lines. I don't
>   like the amount of memory required for storing those arrays, even if it's
>   not that much compared to needing space to store struct iovec[PG_IOV_MAX]
>   for each AIO handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

> - I'd like to extend the test module to actually test more cases, it's too
>   hard to reach some paths, particularly without [a lot] of users yet.
>   That's not strictly a dependency of the earlier patches - since the
>   initial patches can't actually do much in the way of IO.

Agreed. Among the post-patch check-world coverage, which uncovered parts have
the most risk?

> - We shouldn't reserve AioHandles etc. for IO workers - but because
>   different types of aux processes don't use a predetermined ProcNumber,
>   that's not entirely trivial without adding more complexity. I've actually
>   wondered whether IO workers should be their own "top-level" kind of
>   process, rather than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

Of the ones I'm calling non-blockers, which would you most regret deferring?

> - Right now the io_uring mode has each backend's io_uring instance visible
>   to each other process. That ends up using a fair number of FDs.
> That's OK from an efficiency perspective, but I think we'd need to add code
> to adjust the soft RLIMIT_NOFILE (it's set to 1024 on most distros because
> there are various programs that iterate over all possible FDs, causing
> significant

Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.
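For example, something along these lines where io_uring_queue_init() fails
(message wording illustrative only):

	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not create io_uring instance: %s",
					strerror(-ret)),
			 errhint("Consider raising RLIMIT_MEMLOCK or RLIMIT_NOFILE, or "
					 "check the kernel.io_uring_disabled sysctl.")));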
> slowdowns when the soft limit defaults to something high). I earlier had a
> limited number of io_uring instances, but that added a fair amount of
> overhead because then submitting IO would require a lock.
>
> That again doesn't have to be solved as part of the earlier patches but
> might have some minor design impact.

How far do you see the design impact spreading on that one?

Thanks,
nm


Hi,

On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
> > - I am worried about the need for bounce buffers for writes of checksummed
> >   buffers. That quickly ends up being a significant chunk of memory,
> >   particularly when using a small shared_buffers with a higher than
> >   default number of connections. I'm currently hacking up a prototype
> >   that'd prevent us from setting hint bits with just a share lock. I'm
> >   planning to start a separate thread about that.
>
> AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
> Doing better is nice, but I don't consider this a blocker. I recommend
> dealing with the worry by reducing the limit initially (128 blocks?). Can
> always raise it later.

On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.

Assuming 1ms latency - which isn't the high end of cloud storage latency -
256 blocks in flight limits you to <= 256MByte/s, even on storage that can
have a lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

Of course this could be addressed by tuning, but it seems like something that
shouldn't need to be tuned by the majority of folks running postgres.

We also discussed the topic at
https://postgr.es/m/20240925020022.c5.nmisch%40google.com

> ... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
> decision. From what I've heard so far of the performance effects, if it were
> me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
> buffer removal as a distinct project after the main AIO capability. Bounce
> buffers have an implementation. They aren't harming other design decisions.
> The AIO project is big, so I'd want to err on the side of not designating
> other projects as its prerequisites.

Given the issues that modifying pages while in flight causes, not just with
PG level checksums, but also filesystem level checksums, I don't feel like
it's a particularly promising approach.

However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff
has to be completed before we can move forward with AIO. If I split out the
write portion from the read portion a bit further, the main AIO changes and
the shared-buffer read user can be merged before there's a dependency on the
hint bit stuff being done.

Does that seem reasonable?

Greetings,

Andres Freund
On Mon, 30 Sept 2024 at 16:49, Andres Freund <andres@anarazel.de> wrote:
> On 2024-09-17 11:08:19 -0700, Noah Misch wrote:
> > > - I am worried about the need for bounce buffers for writes of
> > >   checksummed buffers. That quickly ends up being a significant chunk of
> > >   memory, particularly when using a small shared_buffers with a higher
> > >   than default number of connections. I'm currently hacking up a
> > >   prototype that'd prevent us from setting hint bits with just a share
> > >   lock. I'm planning to start a separate thread about that.
> >
> > AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
> > Doing better is nice, but I don't consider this a blocker. I recommend
> > dealing with the worry by reducing the limit initially (128 blocks?). Can
> > always raise it later.
>
> On storage that has nontrivial latency, like just about all cloud storage,
> even 256 will be too low. Particularly for checkpointer.
>
> Assuming 1ms latency - which isn't the high end of cloud storage latency -
> 256 blocks in flight limits you to <= 256MByte/s, even on storage that can
> have a lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

FYI, I think you're off by a factor of 8, i.e. that would be 2GB/sec and
666MB/sec respectively, given a normal page size of 8kB and exactly 1ms/3ms
full round trip latency:

    1 page/1 ms * 8kB/page * 256 concurrency
      = 256 pages/ms * 8kB/page
      = 2MiB/ms
      ~= 2GiB/sec

For 3ms, divide by 3 -> ~666MiB/sec.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:
> We also discussed the topic at
> https://postgr.es/m/20240925020022.c5.nmisch%40google.com
>
> > ... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
> > decision. From what I've heard so far of the performance effects, if it
> > were me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and
> > bounce buffer removal as a distinct project after the main AIO capability.
> > Bounce buffers have an implementation. They aren't harming other design
> > decisions. The AIO project is big, so I'd want to err on the side of not
> > designating other projects as its prerequisites.
>
> Given the issues that modifying pages while in flight causes, not just with
> PG level checksums, but also filesystem level checksums, I don't feel like
> it's a particularly promising approach.
>
> However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff
> has to be completed before we can move forward with AIO. If I split out the
> write portion from the read portion a bit further, the main AIO changes and
> the shared-buffer read user can be merged before there's a dependency on the
> hint bit stuff being done.
>
> Does that seem reasonable?

Yes.