Re: AIO v2.0 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.0
Date
Msg-id 237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
Whole thread Raw
Responses Re: AIO v2.0
List pgsql-hackers
Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > There's plenty more to do, but I thought this would be a useful checkpoint.
>
> I find patches 1-5 are Ready for Committer.

Cool!


> > +typedef enum PgAioHandleState
>
> This enum clarified a lot for me, so I wish I had read it before anything
> else.  I recommend referring to it in README.md.

Makes sense.


> Would you also cover the valid state transitions and which of them any
> backend can do vs. which are specific to the defining backend?

Yea, we should. I earlier had something, but because details were still
changing it was hard to keep up2date.


> > +{
> > +    /* not in use */
> > +    AHS_IDLE = 0,
> > +
> > +    /* returned by pgaio_io_get() */
> > +    AHS_HANDED_OUT,
> > +
> > +    /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> > +    AHS_DEFINED,
> > +
> > +    /* subjects prepare() callback has been called */
> > +    AHS_PREPARED,
> > +
> > +    /* IO is being executed */
> > +    AHS_IN_FLIGHT,
>
> Let's align terms between functions and states those functions reach.  For
> example, I recommend calling this state AHS_SUBMITTED, because
> pgaio_io_prepare_submit() is the function reaching this state.
> (Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.


> > +void
> > +pgaio_io_ref_wait(PgAioHandleRef *ior)
> > +{
> > +    uint64        ref_generation;
> > +    PgAioHandleState state;
> > +    bool        am_owner;
> > +    PgAioHandle *ioh;
> > +
> > +    ioh = pgaio_io_from_ref(ior, &ref_generation);
> > +
> > +    am_owner = ioh->owner_procno == MyProcNumber;
> > +
> > +
> > +    if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> > +        return;
> > +
> > +    if (am_owner)
> > +    {
> > +        if (state == AHS_DEFINED || state == AHS_PREPARED)
> > +        {
> > +            /* XXX: Arguably this should be prevented by callers? */
> > +            pgaio_submit_staged();
>
> Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
> longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.

That, or not having submitted the IO.  One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.



> > +void
> > +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> > +{
> > +    Assert(ioh->state == AHS_HANDED_OUT);
> > +    Assert(pgaio_io_has_subject(ioh));
> > +
> > +    ioh->op = op;
> > +    ioh->state = AHS_DEFINED;
> > +    ioh->result = 0;
> > +
> > +    /* allow a new IO to be staged */
> > +    my_aio->handed_out_io = NULL;
> > +
> > +    pgaio_io_prepare_subject(ioh);
> > +
> > +    ioh->state = AHS_PREPARED;
>
> As defense in depth, let's add a critical section from before assigning
> AHS_DEFINED to here.  This code already needs to be safe for that (per
> README.md).  When running outside a critical section, an ERROR in a subject
> callback could leak the lwlock disowned in shared_buffer_prepare_common().  I
> doubt there's a plausible way to reach that leak today, but future subject
> callbacks could add risk over time.

Makes sense.


> > +if test "$with_liburing" = yes; then
> > +  PKG_CHECK_MODULES(LIBURING, liburing)
> > +fi
>
> I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.


> With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
> fails early:

Right  - that's to be expected.

> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by
gcc(Debian 13.2.0-13) 13.2.0, 64-bit
 
> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  listening on Unix socket
"/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
> 2024-09-15 12:46:08.203 PDT startup[2069423] LOG:  database system was shut down at 2024-09-15 12:46:07 PDT
> 2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL:  the database system is starting up
> 2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG:  database system is ready to accept connections
> 2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC:  failed: -9/Bad file descriptor
> 2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC:  failed: -95/Operation not supported
> 2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC:  unexpected: -95/Operation not supported: No
suchfile or directory
 
> 2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG:  received fast shutdown request
>
> I expect that's from io_uring_queue_init() stashing in shared memory a file
> descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
> Reattaching descriptors and memory in each child may work, or one could just
> block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful fo portability testing.


> > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > +{

> > +        if (ret == -EINTR)
> > +        {
> > +            elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > +            continue;
> > +        }
>
> Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
> retry on EAGAIN.  "man io_uring_enter" has:
>
>     EAGAIN The kernel was unable to allocate memory for the request, or
>     otherwise ran out of resources to handle it. The application should wait
>     for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?


> > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > +                int iovcnt, off_t offset,
> > +                uint32 wait_event_info)
> > +{
> > ...
>
> FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.


> For non-sync IO methods, I gather it's essential that a process other than the
> IO definer be scanning for incomplete IOs and completing them.

Yep - it's something I've been fighting with / redesigning a *lot*.  Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:

1) What if the IO is being completed in a critical section? We can't reopen
   the file in that situation. My initial fix for this was to defer retries,
   but that's problematic too:

2) Acquiring an IO needs to be able to guarantee forward progress. Because
   there's a limited number of IOs that means we need to be able to complete
   IOs while acquiring an IO. So we can't just keep the IO handle around -
   which in turn means that we'd need to save the state for retrying
   somewhere. Which would require some pre-allocated memory to save that
   state.

Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...


> Otherwise, deadlocks like this would happen:

> backend1 locks blk1 for non-IO reasons
> backend2 locks blk2, starts AIO write
> backend1 waits for lock on blk2 for non-IO reasons
> backend2 waits for lock on blk1 for non-IO reasons
>
> If that's right, in worker mode, the IO worker resolves that deadlock.  What
> resolves it under io_uring?  Another process that happens to do
> pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
> make that happen systematically.

Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.


> I could share more-tactical observations about patches 6-20, but they're
> probably things you'd change without those observations.

Agreed.


> Is there any specific decision you'd like to settle before patch 6 exits
> WIP?

Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
  buffers. That quickly ends up being a significant chunk of memory,
  particularly when using a small shared_buffers with a higher than default
  number of connection. I'm currently hacking up a prototype that'd prevent us
  from setting hint bits with just a share lock. I'm planning to start a
  separate thread about that.

- The header split doesn't yet quite seem right yet

- I'd like to implement retries in the later patches, to make sure that it
  doesn't have design implications

- Worker mode needs to be able to automatically adjust the number of running
  workers, I think - otherwise it's going to be too hard to tune.

- I think the PgAioHandles need to be slimmed down a bit - there's some design
  evolution visible that should not end up in the tree.

- I'm not sure that I like name "subject" for the different things AIO is
  performed for

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
  of buffer ids that are affected by one IO) could be replaced by repurposing
  BufferDesc->freeNext or something along those lines. I don't like the amount
  of memory required for storing those arrays, even if it's not that much
  compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
  handle.

- I'd like to extend the test module to actually test more cases, it's too
  hard to reach some paths, particularly without [a lot] of users yet. That's
  not strictly a dependency of the earlier patches - since the initial patches
  can't actually do much in the way of IO.

- We shouldn't reserve AioHandles etc for io workers - but because different
  tpes of aux processes don't use a predetermined ProcNumber, that's not
  entirely trivial without adding more complexity.  I've actually wondered
  whether IO workes should be their own "top-level" kind of process, rather
  than an aux process. But that seems quite costly.

- Right now the io_uring mode has each backend's io_uring instance visible to
  each other process. That ends up using a fair number of FDs. That's OK from
  an efficiency perspective, but I think we'd need to add code to adjust the
  soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
  various programs that iterate over all possible FDs, causing significant
  slowdowns when the soft limit defaults to something high).  I earlier had a
  limited number of io_uring instances, but that added a fair amount of
  overhead because then submitting IO would require a lock.

  That again doesn't have to be solved as part of the earlier patches but
  might have some minor design impact.


Thanks again,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Document DateStyle effect on jsonpath string()
Next
From: Andres Freund
Date:
Subject: Re: AIO v2.0