Re: AIO v2.2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.2
Date
Msg-id j6hny5ivrfqw356ugoy3ti5ccadamluekxod4k6amao5snew6c@t5h3bwhrgfqx
Whole thread Raw
In response to Re: AIO v2.2  (Noah Misch <noah@leadboat.com>)
Responses Re: AIO v2.2
List pgsql-hackers
Hi,

On 2025-01-06 10:52:20 -0800, Noah Misch wrote:
> Patches 1 and 2 are still Ready for Committer.

I feel somewhat weird about pushing 0002 without a user, but I guess it's
still exercised, so it's probably fine...


> On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
> > - The README has been extended with an overview of the API. I think it gives a
> >   good overview of how the API fits together.  I'd be very good to get
> >   feedback from folks that aren't as familiar with AIO, I can't really see
> >   what's easy/hard anymore.
>
> That's a helpful addition.  I've left inline comments on it, below.

Cool!


> > - More tests are needed. None of our current test frameworks really makes this
> >   easy :(.
>
> Which testing gap do you find most concerning?

Most of it isn't even AIO specific...


- temporary tables are rather poorly tested in general:
  - e.g. trivial to exceed the number of buffers, but our tests don't reach that

- We have pretty no testing for IO errors. We have a bit of coverage due to
  src/bin/pg_amcheck/t/003_check.pl, but that's for errors originating in
  bufmgr.c itself.

- no real testing of StartBufferIO's etc wait paths

- no testing for BM_PIN_COUNT_WAITER


I e.g. just noticed that the error handling for AIO on temp tables was broken
- but our tests never reach that:

  The bug exists due to temp tables not differentiating between "backend" pins
  and a "global pincount" - which means that there's no real way for the AIO
  subsystem to have a reference separate from the backend local pin -
  CheckForLocalBufferLeaks() complains about any leftover pins. It seems to
  works in non-assert mode, but with assertions transaction abort asserts out.


> I'd be most interested in the
> cases that would be undetected deadlocks under a naive design.  An example
> appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

That's a good one, yea.

I think I'll try to translate the regression tests I wrote into an isolation
test, I hope that'll make it a bit easier to cover more cases.

And then we'll need more injection points, I'm afraid :(.


> > - Several folks asked for pg_stat_aio to come back, in "v1" that showed the
> >   set of currently in-flight AIOs.  That's not particularly hard - except
> >   that it doesn't really fit in the pg_stat_* namespace.
>
> Later message
> postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
> considering the name pg_aios.  Works for me.

Cool.


> > --- a/src/backend/storage/aio/aio.c
> > +++ b/src/backend/storage/aio/aio.c
> > @@ -3,6 +3,28 @@
> >   * aio.c
> >   *    AIO - Core Logic
> >   *
> > + * For documentation about how AIO works on a higher level, including a
> > + * schematic example, see README.md.
> > + *
> > + *
> > + * AIO is a complicated subsystem. To keep things navigable it is split across
> > + * a number of files:
> > + *
> > + * - aio.c - core AIO state handling
> > + *
> > + * - aio_init.c - initialization
> > + *
> > + * - aio_io.c - dealing with actual IO, including executing IOs synchronously
> > + *
> > + * - aio_subject.c - functionality related to executing IO for different
> > + *   subjects
> > + *
> > + * - method_*.c - different ways of executing AIO
> > + *
> > + * - read_stream.c - helper for accessing buffered relation data with
> > + *     look-ahead
> > + *
>
> I felt like some list entries in this new header comment largely restated the
> file name.  Here's how I'd write them to avoid that:

Thanks, adopting.


>  * - method_*.c - different ways of executing AIO (e.g. worker process)
>  * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
>  * - aio_subject.c - callbacks at IO operation lifecycle events
>  * - aio_init.c - per-fork and per-startup-process initialization

I don't particularly like "per-startup-process", because "global
initialization" really is separate (and precedes) from startup processes
startup. Maybe "per-server and per-backend initialization"?

>  * - aio.c - all other topics
>  * - read_stream.c - helper for reading buffered relation data

Did the order you listed the files have a system to it? If so, what is it?


> > --- /dev/null
> > +++ b/src/backend/storage/aio/README.md
> > @@ -0,0 +1,413 @@
> > +# Asynchronous & Direct IO
>
> I would move "### Why Asynchronous IO" to here; that's good background before
> getting into the example.

I moved the example back and forth when writing because different readers
would benefit from a different order and I couldn't quite decide.

So I'm happy to adjust based on your feedback...


> I might also move "### Why Direct / unbuffered IO" to here.  For me as a
> reader, I'd benefit from seeing things in this order:
>
> - "why"
> - condensed usage example like manpage SYNOPSIS, comments and decls removed
> - PgAioHandleState and discussion of valid transitions

Hm - why have PgAioHandleState and its states before the usage example?  Seems
like it'd be harder to understand that way.

> - usage example as it is, with full comments
> - the rest


> ## Synopsis
>
> ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
> pgaio_io_get_ref(ioh, &ior);
> pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
> pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
> smgrstartreadv(ioh, operation->smgr, forknum, blkno,
>                BufferGetBlock(buffer), 1);
> pgaio_submit_staged();
> pgaio_io_ref_wait(&ior);
> if (ioret.result.status == ARS_ERROR)
>     pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);

Happy to add this, but I'm not entirely sure if that's really that useful to
have without commentary? The synopsis in manpages is helpful because it
provides the signature of various functions, but this wouldn't...



> > +
> > +## AIO Usage Example
> > +
> > +In many cases code that can benefit from AIO does not directly have to
> > +interact with the AIO interface, but can use AIO via higher-level
> > +abstractions. See [Helpers](#helpers).
> > +
> > +In this example, a buffer will be read into shared buffers.
> > +
> > +```C
> > +/*
> > + * Result of the operation, only to be accessed in this backend.
> > + */
> > +PgAioReturn ioret;
> > +
> > +/*
> > + * Acquire AIO Handle, ioret will get result upon completion.
>
> Consider adding: from here to pgaio_submit_staged(), don't do [description of
> the kind of unacceptable blocking operations].

Hm.  Strictly speaking it's fine to block here, depending on whether
StartBufferIO() was already called.  I'll clarify.


> > +### IO can be started in critical sections
> ...
> > +The need to be able to execute IO in critical sections has substantial design
> > +implication on the AIO subsystem. Mainly because completing IOs (see prior
> > +section) needs to be possible within a critical section, even if the
> > +to-be-completed IO itself was not issued in a critical section. Consider
> > +e.g. the case of a backend first starting a number of writes from shared
> > +buffers and then starting to flush the WAL. Because only a limited amount of
> > +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> > +may require to first finish executing IO executed earlier.
>
> The last line's two appearances of the word "execute" read awkwardly to me,
> and it's an opportunity to use PgAioHandleState terms.  Consider writing the
> last line like "may first advance an existing IO from AHS_PREPARED to
> AHS_COMPLETED_SHARED".

It is indeed awkward.  I don't love referencing the state-constants here
though, somehow that feels like a reference-cycle ;). What about this:

> ... Consider
> e.g. the case of a backend first starting a number of writes from shared
> buffers and then starting to flush the WAL. Because only a limited amount of
> IO can be in-progress at the same time, initiating IO for flushing the WAL may
> require to first complete IO that was started earlier.



> > +### AIO Callbacks
> ...
> > +In addition to completion, AIO callbacks also are called to "prepare" an
> > +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> > +IO to/from shared buffers, which is required to handle the case where the
> > +issuing backend errors out and releases its own pins.
>
> Reading this, it's not obvious to me how to reconcile "finishing an IO could
> require pin acquisition" with "finishing an IO could happen in a critical
> section".  Pinning a buffer in a critical section sounds bad.  I vaguely
> recall understanding how it was okay as of my September review, but I've
> already forgotten.  Can this text have a sentence making that explicit?

Ah, yes, that's easy to misunderstand. The answer basically is that we don't
newly pin a buffer, we just increment the reference count by 1. That should
never fail.

How about:
> In addition to completion, AIO callbacks also are called to "prepare" an
> IO. This is, e.g., used to increase buffer reference counts to account for the
> AIO subsystem referencing the buffer, which is required to handle the case
> where the issuing backend errors out and releases its own pins while the IO is
> still ongoing.



> > +### AIO Subjects
> > +
> > +In addition to the completion callbacks describe above, each AIO Handle has
> > +exactly one "subject". Each subject has some space inside an AIO Handle with
> > +information specific to the subject and can provide callbacks to allow to
> > +reopen the underlying file (required for worker mode) and to describe the IO
> > +operation (used for debug logging and error messages).
>
> Can this say roughly how to decide when to add a new subject?

Hm, there obviously is some fuzziness. I was trying to get to some of that by
mentioning that the subject needs to know how to [re-]open a file and describe
the target of the IO in terms that make sense to the user.

E.g. smgr seemed to make sense as a subject as the smgr layer knows how to
open a file by delegating that to the layer below and the layer above just
knows about smgr, not md.c (or other potential smgr implementations).

The reason to keep this separate from the callbacks was that smgr IO going
through shared buffers, bypassing shared buffers and different smgr
implemenentations all could share the same subject implementation, even if
callbacks would differ between these use cases.


How about:

> I.e., if two different uses of AIO can describe the identity of the file being
> operated on the same way, it likely makes sense to use the same
> subject. E.g. different smgr implementations can describe IO with
> RelFileLocator, ForkNumber and BlockNumber and can thus share a subject. In
> contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
> and it would not make sense to use the same subject for smgr and WAL.




> Failing that, can it give examples of what additional subjects might exist
> if certain existing subsystems were to start using AIO?

I think the main ones I can think of are:

1) WAL logging

   This was implemented in v1. I'd guess that "real" WAL logging and
   initializing new WAL segments might use a different subject, but that's
   probably a question of taste.

2) "raw" file IO, for things that don't use the smgr abstraction. I could
   e.g. imagine using AIO in COPY to read / write the FROM/TO file or to
   implement CREATE DATABASE ... STRATEGY file_copy with AIO.

   This was used in v1, e.g. to implement the initial data directory sync
   after a crash. We do that on a filesystem level, not going through smgr
   etc.

3) FE/BE network IO



> > +### AIO Results
> > +
> > +As AIO completion callbacks
> > +[are executed in critical sections](#io-can-be-started-in-critical-sections)
> > +and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
> > +completion callbacks cannot be used to, e.g., make the query that triggered an
> > +IO ERROR out.
> > +
> > +To allow to react to failing IOs the issuing backend can pass a pointer to a
> > +`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
> > +`PgAioReturn` is filled with information about the IO. This includes
> > +information about whether the IO was successful (as a value of
> > +`PgAioResultStatus`) and enough information to raise an error in case of a
> > +failure (via `pgaio_result_log()`, with the error details encoded in
> > +`PgAioResult`).
>
> Can this have a sentence on how this fits in bounded shmem, given the lack of
> guarantees about a backend's responsiveness?  In other words, what makes it
> okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
> to AHS_COMPLETED_LOCAL?

I agree this should be explained somewhere - but not sure this is the best
place.

The reason it's ok is that each backend has a limited number of AIO handles
and if it runs out of IO handles we'll a) check if any IOs can be reclaimed b)
wait for the oldest IO to finish.


Thanks for the review!

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: allow changing autovacuum_max_workers without restarting
Next
From: Tomas Vondra
Date:
Subject: Re: Adjusting hash join memory limit to handle batch explosion