Re: AIO v2.2 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.2
Date
Msg-id 20250106185220.c9.nmisch@google.com
Whole thread Raw
Responses Re: AIO v2.2
List pgsql-hackers
Patches 1 and 2 are still Ready for Committer.

On Tue, Dec 31, 2024 at 11:03:33PM -0500, Andres Freund wrote:
> - The README has been extended with an overview of the API. I think it gives a
>   good overview of how the API fits together.  I'd be very good to get
>   feedback from folks that aren't as familiar with AIO, I can't really see
>   what's easy/hard anymore.

That's a helpful addition.  I've left inline comments on it, below.

> The biggest TODOs are:
> 
> - Right now the API between bufmgr.c and read_stream.c kind of necessitates
>   that one StartReadBuffers() call actually can trigger multiple IOs, if
>   one of the buffers was read in by another backend, before "this" backend
>   called StartBufferIO().
> 
>   I think Thomas and I figured out a way to evolve the interface so that this
>   isn't necessary anymore:
> 
>   We allow StartReadBuffers() to memorize buffers it pinned but didn't
>   initiate IO on in the buffers[] argument. The next call to StartReadBuffers
>   then doesn't have to repin thse buffers. That doesn't just solve the
>   multiple-IOs for one "read operation" issue, it also make the - very common
>   - case of a bunch of "buffer misses" followed by a "buffer hit" cleaner, the
>   hit wouldn't be tracked in the same ReadBuffersOperation anymore.

That sounds reasonable.

> - Right now bufmgr.h includes aio.h, because it needs to include a reference
>   to the AIO's result in ReadBuffersOperation. Requiring a dynamic allocation
>   would be noticeable overhead, so that's not an option. I think the best
>   option here would be to introduce something like aio_types.h, so fewer
>   things are included.

That sounds fine.  Header splits aren't going to be perfect, so I'd pick
something (e.g. your proposal here) and move on.

> - There's no obvious way to tell "internal" function operating on an IO handle
>   apart from functions that are expected to be called by the issuer of an IO.
> 
>   One way to deal with this would be to introduce a distinct "issuer IO
>   reference" type.  I think that might be a good idea, it would also make it
>   clearer that a good number of the functions can only be called by the
>   issuer, before the IO is submitted.

That's reasonable, albeit non-critical.

> - The naming around PgAioReturn, PgAioResult, PgAioResultStatus needs to be
>   improved

POSIX uses the word "result" for the consequences of a function (e.g. the
result of unlink() is readdir() no longer finding the link).  It uses the word
"return" for a memory value that describes a result.  In that usage, the
struct currently called PgAioResult would be a Return.  The struct currently
called PgAioReturn is PgAioResult plus the data to identify the IO.  Possible
name changes:

  PgAioResult -> PgAioReturn
  PgAioReturn -> PgAioReturnIdentified | PgAioReturnID | PgAioReturnTagged [I don't love these]
  PgAioResultStatus -> PgAioStatus | PgAioFill

That said, I don't dislike the existing names and would not have raised the
topic myself.

> - The debug logging functions are a bit of a mess, lots of very similar code
>   in lots of places. I think AIO needs a few ereport() wrappers to make this
>   easier.

May as well.

> - More tests are needed. None of our current test frameworks really makes this
>   easy :(.

Which testing gap do you find most concerning?  I'd be most interested in the
cases that would be undetected deadlocks under a naive design.  An example
appeared at the end of postgr.es/m/20240916144349.74.nmisch@google.com

> - Several folks asked for pg_stat_aio to come back, in "v1" that showed the
>   set of currently in-flight AIOs.  That's not particularly hard - except
>   that it doesn't really fit in the pg_stat_* namespace.

Later message
postgr.es/m/6vjl6jeaqvyhfbpgwziypwmhem2rwla4o5pgpuxwtg3o3o3jb5@evyzorb5meth is
considering the name pg_aios.  Works for me.

> --- a/src/backend/storage/aio/aio.c
> +++ b/src/backend/storage/aio/aio.c
> @@ -3,6 +3,28 @@
>   * aio.c
>   *    AIO - Core Logic
>   *
> + * For documentation about how AIO works on a higher level, including a
> + * schematic example, see README.md.
> + *
> + *
> + * AIO is a complicated subsystem. To keep things navigable it is split across
> + * a number of files:
> + *
> + * - aio.c - core AIO state handling
> + *
> + * - aio_init.c - initialization
> + *
> + * - aio_io.c - dealing with actual IO, including executing IOs synchronously
> + *
> + * - aio_subject.c - functionality related to executing IO for different
> + *   subjects
> + *
> + * - method_*.c - different ways of executing AIO
> + *
> + * - read_stream.c - helper for accessing buffered relation data with
> + *     look-ahead
> + *

I felt like some list entries in this new header comment largely restated the
file name.  Here's how I'd write them to avoid that:

 * - method_*.c - different ways of executing AIO (e.g. worker process)
 * - aio_io.c - method-independent code for specific IO ops (e.g. readv)
 * - aio_subject.c - callbacks at IO operation lifecycle events
 * - aio_init.c - per-fork and per-startup-process initialization
 * - aio.c - all other topics
 * - read_stream.c - helper for reading buffered relation data

> --- /dev/null
> +++ b/src/backend/storage/aio/README.md
> @@ -0,0 +1,413 @@
> +# Asynchronous & Direct IO

I would move "### Why Asynchronous IO" to here; that's good background before
getting into the example.  I might also move "### Why Direct / unbuffered IO"
to here.  For me as a reader, I'd benefit from seeing things in this order:

- "why"
- condensed usage example like manpage SYNOPSIS, comments and decls removed
- PgAioHandleState and discussion of valid transitions
- usage example as it is, with full comments
- the rest

In other words, like this:

# Asynchronous & Direct IO

## Motivation

### Why Asynchronous IO

[existing content moved from lower in the file]

## Synopsis

ioh = pgaio_io_get(CurrentResourceOwner, &ioret);
pgaio_io_get_ref(ioh, &ior);
pgaio_io_add_shared_cb(ioh, ASC_SHARED_BUFFER_READ);
pgaio_io_set_io_data_32(ioh, (uint32 *) buffer, 1);
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
               BufferGetBlock(buffer), 1);
pgaio_submit_staged();
pgaio_io_ref_wait(&ior);
if (ioret.result.status == ARS_ERROR)
    pgaio_result_log(aio_ret.result, &aio_ret.subject_data, ERROR);

## I/O Operation States & Transitions

[PgAioHandleState and its transitions]

## AIO Usage Example

[your content:]

> +
> +## AIO Usage Example
> +
> +In many cases code that can benefit from AIO does not directly have to
> +interact with the AIO interface, but can use AIO via higher-level
> +abstractions. See [Helpers](#helpers).
> +
> +In this example, a buffer will be read into shared buffers.
> +
> +```C
> +/*
> + * Result of the operation, only to be accessed in this backend.
> + */
> +PgAioReturn ioret;
> +
> +/*
> + * Acquire AIO Handle, ioret will get result upon completion.

Consider adding: from here to pgaio_submit_staged(), don't do [description of
the kind of unacceptable blocking operations].

> + * Once the IO handle has been handed of, it may not further be used, as the

s/of/off/

> +### IO can be started in critical sections
...
> +The need to be able to execute IO in critical sections has substantial design
> +implication on the AIO subsystem. Mainly because completing IOs (see prior
> +section) needs to be possible within a critical section, even if the
> +to-be-completed IO itself was not issued in a critical section. Consider
> +e.g. the case of a backend first starting a number of writes from shared
> +buffers and then starting to flush the WAL. Because only a limited amount of
> +IO can be in-progress at the same time, initiating the IO for flushing the WAL
> +may require to first finish executing IO executed earlier.

The last line's two appearances of the word "execute" read awkwardly to me,
and it's an opportunity to use PgAioHandleState terms.  Consider writing the
last line like "may first advance an existing IO from AHS_PREPARED to
AHS_COMPLETED_SHARED".

> +ASLR. This means that the shared memory cannot contain pointer to callbacks.

s/pointer/pointers/

> +### AIO Callbacks
...
> +In addition to completion, AIO callbacks also are called to "prepare" an
> +IO. This is, e.g., used to acquire buffer pins owned by the AIO subsystem for
> +IO to/from shared buffers, which is required to handle the case where the
> +issuing backend errors out and releases its own pins.

Reading this, it's not obvious to me how to reconcile "finishing an IO could
require pin acquisition" with "finishing an IO could happen in a critical
section".  Pinning a buffer in a critical section sounds bad.  I vaguely
recall understanding how it was okay as of my September review, but I've
already forgotten.  Can this text have a sentence making that explicit?

> +### AIO Subjects
> +
> +In addition to the completion callbacks describe above, each AIO Handle has
> +exactly one "subject". Each subject has some space inside an AIO Handle with
> +information specific to the subject and can provide callbacks to allow to
> +reopen the underlying file (required for worker mode) and to describe the IO
> +operation (used for debug logging and error messages).

Can this say roughly how to decide when to add a new subject?  Failing that,
can it give examples of what additional subjects might exist if certain
existing subsystems were to start using AIO?

> +### AIO Results
> +
> +As AIO completion callbacks
> +[are executed in critical sections](#io-can-be-started-in-critical-sections)
> +and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
> +completion callbacks cannot be used to, e.g., make the query that triggered an
> +IO ERROR out.
> +
> +To allow to react to failing IOs the issuing backend can pass a pointer to a
> +`PgAioReturn` in backend local memory. Before an AIO Handle is reused the
> +`PgAioReturn` is filled with information about the IO. This includes
> +information about whether the IO was successful (as a value of
> +`PgAioResultStatus`) and enough information to raise an error in case of a
> +failure (via `pgaio_result_log()`, with the error details encoded in
> +`PgAioResult`).

Can this have a sentence on how this fits in bounded shmem, given the lack of
guarantees about a backend's responsiveness?  In other words, what makes it
okay to have requests take arbitrarily long to move from AHS_COMPLETED_SHARED
to AHS_COMPLETED_LOCAL?

Thanks,
nm



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Next
From: Noah Misch
Date:
Subject: Re: Converting contrib SQL functions to new style