Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250329212929.a6.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
Re: AIO v2.5
List pgsql-hackers
Flushing half-baked review comments before going offline for a few hours:

On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote:
> Attached v2.13, with the following changes:

>   5) The WARNING in the callback is now a LOG, as it will be sent to the
>      client as a WARNING explicitly when the IO's results are processed
> 
>      I actually chose LOG_SERVER_ONLY - that seemed slightly better than just
>      LOG? But not at all sure.

LOG_SERVER_ONLY and its synonym COMMERR look to be used for:

- ProcessLogMemoryContextInterrupt()
- messages before successful authentication
- protocol sync loss, where we'd fail to send a client message
- client already gone

The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT.
If a client has decided to set client_min_messages that high, the client might
be interested in the fact that it got side-tracked completing someone else's
IO.  On the other hand, almost none of those sidetrack events will produce
messages.  The main argument I'd envision for LOG_SERVER_ONLY is that we
consider the message content sensitive, but I don't see the message content as
materially sensitive.

Since you committed LOG_SERVER_ONLY, let's keep that decision.  My last draft
review discouraged it, but it doesn't matter.  pgaio_result_report() should
assert elevel != LOG to avoid future divergence.

> - Previously the buffer completion callback checked zero_damaged_pages - but
>   that's not right, the GUC hopefully is only set on a per-session basis

Good catch.  I've now audited the complete_shared callbacks for other variable
references and actions not acceptable there.  I found nothing beyond what you
found by v2.14.

On Sat, Mar 29, 2025 at 10:48:10AM -0400, Andres Freund wrote:
> On 2025-03-29 06:41:43 -0700, Noah Misch wrote:
> > On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote:

> Subject: [PATCH v2.14 01/29] Fix mis-attribution of checksum failure stats to
>  the wrong database

I've skipped reviewing this patch, since it's already committed.  If it needs
post-commit review, let me know.

> Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd

> +        /*
> +         * Immediately log a message about the IO error, but only to the
> +         * server log. The reason to do so immediately is that the originator
> +         * might not process the query result immediately (because it is busy
> +         * doing another part of query processing) or at all (e.g. if it was
> +         * cancelled or errored out due to another IO also failing).  The
> +         * issuer of the IO will emit an ERROR when processing the IO's

s/issuer/definer/ please, to avoid proliferating synonyms.  Likewise two other
places in the patches.

> +/*
> + * smgrstartreadv() -- asynchronous version of smgrreadv()
> + *
> + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
> + * `ioh` all parameters are the same as smgrreadv().

I would add a comment starting with:

  Compared to smgrreadv(), more responsibilities fall on layers above smgr.
  Higher layers handle partial reads.  smgr will ereport(LOG_SERVER_ONLY) some
  problems, but higher layers are responsible for pgaio_result_report() to
  mirror that news to the user and (for ERROR) abort the (sub)transaction.

md_readv_complete() comment "issuer of the IO will emit an ERROR" says some of
that, but someone adding a smgrstartreadv() call is less likely to find it
there.

I say "comment starting with", because I think there's a remaining decision
about who owns the zeroing currently tied to smgrreadv().  An audit of
mdreadv() vs. AIO counterparts found this part of mdreadv():

            if (nbytes == 0)
            {
                /*
                 * We are at or past EOF, or we read a partial block at EOF.
                 * Normally this is an error; upper levels should never try to
                 * read a nonexistent block.  However, if zero_damaged_pages
                 * is ON or we are InRecovery, we should instead return zeroes
                 * without complaining.  This allows, for example, the case of
                 * trying to update a block that was later truncated away.
                 */
                if (zero_damaged_pages || InRecovery)
                {

I didn't write a test to prove its absence, but I'm not finding such code in
the AIO path.  I wondered if we could just Assert(!InRecovery), but adding
that to md_readv_complete() failed 001_stream_rep.pl with this stack:

ExceptionalCondition at assert.c:52
md_readv_complete at md.c:2043
pgaio_io_call_complete_shared at aio_callback.c:258
pgaio_io_process_completion at aio.c:515
pgaio_io_perform_synchronously at aio_io.c:148
pgaio_io_stage at aio.c:453
pgaio_io_start_readv at aio_io.c:87
FileStartReadV at fd.c:2243
mdstartreadv at md.c:1005
smgrstartreadv at smgr.c:757
AsyncReadBuffers at bufmgr.c:1938
StartReadBuffersImpl at bufmgr.c:1422
StartReadBuffer at bufmgr.c:1515
ReadBuffer_common at bufmgr.c:1246
ReadBufferExtended at bufmgr.c:818
vm_readbuf at visibilitymap.c:584
visibilitymap_pin at visibilitymap.c:203
heap_xlog_insert at heapam_xlog.c:450
heap_redo at heapam_xlog.c:1195
ApplyWalRecord at xlogrecovery.c:1995
PerformWalRecovery at xlogrecovery.c:1825
StartupXLOG at xlog.c:5895

If this is a real problem, fix options may include:

- Implement the InRecovery zeroing for real.
- Make the InRecovery case somehow use real mdreadv(), not
  pgaio_io_perform_synchronously() to use AIO APIs with synchronous AIO.  I'll
  guess this is harder than the previous option, though.

> Subject: [PATCH v2.14 04/29] aio: Add pg_aios view

> +        /*
> +         * There is no lock that could prevent the state of the IO to advance
> +         * concurrently - and we don't want to introduce one, as that would
> +         * introduce atomics into a very common path. Instead we
> +         *
> +         * 1) Determine the state + generation of the IO.
> +         *
> +         * 2) Copy the IO to local memory.
> +         *
> +         * 3) Check if state or generation of the IO changed. If the state
> +         * changed, retry, if the generation changed don't display the IO.
> +         */
> +
> +        /* 1) from above */
> +        start_generation = live_ioh->generation;
> +        pg_read_barrier();

I think "retry:" needs to be here, above start_state assignment.  Otherwise,
the "live_ioh->state != start_state" test will keep seeing a state mismatch.

> +        start_state = live_ioh->state;
> +
> +retry:
> +        if (start_state == PGAIO_HS_IDLE)
> +            continue;


> Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well
> Subject: [PATCH v2.14 07/29] aio: Add WARNING result status
> Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in
>  critical sections
> Subject: [PATCH v2.14 09/29] Add errhint_internal()

Ready for commit


> Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support

> Buffer reads executed this infrastructure will report invalid page / checksum
> errors / warnings differently than before:

s/this/through this/

> +    *zeroed_or_error_count = rem_error & ((1 << 7) - 1);
> +    rem_error >>= 7;

These raw "7" are good places to use your new #define values.  Likewise in
buffer_readv_encode_error().

> + *   that was errored or zerored or, if no errors/zeroes, the first ignored

s/zerored/zeroed/

> +     * enough. If there is an error, the error is the integeresting offset,

typo "integeresting"

> +/*
> + * We need a backend-local completion callback for shared buffers, to be able
> + * to report checksum errors correctly. Unfortunately that can only safely
> + * happen if the reporting backend has previously called

Missing end of sentence.

> @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>       */

There's an outdated comment ending here:

    /*
     * Throw a WARNING if the checksum fails, but only after we've checked for
     * the all-zeroes case.
     */

>      if (checksum_failure)
>      {
> -        if ((flags & PIV_LOG_WARNING) != 0)
> -            ereport(WARNING,
> +        if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
> +            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,


> Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control
>  ignore_checksum_failure
> Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers()
> Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design
> Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO
> Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode
>  support
> Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as
>  wait time
> Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems

Ready for commit


> Subject: [PATCH v2.14 18/29] aio: Add test_aio module

I didn't yet re-review the v2.13 or 2.14 changes to this one.  That's still in
my queue.  One thing I noticed anyway:

> +# Tests using injection points. Mostly to exercise had IO errors that are

s/had/hard/


On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote:
> On 2025-03-29 10:48:10 -0400, Andres Freund wrote:
> > Attached is v2.14:

> > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above)
> 
> I think I found an issue with this one - as it stands the view was viewable by
> everyone. While it doesn't provide a *lot* of insight, it still seems a bit
> too much for an unprivileged user to learn what part of a relation any other
> user is currently reading.
> 
> There'd be two different ways to address that:
> 1) revoke view & function from public, grant to a limited role (presumably
>    pg_read_all_stats)
> 2) copy pg_stat_activity's approach of using something like
> 
>    #define HAS_PGSTAT_PERMISSIONS(role)     (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) ||
has_privs_of_role(GetUserId(),role))
 
> 
>    on a per-IO basis.

No strong opinion.  I'm not really worried about any of this information
leaking.  Nothing in pg_aios comes close to the sensitivity of
pg_stat_activity.query.  pg_stat_activity is mighty cautious, hiding even
stuff like wait_event_type that I wouldn't worry about.  Hence, another valid
choice is (3) change nothing.

Meanwhile, I see substantially less need to monitor your own IOs than to
monitor your own pg_stat_activity rows, and even your own IOs potentially
reveal things happening in other sessions, e.g. evicting buffers that others
read and you never read.  So restrictions wouldn't be too painful, and (1)
arguably helps privacy more than (2).

I'd likely go with (1) today.



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Using read stream in autoprewarm
Next
From: Alexander Korotkov
Date:
Subject: Re: Replace IN VALUES with ANY in WHERE clauses during optimization