Re: AIO v2.5 - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: AIO v2.5 |
Date | |
Msg-id | 20250329212929.a6.nmisch@google.com Whole thread Raw |
In response to | Re: AIO v2.5 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: AIO v2.5
Re: AIO v2.5 |
List | pgsql-hackers |
Flushing half-baked review comments before going offline for a few hours: On Wed, Mar 26, 2025 at 09:07:40PM -0400, Andres Freund wrote: > Attached v2.13, with the following changes: > 5) The WARNING in the callback is now a LOG, as it will be sent to the > client as a WARNING explicitly when the IO's results are processed > > I actually chose LOG_SERVER_ONLY - that seemed slightly better than just > LOG? But not at all sure. LOG_SERVER_ONLY and its synonym COMMERR look to be used for: - ProcessLogMemoryContextInterrupt() - messages before successful authentication - protocol sync loss, where we'd fail to send a client message - client already gone The choice between LOG and LOG_SERVER_ONLY doesn't matter much for $SUBJECT. If a client has decided to set client_min_messages that high, the client might be interested in the fact that it got side-tracked completing someone else's IO. On the other hand, almost none of those sidetrack events will produce messages. The main argument I'd envision for LOG_SERVER_ONLY is that we consider the message content sensitive, but I don't see the message content as materially sensitive. Since you committed LOG_SERVER_ONLY, let's keep that decision. My last draft review discouraged it, but it doesn't matter. pgaio_result_report() should assert elevel != LOG to avoid future divergence. > - Previously the buffer completion callback checked zero_damaged_pages - but > that's not right, the GUC hopefully is only set on a per-session basis Good catch. I've now audited the complete_shared callbacks for other variable references and actions not acceptable there. I found nothing beyond what you found by v2.14. On Sat, Mar 29, 2025 at 10:48:10AM -0400, Andres Freund wrote: > On 2025-03-29 06:41:43 -0700, Noah Misch wrote: > > On Fri, Mar 28, 2025 at 11:35:23PM -0400, Andres Freund wrote: > Subject: [PATCH v2.14 01/29] Fix mis-attribution of checksum failure stats to > the wrong database I've skipped reviewing this patch, since it's already committed. If it needs post-commit review, let me know. > Subject: [PATCH v2.14 02/29] aio: Implement support for reads in smgr/md/fd > + /* > + * Immediately log a message about the IO error, but only to the > + * server log. The reason to do so immediately is that the originator > + * might not process the query result immediately (because it is busy > + * doing another part of query processing) or at all (e.g. if it was > + * cancelled or errored out due to another IO also failing). The > + * issuer of the IO will emit an ERROR when processing the IO's s/issuer/definer/ please, to avoid proliferating synonyms. Likewise two other places in the patches. > +/* > + * smgrstartreadv() -- asynchronous version of smgrreadv() > + * > + * This starts an asynchronous readv IO using the IO handle `ioh`. Other than > + * `ioh` all parameters are the same as smgrreadv(). I would add a comment starting with: Compared to smgrreadv(), more responsibilities fall on layers above smgr. Higher layers handle partial reads. smgr will ereport(LOG_SERVER_ONLY) some problems, but higher layers are responsible for pgaio_result_report() to mirror that news to the user and (for ERROR) abort the (sub)transaction. md_readv_complete() comment "issuer of the IO will emit an ERROR" says some of that, but someone adding a smgrstartreadv() call is less likely to find it there. I say "comment starting with", because I think there's a remaining decision about who owns the zeroing currently tied to smgrreadv(). An audit of mdreadv() vs. AIO counterparts found this part of mdreadv(): if (nbytes == 0) { /* * We are at or past EOF, or we read a partial block at EOF. * Normally this is an error; upper levels should never try to * read a nonexistent block. However, if zero_damaged_pages * is ON or we are InRecovery, we should instead return zeroes * without complaining. This allows, for example, the case of * trying to update a block that was later truncated away. */ if (zero_damaged_pages || InRecovery) { I didn't write a test to prove its absence, but I'm not finding such code in the AIO path. I wondered if we could just Assert(!InRecovery), but adding that to md_readv_complete() failed 001_stream_rep.pl with this stack: ExceptionalCondition at assert.c:52 md_readv_complete at md.c:2043 pgaio_io_call_complete_shared at aio_callback.c:258 pgaio_io_process_completion at aio.c:515 pgaio_io_perform_synchronously at aio_io.c:148 pgaio_io_stage at aio.c:453 pgaio_io_start_readv at aio_io.c:87 FileStartReadV at fd.c:2243 mdstartreadv at md.c:1005 smgrstartreadv at smgr.c:757 AsyncReadBuffers at bufmgr.c:1938 StartReadBuffersImpl at bufmgr.c:1422 StartReadBuffer at bufmgr.c:1515 ReadBuffer_common at bufmgr.c:1246 ReadBufferExtended at bufmgr.c:818 vm_readbuf at visibilitymap.c:584 visibilitymap_pin at visibilitymap.c:203 heap_xlog_insert at heapam_xlog.c:450 heap_redo at heapam_xlog.c:1195 ApplyWalRecord at xlogrecovery.c:1995 PerformWalRecovery at xlogrecovery.c:1825 StartupXLOG at xlog.c:5895 If this is a real problem, fix options may include: - Implement the InRecovery zeroing for real. - Make the InRecovery case somehow use real mdreadv(), not pgaio_io_perform_synchronously() to use AIO APIs with synchronous AIO. I'll guess this is harder than the previous option, though. > Subject: [PATCH v2.14 04/29] aio: Add pg_aios view > + /* > + * There is no lock that could prevent the state of the IO to advance > + * concurrently - and we don't want to introduce one, as that would > + * introduce atomics into a very common path. Instead we > + * > + * 1) Determine the state + generation of the IO. > + * > + * 2) Copy the IO to local memory. > + * > + * 3) Check if state or generation of the IO changed. If the state > + * changed, retry, if the generation changed don't display the IO. > + */ > + > + /* 1) from above */ > + start_generation = live_ioh->generation; > + pg_read_barrier(); I think "retry:" needs to be here, above start_state assignment. Otherwise, the "live_ioh->state != start_state" test will keep seeing a state mismatch. > + start_state = live_ioh->state; > + > +retry: > + if (start_state == PGAIO_HS_IDLE) > + continue; > Subject: [PATCH v2.14 05/29] localbuf: Track pincount in BufferDesc as well > Subject: [PATCH v2.14 07/29] aio: Add WARNING result status > Subject: [PATCH v2.14 08/29] pgstat: Allow checksum errors to be reported in > critical sections > Subject: [PATCH v2.14 09/29] Add errhint_internal() Ready for commit > Subject: [PATCH v2.14 10/29] bufmgr: Implement AIO read support > Buffer reads executed this infrastructure will report invalid page / checksum > errors / warnings differently than before: s/this/through this/ > + *zeroed_or_error_count = rem_error & ((1 << 7) - 1); > + rem_error >>= 7; These raw "7" are good places to use your new #define values. Likewise in buffer_readv_encode_error(). > + * that was errored or zerored or, if no errors/zeroes, the first ignored s/zerored/zeroed/ > + * enough. If there is an error, the error is the integeresting offset, typo "integeresting" > +/* > + * We need a backend-local completion callback for shared buffers, to be able > + * to report checksum errors correctly. Unfortunately that can only safely > + * happen if the reporting backend has previously called Missing end of sentence. > @@ -144,8 +144,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail > */ There's an outdated comment ending here: /* * Throw a WARNING if the checksum fails, but only after we've checked for * the all-zeroes case. */ > if (checksum_failure) > { > - if ((flags & PIV_LOG_WARNING) != 0) > - ereport(WARNING, > + if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0) > + ereport(flags & PIV_LOG_WARNING ? WARNING : LOG, > Subject: [PATCH v2.14 11/29] Let caller of PageIsVerified() control > ignore_checksum_failure > Subject: [PATCH v2.14 12/29] bufmgr: Use AIO in StartReadBuffers() > Subject: [PATCH v2.14 13/29] aio: Add README.md explaining higher level design > Subject: [PATCH v2.14 14/29] aio: Basic read_stream adjustments for real AIO > Subject: [PATCH v2.14 15/29] read_stream: Introduce and use optional batchmode > support > Subject: [PATCH v2.14 16/29] docs: Reframe track_io_timing related docs as > wait time > Subject: [PATCH v2.14 17/29] Enable IO concurrency on all systems Ready for commit > Subject: [PATCH v2.14 18/29] aio: Add test_aio module I didn't yet re-review the v2.13 or 2.14 changes to this one. That's still in my queue. One thing I noticed anyway: > +# Tests using injection points. Mostly to exercise had IO errors that are s/had/hard/ On Sat, Mar 29, 2025 at 02:25:15PM -0400, Andres Freund wrote: > On 2025-03-29 10:48:10 -0400, Andres Freund wrote: > > Attached is v2.14: > > - push pg_aios view (depends a tiny bit on the smgr/md/fd change above) > > I think I found an issue with this one - as it stands the view was viewable by > everyone. While it doesn't provide a *lot* of insight, it still seems a bit > too much for an unprivileged user to learn what part of a relation any other > user is currently reading. > > There'd be two different ways to address that: > 1) revoke view & function from public, grant to a limited role (presumably > pg_read_all_stats) > 2) copy pg_stat_activity's approach of using something like > > #define HAS_PGSTAT_PERMISSIONS(role) (has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role)) > > on a per-IO basis. No strong opinion. I'm not really worried about any of this information leaking. Nothing in pg_aios comes close to the sensitivity of pg_stat_activity.query. pg_stat_activity is mighty cautious, hiding even stuff like wait_event_type that I wouldn't worry about. Hence, another valid choice is (3) change nothing. Meanwhile, I see substantially less need to monitor your own IOs than to monitor your own pg_stat_activity rows, and even your own IOs potentially reveal things happening in other sessions, e.g. evicting buffers that others read and you never read. So restrictions wouldn't be too painful, and (1) arguably helps privacy more than (2). I'd likely go with (1) today.
pgsql-hackers by date: