Re: AIO v2.5 - Mailing list pgsql-hackers

From Noah Misch
Subject Re: AIO v2.5
Date
Msg-id 20250325141120.8e.nmisch@google.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Mon, Mar 24, 2025 at 10:52:19PM -0400, Andres Freund wrote:
> On 2025-03-24 19:20:37 -0700, Noah Misch wrote:
> > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
> > > +static pg_attribute_always_inline PgAioResult
> > > +buffer_readv_complete_one(uint8 buf_off, Buffer buffer, uint8 flags,
> > > +                          bool failed, bool is_temp)
> > > +{
> > ...
> > > +        if ((flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
> > > +        {
> > > +            ereport(WARNING,
> > > +                    (errcode(ERRCODE_DATA_CORRUPTED),
> > > +                     errmsg("invalid page in block %u of relation %s; zeroing out page",
> > 
> > My earlier review requested s/LOG/WARNING/, but I wasn't thinking about this
> > in full depth. In the !is_temp case, this runs in a complete_shared
> > callback.  A process unrelated to the original IO may run this callback.
> > That's unfortunate in two ways.  First, that other process's client gets an
> > unexpected WARNING.  The process getting the WARNING may not even have
> > zero_damaged_pages enabled.  Second, the client of the process that staged
> > the IO gets no message.
> 
> Ah, right. That could be why I had flipped it. If so, shame on me for not
> adding a comment...
> 
> 
> > AIO ERROR-level messages handle this optimally.  We emit a LOG-level message
> > in the process that runs the complete_shared callback, and we arrange for the
> > ERROR-level message in the stager.  That would be ideal here: LOG in the
> > complete_shared runner, WARNING in the stager.
> 
> We could obviously downgrade (crossgrade? A LOG is more severe than a LOG in
> some ways, but not others) the message when run in a different backend fairly
> easily.  Still emitting a WARNING in the stager however is a bit more tricky.
> 
> Before thinking more deeply about how we could emit WARNING in the stage:
> 
> Is it actually sane to use WARNING here? At least for ZERO_ON_ERROR that could
> trigger a rather massive flood of messages to the client in a *normal*
> situation. I'm thinking of something like an insert extending a relation some
> time after an immediate restart and encountering a lot of FSM corruption (due
> to its non-crash-safe-ness) during the search for free space and the
> subsequent FSM vacuum.  It might be ok to LOG that, but sending a lot of
> WARNINGs to the client seems not quite right.

Orthogonal to AIO, I do think LOG (or even DEBUG1?) is better for
ZERO_ON_ERROR.  The ZERO_ON_ERROR case also should not use
ERRCODE_DATA_CORRUPTED.  (That errcode shouldn't appear for business as usual.
It should signify wrong or irretrievable query results, essentially.)

For zero_damaged_pages, WARNING seems at least defensible, and
ERRCODE_DATA_CORRUPTED is right.  It wouldn't be the worst thing to change
zero_damaged_pages to LOG and let the complete_shared runner log it, as long
as we release-note that.  It's superuser-only, and the superuser can learn to
check the log.  One typically should use zero_damaged_pages in one session at
a time, so the logs won't be too confusing.

Another thought on complete_shared running on other backends: I wonder if we
should push an ErrorContextCallback that adds "CONTEXT: completing I/O of
other process" or similar, so people wonder less about how "SELECT FROM a" led
to a log message about IO on table "b".

> If we want to implement it, I think we could introduce PGAIO_RS_WARN, which
> then could tell the stager to issue the WARNING. It would add a bit of
> distributed cost, both to callbacks and users of AIO, but it might not be too
> bad.
> 
> 
> > One could simplify things by forcing io_method=sync under ZERO_ON_ERROR ||
> > zero_damaged_pages, perhaps as a short-term approach.
> 
> Yea, that could work.  Perhaps even just for zero_damaged_pages, after
> changing it so that ZERO_ON_ERROR always just LOGs.

Yes.

> Hm, it seems somewhat nasty to have rather different performance
> characteristics when forced to use zero_damaged_pages to recover from a
> problem. Imagine an instance that's configured to use DIO and then needs to
> use zero_damaged_pages to recove from corruption...

True.  I'd be willing to bet high-scale use of zero_damaged_pages is rare.  By
high scale, I mean something like reading a whole large table, as opposed to a
TID scan of the known-problematic range.  That said, people (including me)
expect the emergency tools to be good even if they're used rarely.  You're not
wrong to worry about it.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Support for NO INHERIT to INHERIT state change with named NOT NULL constraints
Next
From: Daniel Gustafsson
Date:
Subject: Re: Enhancing Memory Context Statistics Reporting