Re: AIO v2.5 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: AIO v2.5
Date
Msg-id ym3dqpa4xcvoeknewcw63x77vnqdosbqcetjinb2zfoh65k55m@m4ozmwhr6lk6
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
Hi,

On 2025-03-28 08:54:42 -0400, Andres Freund wrote:
> On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > > - We need to register a local callback for shared buffer reads, which don't
> > >   need them today . That's a small bit of added overhead. It's a shame to do
> > >   so for counters that approximately never get incremented.
> >
> > Fair concern.  An idea is to let the complete_shared callback change the
> > callback list associated with the IO, so it could change
> > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
> > latter would differ from the former only in having the extra local callback.
> > Could that help?  I think the only overhead is using more PGAIO_HCB numbers.
>
> I think changing the callback could work - I'll do some measurements in a
> coffee or two, but I suspect the overhead is not worth being too worried about
> for now.  There's a different aspect that worries me slightly more, see
> further down.
> ...
> Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
> local hashtable. That's more expensive than an indirect function call
> (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
> not we can apply a mini-cache for the current database, which is surely the
> only thing that ever matters for performance.

I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a
difference of either having the unnecessary complete_local callback or having
the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are
a few hits inside pgstat_get_entry_ref(), but not enough to matter.

Hence I think this isn't worth worrying about, at least for now. I think we
have far bigger fish to fry at this point than such a small performance
difference.

I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not
mention the overhead. I'm still inclined to think that it's better to call it
in the shared completion callback.


I also fixed support and added tests for ignore_checksum_failure, that also
needs to be determined at the start of the IO, not in the completion.  Once
more there were no tests, of course.


I spent the last 6 hours on the stupid error/warning messages around this,
somewhat ridiculous.

The number of combinations is annoyingly large. It's e.g. plausible to use
ignore_checksum_failure=on and zero_damaged_pages=on at the same time for
recovery. The same buffer could both be ignored *and* zeroed. Or somebody
could use ignore_checksum_failure=on but then still encounter a page that is
invalid.

But I finally got to a point where the code ends up readable, without undue
duplication.  It would, leaving some nasty hack aside, require a
errhint_internal() - but I can't imagine a reason against introducing that,
given we have it for the errmsg and errhint.

Here's the relevant code:

    /*
     * Treat a read that had both zeroed buffers *and* ignored checksums as a
     * special case, it's too irregular to be emitted the same way as the other
     * cases.
     */
    if (zeroed_any && ignored_any)
    {
        Assert(zeroed_any && ignored_any);
        Assert(nblocks > 1);    /* same block can't be both zeroed and ignored */
        Assert(result.status != PGAIO_RS_ERROR);
        affected_count = zeroed_or_error_count;

        ereport(elevel,
                errcode(ERRCODE_DATA_CORRUPTED),
                errmsg("zeroing %u pages and ignoring %u checksum failures among blocks %u..%u of relation %s",
                       affected_count, checkfail_count, first, last, rpath.str),
                affected_count > 1 ?
                errdetail("Block %u held first zeroed page.",
                          first + first_off) : 0,
                errhint("See server log for details about the other %u invalid blocks.",
                        affected_count + checkfail_count - 1));
        return;
    }

    /*
     * The other messages are highly repetitive. To avoid duplicating a long
     * and complicated ereport(), gather the translated format strings
     * separately and then do one common ereport.
     */
    if (result.status == PGAIO_RS_ERROR)
    {
        Assert(!zeroed_any);    /* can't have invalid pages when zeroing them */
        affected_count = zeroed_or_error_count;
        msg_one = _("invalid page in block %u of relation %s");
        msg_mult = _("%u invalid pages among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first invalid page.");
        hint_mult = _("See server log for the other %u invalid blocks.");
    }
    else if (zeroed_any && !ignored_any)
    {
        affected_count = zeroed_or_error_count;
        msg_one = _("invalid page in block %u of relation %s; zeroing out page");
        msg_mult = _("zeroing out %u invalid pages among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first zeroed page.");
        hint_mult = _("See server log for the other %u zeroed blocks.");
    }
    else if (!zeroed_any && ignored_any)
    {
        affected_count = checkfail_count;
        msg_one = _("ignoring checksum failure in block %u of relation %s");
        msg_mult = _("ignoring %u checksum failures among blocks %u..%u of relation %s");
        det_mult = _("Block %u held first ignored page.");
        hint_mult = _("See server log for the other %u ignored blocks.");
    }
    else
        pg_unreachable();

    ereport(elevel,
            errcode(ERRCODE_DATA_CORRUPTED),
            affected_count == 1 ?
            errmsg_internal(msg_one, first + first_off, rpath.str) :
            errmsg_internal(msg_mult, affected_count, first, last, rpath.str),
            affected_count > 1 ? errdetail_internal(det_mult, first + first_off) : 0,
            affected_count > 1 ? errhint_internal(hint_mult, affected_count - 1) : 0);

Does that approach make sense?

What do you think about using
  "zeroing invalid page in block %u of relation %s"
instead of
  "invalid page in block %u of relation %s; zeroing out page"

I thought about instead translating "ignoring", "ignored", "zeroing",
"zeroed", etc separately, but I have doubts about how well that would actually
translate.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: Statistics Import and Export
Next
From: Alexey Makhmutov
Date:
Subject: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues