Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: Online checksums verification in the backend
Date
Msg-id CAOBaU_YZTowoJJ+NzsBieV0K5OvpOS+dAMD1cmz25hxtVZpQqA@mail.gmail.com
Whole thread Raw
In response to Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Online checksums verification in the backend
List pgsql-hackers
On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote:
> > And Michael just told me that I also missed adding one of the C files
> > while splitting the patch into two.
>
> +   if (PageIsNew(page))
> +   {
> +       /*
> +        * Check if the page is really new or if there's corruption that
> +        * affected PageIsNew detection.  Note that PageIsVerified won't try to
> +        * detect checksum corruption in this case, so there's no risk of
> +        * duplicated corruption report.
> +        */
> +       if (PageIsVerified(page, blkno))
> +       {
> +           /* No corruption. */
> +           return true;
> +       }
> Please note that this part of your patch overlaps with a proposal for
> a bug fix related to zero-only pages with the checksum verification of
> base backups:
> https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru
>
> Your patch is trying to adapt itself to the existing logic we have in
> PageIsVerified() so as you don't get a duplicated report, as does the
> base backup part.  Note that I cannot find in the wild any open code
> making use of PageIsVerified(), but I'd like to believe that it is
> rather handy to use for table AMs at least (?), so if we can avoid any
> useless ABI breakage, it may be better to have a new
> PageIsVerifiedExtended() able to take additional options, one to
> report to pgstat and one to generate this elog(WARNING).  And then
> this patch could just make use of it?

Indeed, that would be great.

> +       /*
> +        * There's corruption, but since this affects PageIsNew, we
> +        * can't compute a checksum, so set NoComputedChecksum for the
> +        * expected checksum.
> +        */
> +       *chk_expected = NoComputedChecksum;
> +       *chk_found = hdr->pd_checksum;
> +       return false;
> [...]
> +       /*
> +        * This can happen if corruption makes the block appears as
> +        * PageIsNew() but isn't a new page.
> +        */
> +       if (chk_expected == NoComputedChecksum)
> +           nulls[i++] = true;
> +       else
> +           values[i++] = UInt16GetDatum(chk_expected);
> Somewhat related to the first point, NoComputedChecksum exists
> because, as the current patch is shaped, we need to report an existing
> checksum to the user even for the zero-only case.


I'm not sure that I understand your point.  The current patch only
returns something to users when there's a corruption.  If by
"zero-only case" you mean "page corrupted in a way that PageIsNew()
returns true while not being all zero", then it's a corrupted page and
then obviously yes it needs to be returned to users.

>  PageIsVerified() is
> not that flexible so we could change it to report a status depending
> on the error faced (checksum, header or zero-only) on top of getting a
> checksum.  Now, I am not completely sure either that it is worth the
> complication to return in the SRF of the check function the expected
> checksum.

It seemed to me that it could be something useful to get with this
kind of tool.  You may be able to recover a corrupted page from
backup/WAL if the checksum itself wasn't corrupted so that you know
what to look for.  There would be a lot of caveats and low level work,
but if you're desperate enough that may save you a bit of time.

>  So, wouldn't it be better to just rely on PageIsVerified()
> (well it's rather-soon-to-be extended flavor) for the checksum check,
> the header sanity check and the zero-only check?  My point is to keep
> a single entry point for all the page sanity checks, so as base
> backups, your patch, and the buffer manager apply the same things.
> Base backups got that partially wrong because the base backup code
> wants to keep control of the number of failures and the error
> reports.

I'm fine with it.

>  Your patch actually wishes to report a failure, but you want
> to add more context with the fork name and such.  Another option we
> could use here is to add an error context so as PageIsVerified()
> reports the WARNING, but the SQL function provides more context with
> the block number and the relation involved in the check.

Also, returning actual data rather than a bunch of warnings is way
easier to process for client code.  And as mentioned previously having
an API that returns a list of corrupted blocks could be useful for a
single-page recovery feature.



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Internal key management system
Next
From: Tatsuo Ishii
Date:
Subject: Re: Implementing Incremental View Maintenance