Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Online checksums verification in the backend
Date
Msg-id 20201019075248.GD9612@paquier.xyz
Whole thread Raw
In response to Re: Online checksums verification in the backend  (Julien Rouhaud <rjuju123@gmail.com>)
Responses Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote:
> On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:
>> Somewhat related to the first point, NoComputedChecksum exists
>> because, as the current patch is shaped, we need to report an existing
>> checksum to the user even for the zero-only case.
>
> I'm not sure that I understand your point.  The current patch only
> returns something to users when there's a corruption.  If by
> "zero-only case" you mean "page corrupted in a way that PageIsNew()
> returns true while not being all zero", then it's a corrupted page and
> then obviously yes it needs to be returned to users.

Sorry for the confusion, this previous paragraph was confusing.  I
meant that the reason why NoComputedChecksum exists is that we give up
on attempting to calculate the checksum if we detect that the page is
new, but failed the zero-only test, and that we want the users to know
about this special case by setting this expected checksum to NULL for
the SRF.

>>  So, wouldn't it be better to just rely on PageIsVerified()
>> (well it's rather-soon-to-be extended flavor) for the checksum check,
>> the header sanity check and the zero-only check?  My point is to keep
>> a single entry point for all the page sanity checks, so as base
>> backups, your patch, and the buffer manager apply the same things.
>> Base backups got that partially wrong because the base backup code
>> wants to keep control of the number of failures and the error
>> reports.
>
> I'm fine with it.

Thanks.

>>  Your patch actually wishes to report a failure, but you want
>> to add more context with the fork name and such.  Another option we
>> could use here is to add an error context so as PageIsVerified()
>> reports the WARNING, but the SQL function provides more context with
>> the block number and the relation involved in the check.
>
> Also, returning actual data rather than a bunch of warnings is way
> easier to process for client code.  And as mentioned previously having
> an API that returns a list of corrupted blocks could be useful for a
> single-page recovery feature.

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks.  For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed.  Once the user knows that both do
not match, he/she knows that there is a problem.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Resetting spilled txn statistics in pg_stat_replication
Next
From: Dmitry Shulga
Date:
Subject: Re: Reduce the time required for a database recovery from archive.