Re: Online verification of checksums - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Online verification of checksums
Date
Msg-id 20201116002324.GB2656@paquier.xyz
Whole thread Raw
In response to Re: Online verification of checksums  (Magnus Hagander <magnus@hagander.net>)
Responses Re: Online verification of checksums  (Magnus Hagander <magnus@hagander.net>)
List pgsql-hackers
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pg_upper are trashed by a corruption of the page header.  Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong.  Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should.  However, it has also the risk to cause pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slow a disk is able to finish a 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed.  Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() would pass, but the checksum verification would fail as
>>> the contents at the end of the page does not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
>
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there was enough different checksum-involved-people
> on those threads to handle it :) Meaning the short answer is "no, I don't
> really have one at this point".
>
> Slightly longer comment is that it does seem reasonable, but I have not
> read in on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.

Which part are you considering as reasonable?  The removal-feature
part on a stable branch or perhaps something else?
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Move OpenSSL random under USE_OPENSSL_RANDOM
Next
From: Michael Paquier
Date:
Subject: Re: Remove unused variable from SharedSort