Re: Online verification of checksums - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Online verification of checksums
Date
Msg-id 20201120160827.GY16415@tamriel.snowman.net
Whole thread Raw
In response to Re: Online verification of checksums  (David Steele <david@pgmasters.net>)
Responses Re: Online verification of checksums  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Greetings,

* David Steele (david@pgmasters.net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain os salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed.  I am adding Stephen as well in CC.  From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum.  So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pg_lsn checks :/
>
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.

Yeah, it's been on our backlog to improve this.

> Our current plan for pgBackRest:
>
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.

Yup, that's my recollection also as to our plans for how to improve
things here.

> These do completely rule out any type of corruption, but they certainly
> narrows the possibility by a lot.

*don't :)

> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.

Yeah, that'd certainly be nice to do too.

> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.

Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.

> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".

+1.

Thanks,

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Disable WAL logging to speed up data loading
Next
From: Robert Haas
Date:
Subject: Re: VACUUM (DISABLE_PAGE_SKIPPING on)