Re: [patch] Fix checksum verification in base backups for zero page headers - Mailing list pgsql-hackers

From Michael Banck
Subject Re: [patch] Fix checksum verification in base backups for zero page headers
Date
Msg-id 4b1c16216dcb60e1dad96142d25fd06fb77485bf.camel@credativ.de
Whole thread Raw
In response to Re: [patch] Fix checksum verification in base backups for zero page headers  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses Re: [patch] Fix checksum verification in base backups for zero page headers  (Michael Banck <michael.banck@credativ.de>)
List pgsql-hackers
Hi,

Am Mittwoch, den 02.09.2020, 16:50 +0300 schrieb Anastasia Lubennikova:
> On 01.09.2020 13:22, Michael Banck wrote:
> > as a continuation of [1], I've split-off the zero page header case from
> > the last patch, as this one seems less contentious.
> > [1] https://commitfest.postgresql.org/28/2308/
> 
> I've looked through the previous discussion. As far as I got it, most of 
> the controversy was about online checksums improvements.
> 
> The warning about pd_upper inconsistency that you've added is a good 
> addition. The patch is a bit messy, though, because a huge code block 
> was shifted.
> 
> Will it be different, if you just leave
>      "if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
> block as it was, and add
>      "else if (PageIsNew(page) && !PageIsZero(page))" ?

Thanks, that does indeed look better as a patch and I think it's fine
as-is for the code as well, I've attached a v2.

> While on it, I also have a few questions about the code around:

All fair points, but I think those should be handled in another patch,
if any.

> 1) Maybe we can use other header sanity checks from PageIsVerified() as 
> well? Or even the function itself.

Yeah, I'll keep that in mind.

> 2) > /* Reset loop to validate the block again */
> How many times do we try to reread the block? Is one reread enough?

Not sure whether more rereads would help, but I think the general
direction was to replace this with something better anyway I think (see
below).

> Speaking of which, 'reread_cnt' name looks confusing to me. I would 
> expect that this variable contains a number of attempts, not the number 
> of bytes read.

Well, there are other "cnt" variables that keep the number of bytes read
in that file, so it does not seem to be out of place for me. A comment
might not hurt though.

On second glance, it should maybe also be of size_t and not int type, as
is the other cnt variable?

> If everyone agrees, that for basebackup purpose it's enough to rely on a 
> single reread, I'm ok with it.
> Another approach is to read the page directly from shared buffers to 
> ensure that the page is fine. This way is more complicated, but should 
> be almost bullet-proof. Using it we can also check pages with lsn >= 
> startptr.

Right, that's what Andres also advocated for initially I think, and what
should be done going forward.

> 3) Judging by warning messages, we count checksum failures per file, not 
> per page, and don't report after a fifth failure. Why so?  Is it a 
> common case that so many pages of one file are corrupted?

I think this was a request during the original review, on the basis that
if we have more than a few checksum errors, it's likely there is
something fundamentally broken and it does not make sense to spam the
log with potentially millions of broken checksums.


Cheers,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pgindent vs dtrace on macos
Next
From: Michael Banck
Date:
Subject: Re: [patch] Fix checksum verification in base backups for zero page headers