Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers

From David Steele
Subject Re: [PATCH] Verify Checksums during Basebackups
Date
Msg-id 1a230b05-6cfc-2557-20cc-09e7a176230c@pgmasters.net
In response to Re: [PATCH] Verify Checksums during Basebackups  (Michael Banck <michael.banck@credativ.de>)
Responses Re: [PATCH] Verify Checksums during Basebackups  (Michael Banck <michael.banck@credativ.de>)
List pgsql-hackers
Hi Michael,

On 3/23/18 5:36 AM, Michael Banck wrote:
> Am Donnerstag, den 22.03.2018, 12:22 -0400 schrieb David Steele:
>>
>> +    if (phdr->pd_checksum != checksum)
>>
>> I've attached a patch that adds basic retry functionality.  It's not
>> terribly efficient, since it rereads the entire buffer for any block
>> error.  A better way is to keep a bitmap for each block in the buffer,
>> then on retry compare bitmaps.  If the block is still bad, report it.
>> If the block was corrected, move on.  If a block was good before but
>> is bad on retry, it can be ignored.
> 
> I have to admit I find it a bit convoluted and non-obvious on first
> reading, but I'll try to check it out some more.

Yeah, I think I was influenced too much by how pgBackRest does things,
which doesn't work as well here.  Attached is a simpler version.
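The simpler approach amounts to rereading a block once when its checksum fails, on the theory that the first read raced with a concurrent write (a torn read) rather than hitting real corruption. A toy, self-contained sketch of that idea; the checksum function and block I/O here are illustrative stand-ins, not PostgreSQL's pg_checksum_page() or the attached patch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSZ 16              /* toy block size; PostgreSQL uses 8192 */

static uint8_t good_block[BLOCKSZ];  /* the block's stable on-disk state */
static int     read_count;           /* lets the stub simulate a torn first read */

/* Toy checksum standing in for pg_checksum_page(): XOR of all bytes. */
static uint8_t
toy_checksum(const uint8_t *buf)
{
    uint8_t sum = 0;

    for (int i = 0; i < BLOCKSZ; i++)
        sum ^= buf[i];
    return sum;
}

/*
 * Simulated block read: the first read is "torn" (one byte flipped, as if
 * a concurrent write was in progress); later reads return consistent data.
 */
static bool
read_block(uint8_t *buf)
{
    memcpy(buf, good_block, BLOCKSZ);
    if (read_count++ == 0)
        buf[3] ^= 0xFF;         /* corrupt only the first read */
    return true;
}

/*
 * Verify a block's checksum; on mismatch, reread once before declaring the
 * block corrupt, so a block torn by a concurrent write is not reported as
 * a false positive.
 */
static bool
verify_block_with_retry(uint8_t *buf, uint8_t stored)
{
    if (toy_checksum(buf) == stored)
        return true;
    if (!read_block(buf))       /* reread once and recheck */
        return false;
    return toy_checksum(buf) == stored;
}
```

In the real backend the stored checksum comes from the page header (phdr->pd_checksum) and must be re-extracted from the reread buffer, since a torn read can garble the header too.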

> I think it would be very useful if we could come up with a testcase
> showing this problem, but I guess this will be quite hard to hit
> reproducibly, right?

This was brought up by Robert in [1] when discussing validating
checksums during backup.  I don't know of any way to reproduce this
issue but it seems perfectly possible, if highly unlikely.

>> +    ereport(WARNING,
>> +        (errmsg("checksum verification failed in file "
>>
>> I'm worried about how verbose this warning could be since there are
>> 131,072 blocks per segment.  It's unlikely to have that many block
>> errors, but users do sometimes put files in PGDATA which look like they
>> should be validated.  Since these warnings all go to the server log it
>> could get pretty bad.
> 
> We only verify checksums of files in tablespaces, and I don't think
> dropping random files in those is supported in any way, as opposed to
> maybe the top-level PGDATA directory. So I would say that this is not a
> real concern.

Perhaps, but a very corrupt file is still going to spew lots of warnings
into the server log.

>> We should either stop warning after the first failure, or aggregate the
>> failures for a file into a single message.
> 
> I agree that major corruption could make the whole output blow up but I
> would prefer to keep this feature simple for now, which implies possibly
> printing out a lot of WARNINGs, or maybe just stopping after the first
> one (or the first few, dunno).

In my experience actual block errors are relatively rare, so there
aren't likely to be more than a few in a file.  More common are
overwritten or transposed files, rogue files, etc.  These produce a lot
of output.

Maybe stop after five?
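One hypothetical way to cap the noise: count failures per file and stop emitting the WARNING after a fixed limit, logging a single suppression notice when the cap is hit. This is a sketch of the idea only; MAX_CHECKSUM_WARNINGS and the function below are made-up names, not code from the patch, and plain fprintf() stands in for ereport(WARNING, ...):

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_CHECKSUM_WARNINGS 5 /* illustrative per-file cap */

/*
 * Report one checksum failure for a file; returns true if a message was
 * actually emitted.  Once the cap is exceeded, emit a single suppression
 * notice and stay silent for the rest of the file, so a badly corrupt
 * file cannot flood the server log.
 */
static bool
report_checksum_failure(const char *file, int blkno, int *failures)
{
    (*failures)++;

    if (*failures <= MAX_CHECKSUM_WARNINGS)
    {
        fprintf(stderr,
                "WARNING: checksum verification failed in file \"%s\", block %d\n",
                file, blkno);
        return true;
    }
    if (*failures == MAX_CHECKSUM_WARNINGS + 1)
    {
        fprintf(stderr,
                "WARNING: suppressing further checksum failure messages for file \"%s\"\n",
                file);
        return true;
    }
    return false;               /* silently counted */
}
```

The per-file counter would be reset when the backup moves on to the next file, and the final failure count could still be reported in a summary message.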

Regards,
-- 
-David
david@pgmasters.net

[1]
https://www.postgresql.org/message-id/CA%2BTgmobHd%2B-yVJHofSWg%3Dg%2B%3DA3EiCN2wsAiEyj7dj1hhevNq9Q%40mail.gmail.com

