Re: [PATCH] Verify Checksums during Basebackups - Mailing list pgsql-hackers
From | Magnus Hagander |
---|---|
Subject | Re: [PATCH] Verify Checksums during Basebackups |
Date | |
Msg-id | CABUevEyPm78CNcSHC0vDQ-AqGMO1oLbcmwn8G_pVjP84dvmVQA@mail.gmail.com Whole thread Raw |
In response to | Re: [PATCH] Verify Checksums during Basebackups (David Steele <david@pgmasters.net>) |
Responses |
Re: [PATCH] Verify Checksums during Basebackups
|
List | pgsql-hackers |
On Fri, Mar 30, 2018 at 5:35 AM, David Steele <david@pgmasters.net> wrote:
On 3/24/18 10:32 AM, Michael Banck wrote:
> Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck:
>> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele:
>>> In my experience actual block errors are relatively rare, so there
>>> aren't likely to be more than a few in a file. More common are
>>> overwritten or transposed files, rogue files, etc. These produce a lot
>>> of output.
>>>
>>> Maybe stop after five?
>
> The attached patch does that, and outputs the total number of
> verification failures of that file after it got sent.
>
>> I'm on board with this, but I have the feeling that this is not a very
>> common pattern in Postgres, or might not be project style at all. I
>> can't remember even seen an error message like that.
>>
>> Anybody know whether we're doing this in a similar fashion elsewhere?
>
> I tried to have look around and couldn't find any examples, so I'm not
> sure that patch should go in. On the other hand, we abort on checksum
> failures usually (in pg_dump e.g.), so limiting the number of warnings
> does makes sense.
>
> I guess we need to see what others think.
Well, at this point I would say silence more or less gives consent.
Can you provide a rebased patch with the validation retry and warning
limiting logic added? I would like to take another pass through it but I
think this is getting close.
I was meaning to mention it, but ran out of cycles.
I think this is the right way to do it, except the 5 should be a #define and not an inline hardcoded value :) We could argue whether it should be "5 total" or "5 per file". When I read the emails I thought it was going to be 5 total, but I see the implementation does 5 / file. In a super-damanged system that will still lead to horrible amounts of logging, but I think maybe if your system is in that bad shape, then it's a lost cause anyway.
I also think the "total number of checksum errors" should be logged if they're >0, not >5. And I think *that* one should be logged at the end of the entire process, not per file. That'd be the kind of output that would be the most interesting, I think (e.g. if I have it spread out with 1 block each across 4 files, I want that logged at the end because it's easy to otherwise miss one or two of them that may have happened a long time apart).
I don't think we have a good comparison elsewhere, and that is as David mention because other codepaths fail hard when they run into something like that. And we explicitly want to *not* fail hard, per previous discussion.
pgsql-hackers by date: