Re: Online verification of checksums - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Online verification of checksums
Msg-id: b15d1e0b-2e66-1cb8-65e0-dc51b4fe7d2f@2ndquadrant.com
In response to: Re: Online verification of checksums (Andres Freund <andres@anarazel.de>)
Responses: Re: Online verification of checksums
List: pgsql-hackers
On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
>>> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
>>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>>> <michael.banck@credativ.de> wrote:
>>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>>> that point we bail out with exit(1), and the user can try again, which I
>>>>> think is fine?
>>>>
>>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>>> robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we
>>> march on. Otherwise, we now just issue a warning and skip the file (or
>>> so was the idea, see below), do you think that is not acceptable?
>>
>> Yeah. Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster. They're going to see failures every day
>> or three and go ballistic.
>
> +1
>
>> I suspect that better retry logic might help here. I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems. But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
>
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work, I think - both for partial reads and torn pages. Not
sure how to integrate it into the CLI tool, though. Perhaps it could
require connection info so that it can execute a function, when executed
in online mode?

cheers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
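[Editorial sketch] The bounded-retry approach Robert describes - re-reading the block a fixed number of times with a short sleep between attempts, and only reporting a checksum failure if the mismatch persists - can be illustrated with a minimal stand-alone sketch. This is not PostgreSQL source code: `page_checksum` here is a hypothetical stand-in for the real per-page checksum, and `MAX_RETRIES` mirrors the "10 retries at 1 second intervals" suggestion from the thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ      8192   /* PostgreSQL's default block size */
#define MAX_RETRIES 10     /* retries suggested upthread */

/* Hypothetical stand-in for the real page checksum; any stable
 * per-page checksum works for the purpose of this sketch. */
static uint16_t
page_checksum(const char *page)
{
    uint32_t sum = 0;

    for (int i = 0; i < BLCKSZ; i++)
        sum = sum * 31 + (unsigned char) page[i];
    return (uint16_t) sum;
}

/* Re-read the block at the given offset until the computed checksum
 * matches the expected one, sleeping 1 second between attempts so a
 * concurrent write (torn page) or short read can complete.  Returns
 * true on success, false after MAX_RETRIES persistent failures. */
static bool
verify_block_with_retry(FILE *fp, long offset, uint16_t expected)
{
    char page[BLCKSZ];

    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
    {
        if (attempt > 0)
            sleep(1);           /* wait out the in-flight write */

        if (fseek(fp, offset, SEEK_SET) != 0)
            return false;       /* seek failure is not retryable */

        if (fread(page, 1, BLCKSZ, fp) != BLCKSZ)
            continue;           /* partial read: retry */

        if (page_checksum(page) == expected)
            return true;        /* transient mismatch resolved */
    }
    return false;               /* persistent mismatch: report it */
}
```

Note that this only shrinks the false-positive window; it does not close it the way the IO-lock-then-retry approach Andres proposes would, since nothing here prevents the page from being rewritten between the re-read attempts.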