Re: Online verification of checksums - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Online verification of checksums
Date
Msg-id 20190306174204.mynxidkfvcw7nadg@alap3.anarazel.de
Whole thread Raw
In response to Re: Online verification of checksums  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Online verification of checksums
Re: Online verification of checksums
List pgsql-hackers
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
> > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael.banck@credativ.de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> 
> Yeah.  Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster.  They're going to see failures every day
> or three and go ballistic.

+1


> I suspect that better retry logic might help here.  I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems.  But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.

Greetings,

Andres Freund


pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: proposal: variadic argument support for least, greatest function
Next
From: Alvaro Herrera
Date:
Subject: Re: Server Crash in logical decoding if used inside --single mode