Re: Online verification of checksums - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Online verification of checksums
Msg-id: b15d1e0b-2e66-1cb8-65e0-dc51b4fe7d2f@2ndquadrant.com
In response to: Re: Online verification of checksums (Andres Freund <andres@anarazel.de>)
Responses: Re: Online verification of checksums
List: pgsql-hackers
On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.banck@credativ.de> wrote:
>>> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
>>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>>> <michael.banck@credativ.de> wrote:
>>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>>> that point we bail out with exit(1), and the user can try again, which I
>>>>> think is fine?
>>>>
>>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>>> robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we
>>> march on. Otherwise, we now just issue a warning and skip the file (or
>>> so was the idea, see below), do you think that is not acceptable?
>>
>> Yeah. Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster. They're going to see failures every day
>> or three and go ballistic.
>
> +1
>
>> I suspect that better retry logic might help here. I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems. But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
>
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work, I think - both for partial reads and torn pages. Not
sure how to integrate it into the CLI tool, though. Perhaps it could
require connection info so that it can execute a function, when executed
in online mode?

cheers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
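[Editorial sketch] The bounded-retry approach Robert describes - re-reading the block a fixed number of times with a short sleep between attempts, and only reporting a checksum failure if the mismatch persists - can be illustrated with a minimal stand-alone sketch. This is not PostgreSQL source code: `page_checksum` here is a hypothetical stand-in for the real per-page checksum, and `MAX_RETRIES` mirrors the "10 retries at 1 second intervals" suggestion from the thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ      8192   /* PostgreSQL's default block size */
#define MAX_RETRIES 10     /* retries suggested upthread */

/* Hypothetical stand-in for the real page checksum; any stable
 * per-page checksum works for the purpose of this sketch. */
static uint16_t
page_checksum(const char *page)
{
    uint32_t sum = 0;

    for (int i = 0; i < BLCKSZ; i++)
        sum = sum * 31 + (unsigned char) page[i];
    return (uint16_t) sum;
}

/* Re-read the block at the given offset until the computed checksum
 * matches the expected one, sleeping 1 second between attempts so a
 * concurrent write (torn page) or short read can complete.  Returns
 * true on success, false after MAX_RETRIES persistent failures. */
static bool
verify_block_with_retry(FILE *fp, long offset, uint16_t expected)
{
    char page[BLCKSZ];

    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
    {
        if (attempt > 0)
            sleep(1);           /* wait out the in-flight write */

        if (fseek(fp, offset, SEEK_SET) != 0)
            return false;       /* seek failure is not retryable */

        if (fread(page, 1, BLCKSZ, fp) != BLCKSZ)
            continue;           /* partial read: retry */

        if (page_checksum(page) == expected)
            return true;        /* transient mismatch resolved */
    }
    return false;               /* persistent mismatch: report it */
}
```

Note that this only shrinks the false-positive window; it does not close it the way the IO-lock-then-retry approach Andres proposes would, since nothing here prevents the page from being rewritten between the re-read attempts.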