Re: new heapcheck contrib module - Mailing list pgsql-hackers

From Robert Haas
Subject Re: new heapcheck contrib module
Date
Msg-id CA+Tgmob2c0eM8+5kzkXaqdc9XbBCkHmtihSOSk-jCzzT4BFhFQ@mail.gmail.com
Whole thread Raw
In response to Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> It will reconnect and retry a command one time on error.  That should cover the case that the connection to the
databasewas merely lost.  If the second attempt also fails, no further retry of the same command is attempted, though
commandsfor remaining relation targets will still be attempted, both for the database that had the error and for other
remainingdatabases in the list. 
>
> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per
relationin db2 before finally moving on to db3.  That seems pretty awful considering how many relations that could be,
butfailing to soldier on in the face of errors seems a strange design for a corruption checking tool. 

That doesn't seem right at all. I think a PQsendQuery() failure is so
remote that it's probably justification for giving up on the entire
operation. If it's caused by a problem with some object, it probably
means that accessing that object caused the whole database to go down,
and retrying the object will take the database down again. Retrying
the object is betting that the user interrupted connectivity between
pg_amcheck and the database but the interruption is only momentary and
the user actually wants to complete the operation. That seems unlikely
to me. I think it's far more probably that the database crashed or got
shut down and continuing is futile.

My proposal is: if we get an ERROR trying to *run* a query, give up on
that object but still try the other ones after reconnecting. If we get
a FATAL or PANIC trying to *run* a query, give up on the entire
operation. If even sending a query fails, also give up.

> In v39, exit(1) is used for all errors which are intended to stop the program.  It is important to recognize that
findingcorruption is not an error in this sense.  A query to verify_heapam() can fail if the relation's checksums are
bad,and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers.  There can be
errorsif the file backing a relation is missing.  There may be other corruption error cases that I have not yet thought
about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them.  As
discussedabove, failing to send the query to the server is not viewed as a reason to exit, either.  It would be hard to
quantifyall the failure modes, but presumably the catalogs for a database could be messed up enough to cause such
failures,and I'm not sure that pg_amcheck should just abort. 

I agree that exit(1) should happen after any error intended to stop
the program. But I think it should also happen at the end of the run
if we hit any problems for which we did not stop, so that exit(0)
means your database is healthy.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Tid scan improvements
Next
From: Tom Lane
Date:
Subject: Re: Some regular-expression performance hacking