Re: new heapcheck contrib module - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: new heapcheck contrib module |
Date | |
Msg-id | CA+Tgmob2c0eM8+5kzkXaqdc9XbBCkHmtihSOSk-jCzzT4BFhFQ@mail.gmail.com Whole thread Raw |
In response to | Re: new heapcheck contrib module (Mark Dilger <mark.dilger@enterprisedb.com>) |
List | pgsql-hackers |
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > It will reconnect and retry a command one time on error. That should cover the case that the connection to the databasewas merely lost. If the second attempt also fails, no further retry of the same command is attempted, though commandsfor remaining relation targets will still be attempted, both for the database that had the error and for other remainingdatabases in the list. > > Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relationin db2 before finally moving on to db3. That seems pretty awful considering how many relations that could be, butfailing to soldier on in the face of errors seems a strange design for a corruption checking tool. That doesn't seem right at all. I think a PQsendQuery() failure is so remote that it's probably justification for giving up on the entire operation. If it's caused by a problem with some object, it probably means that accessing that object caused the whole database to go down, and retrying the object will take the database down again. Retrying the object is betting that the user interrupted connectivity between pg_amcheck and the database but the interruption is only momentary and the user actually wants to complete the operation. That seems unlikely to me. I think it's far more probably that the database crashed or got shut down and continuing is futile. My proposal is: if we get an ERROR trying to *run* a query, give up on that object but still try the other ones after reconnecting. If we get a FATAL or PANIC trying to *run* a query, give up on the entire operation. If even sending a query fails, also give up. > In v39, exit(1) is used for all errors which are intended to stop the program. It is important to recognize that findingcorruption is not an error in this sense. A query to verify_heapam() can fail if the relation's checksums are bad,and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers. There can be errorsif the file backing a relation is missing. There may be other corruption error cases that I have not yet thought about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them. As discussedabove, failing to send the query to the server is not viewed as a reason to exit, either. It would be hard to quantifyall the failure modes, but presumably the catalogs for a database could be messed up enough to cause such failures,and I'm not sure that pg_amcheck should just abort. I agree that exit(1) should happen after any error intended to stop the program. But I think it should also happen at the end of the run if we hit any problems for which we did not stop, so that exit(0) means your database is healthy. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: