Re: new heapcheck contrib module - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: new heapcheck contrib module
Date
Msg-id CAH2-Wz=AnzG_KvoR-KairFaZJavSzzDKfv2h30aJxM=RYxuUTg@mail.gmail.com
Whole thread Raw
In response to Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: new heapcheck contrib module  (Robert Haas <robertmhaas@gmail.com>)
Re: new heapcheck contrib module  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Tue, May 12, 2020 at 7:07 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Thank you yet again for reviewing.  I really appreciate the feedback!

Happy to help. It's important work.

> Ok, I take your point that the code cannot soldier on after the first error is returned.  I'll change that for v6 of
thepatch, moving on to the next relation after hitting the first corruption in any particular index.  Do you mind that
Irefactored the code to return the error rather than ereporting? 

try/catch seems like the way to do it. Not all amcheck errors come
from amcheck -- some are things that the backend code does, that are
known to appear in amcheck from time to time. I'm thinking in
particular of the
table_index_build_scan()/heapam_index_build_range_scan() errors, as
well as the errors from _bt_checkpage().

> Yes, I agree that reindexing is the most sensible remedy.  I certainly have no plans to implement some pg_fsck_index
typetool.  Even for tables, I'm not interested in creating such a tool. I just want a good tool for finding out what
thenature of the corruption is, as that might make it easier to debug what went wrong.  It's not just for debugging
productionsystems, but also for chasing down problems in half-baked code prior to release. 

All good goals.

>  * check_tuphdr_xids

> The point is that when checking the table for corruption I avoid calling anything that might assert (or segfault, or
whatever).

I don't think that you can expect to avoid assertion failures in
general. I'll stick with your example. You're calling
TransactionIdDidCommit() from check_tuphdr_xids(), which will
interrogate the commit log and pg_subtrans. It's just not under your
control. I'm sure that you could get an assertion failure somewhere in
there, and even if you couldn't that could change at any time.

You've quasi-duplicated some sensitive code to do that much, which
seems excessive. But it's also not enough.

> I'm starting to infer from your comments that you see the landmine detection vehicle as also driving at high speed,
detectinglandmines on occasion by seeing them first, but frequently by failing to see them and just blowing up. 

That's not it. I would certainly prefer if the landmine detector
didn't blow up. Not having that happen is certainly a goal I share --
that's why PageGetItemIdCareful() exists. But not at any cost,
especially not when "blow up" means an assertion failure that users
won't actually see in production. Avoiding assertion failures like the
one you showed is likely to have a high cost (removing defensive
asserts in low level access method code) for a low benefit. Any
attempt to avoid having the checker itself blow up rather than throw
an error message needs to be assessed pragmatically, on a case-by-case
basis.

> One of the delays in submitting the most recent version of the patch is that I was having trouble creating a
reliable,portable btree corrupting regression test. 

To be clear, I think that corrupting data is very helpful with ad-hoc
testing during development.

> I did however address (some?) issues that you and others mentioned about the table corrupting regression test.
Perhapsthere are remaining issues that will show up on machines with different endianness than I have thus far tested,
butI don't see that they will be insurmountable.  Are you fundamentally opposed to that test framework? 

I haven't thought about it enough just yet, but I am certainly suspicious of it.

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: PG 13 release notes, first draft
Next
From: Dilip Kumar
Date:
Subject: Re: refactoring basebackup.c