Re: new heapcheck contrib module - Mailing list pgsql-hackers

From Robert Haas
Subject Re: new heapcheck contrib module
Date
Msg-id CA+TgmobgkU0U-XDcpRZvEWb-btOsiZXU3_Uf-15rge+JvqcJsg@mail.gmail.com
Whole thread Raw
In response to Re: new heapcheck contrib module  ("Andrey M. Borodin" <x4mmm@yandex-team.ru>)
Responses Re: new heapcheck contrib module  ("Andrey M. Borodin" <x4mmm@yandex-team.ru>)
List pgsql-hackers
On Fri, Aug 28, 2020 at 1:07 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG
anyhow.
> Will your module be able to gather tid's of similar corruptions?
>
> server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
> ERROR:  58P01: could not access status of transaction 636558742
> DETAIL:  Could not open file "pg_xact/025F": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:913
> Time: 3439.915 ms (00:03.440)

This kind of thing gets really tricky. PostgreSQL uses errors in tons
of places to report problems, and if you want to accumulate a list of
errors and report them all rather than just letting the first one
cancel the operation, you need special handling for each individual
error you want to bypass. A tool like this naturally wants to use as
much PostgreSQL infrastructure as possible, to avoid duplicating a ton
of code and creating a bloated monstrosity, but all that code can
throw errors. I think the code in its current form is trying to be
resilient against problems on the table pages that it is actually
checking, but it can't necessarily handle gracefully corruption in
other parts of the system. For instance:

- CLOG could be truncated, as in your example
- the disk files could have had their permissions changed so that they
can't be accessed
- the PageIsVerified() check might fail when pages are read
- the TOAST table's metadata in pg_class/pg_attribute/etc. could be corrupted
- ...or the files for those system catalogs could've had their
permissions changed
- ....or they could contain invalid pages
- ...or their indexes could be messed up

I think there are probably a bunch more, and I don't think it's
practical to allow this tool to continue after arbitrary stuff goes
wrong. It'll be too much code and impossible to maintain. In the case
you mention, I think we should view that as a problem with clog rather
than a problem with the table, and thus out of scope.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Clang UndefinedBehaviorSanitize (Postgres14) Detected undefined-behavior
Next
From: Gilles Darold
Date:
Subject: Re: New default role- 'pg_read_all_data'