Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication
Date
Msg-id CAD21AoCrtO8gBORbHVKu3ju9hS953xwWnc1HauwXSejOWji75Q@mail.gmail.com
Whole thread Raw
In response to pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication  (RKN Sai Krishna <rknsaiforpostgres@gmail.com>)
List pgsql-hackers
Hi,

On Wed, Jun 22, 2022 at 2:44 PM RKN Sai Krishna
<rknsaiforpostgres@gmail.com> wrote:
>
> Hi,
>
> Problem: Today when a data page is corrupted in the primary postgres with physical replication (sync or async
standbys),there seems to be no way to repair it easily and we rely on PITR to recreate the postgres server or drop the
corruptedtable (of course this is not an option for important customer tables, but may be okay for some maintenance or
temporarytables). PITR is costly to do in a production environment oftentimes as it involves creation of the full-blown
postgresfrom the base backup and causing downtime for the customers. 
>
> Solution: Make use of the uncorrupted page present in sync or async standby. The proposed tool/extension
pg_page_repair(as we call it) can fetch the uncorrupted page from sync or async standby and overwrite the corrupted one
onthe primary. Yes, there will be a challenge in making sure that the WAL is replayed completely and standby is
up-to-dateso that we are sure that stale pages are not copied across. A simpler idea could be that the pg_page_repair
canwait until the standby replays/catches up with the primary's flush LSN before fetching the uncorrupted page. A
downsideof this approach is that the pg_page_repair waits for long or rather infinitely if the replication lag is huge.
Aswe all know that the replication lag is something a good postgres solution will always monitor to keep it low, if
true,the pg_page_repair is guaranteed to not wait for longer. Another idea could be that the pg_page_repair gets the
basepage from the standby and applies all the WAL records pertaining to the corrupted page using the base page to get
theuncorrupted page. This requires us to pull the replay logic from the core to pg_page_repair which isn't easy. Hence
wepropose to go with approach 1, but open to discuss on approach 2 as well. We suppose that the solution proposed in
thisthread holds good even for pages corresponding to indexes. 

I'm interested in this topic and recalled I did some research on the
first idea while writing experimental code several years ago[1].

The corruption that can be fixed by this feature is mainly physical
corruption, for example, introduced by storage array cache corruption,
array firmware bugs, filesystem bugs, is that right? Logically corrupt
blocks are much more likely to have been introduced as a result of a
failure or a bug in PostgreSQL, which would end up propagating to
physical standbys.

>
> Implementation Choices: pg_page_repair can either take the corrupted page info (db id, rel id, block number etc.) or
justa relation name and automatically figure out the corrupted page using pg_checksums for instance or just database
nameand automatically figure out all the corrupted pages. It can either repair the corrupted pages online (only the
corruptedtable is inaccessible, the server continues to run) or take downtime if there are many corrupted pages. 

Since the server must be shutdown cleanly before running pg_checksums
if we want to verify checksums of the page while the server is running
we would need to do online checksum verification we discussed
before[2].

>
> Future Scope: pg_page_repair can be integrated to the core so that the postgres will repair the pages automatically
withoutmanual intervention. 
>
> Other Solutions: We did consider an approach where the tool could obtain the FPI from WAL and replay till the latest
WALrecord to repair the page. But there could be limitations such as FPI and related WAL not being available in
primary/archivelocation. 

How do we find the FPI of the corrupted page effectively from WAL? We
could seek WAL records from backward but it could take a quite long
time.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/page_repair
[2] https://www.postgresql.org/message-id/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Noah Misch
Date:
Subject: Re: Postgres perl module namespace