Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication |
Date | |
Msg-id | CAD21AoCrtO8gBORbHVKu3ju9hS953xwWnc1HauwXSejOWji75Q@mail.gmail.com Whole thread Raw |
In response to | pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication (RKN Sai Krishna <rknsaiforpostgres@gmail.com>) |
List | pgsql-hackers |
Hi, On Wed, Jun 22, 2022 at 2:44 PM RKN Sai Krishna <rknsaiforpostgres@gmail.com> wrote: > > Hi, > > Problem: Today when a data page is corrupted in the primary postgres with physical replication (sync or async standbys),there seems to be no way to repair it easily and we rely on PITR to recreate the postgres server or drop the corruptedtable (of course this is not an option for important customer tables, but may be okay for some maintenance or temporarytables). PITR is costly to do in a production environment oftentimes as it involves creation of the full-blown postgresfrom the base backup and causing downtime for the customers. > > Solution: Make use of the uncorrupted page present in sync or async standby. The proposed tool/extension pg_page_repair(as we call it) can fetch the uncorrupted page from sync or async standby and overwrite the corrupted one onthe primary. Yes, there will be a challenge in making sure that the WAL is replayed completely and standby is up-to-dateso that we are sure that stale pages are not copied across. A simpler idea could be that the pg_page_repair canwait until the standby replays/catches up with the primary's flush LSN before fetching the uncorrupted page. A downsideof this approach is that the pg_page_repair waits for long or rather infinitely if the replication lag is huge. Aswe all know that the replication lag is something a good postgres solution will always monitor to keep it low, if true,the pg_page_repair is guaranteed to not wait for longer. Another idea could be that the pg_page_repair gets the basepage from the standby and applies all the WAL records pertaining to the corrupted page using the base page to get theuncorrupted page. This requires us to pull the replay logic from the core to pg_page_repair which isn't easy. Hence wepropose to go with approach 1, but open to discuss on approach 2 as well. We suppose that the solution proposed in thisthread holds good even for pages corresponding to indexes. I'm interested in this topic and recalled I did some research on the first idea while writing experimental code several years ago[1]. The corruption that can be fixed by this feature is mainly physical corruption, for example, introduced by storage array cache corruption, array firmware bugs, filesystem bugs, is that right? Logically corrupt blocks are much more likely to have been introduced as a result of a failure or a bug in PostgreSQL, which would end up propagating to physical standbys. > > Implementation Choices: pg_page_repair can either take the corrupted page info (db id, rel id, block number etc.) or justa relation name and automatically figure out the corrupted page using pg_checksums for instance or just database nameand automatically figure out all the corrupted pages. It can either repair the corrupted pages online (only the corruptedtable is inaccessible, the server continues to run) or take downtime if there are many corrupted pages. Since the server must be shutdown cleanly before running pg_checksums if we want to verify checksums of the page while the server is running we would need to do online checksum verification we discussed before[2]. > > Future Scope: pg_page_repair can be integrated to the core so that the postgres will repair the pages automatically withoutmanual intervention. > > Other Solutions: We did consider an approach where the tool could obtain the FPI from WAL and replay till the latest WALrecord to repair the page. But there could be limitations such as FPI and related WAL not being available in primary/archivelocation. How do we find the FPI of the corrupted page effectively from WAL? We could seek WAL records from backward but it could take a quite long time. Regards, [1] https://github.com/MasahikoSawada/pgtools/tree/master/page_repair [2] https://www.postgresql.org/message-id/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: