Re: Fix crash during recovery when redo segment is missing - Mailing list pgsql-hackers
| From | Kirill Reshke |
|---|---|
| Subject | Re: Fix crash during recovery when redo segment is missing |
| Date | |
| Msg-id | CALdSSPieG0wqQnjxyF+U23QQfZfpGnUFvKsJR0pxAbUoY=Z0vA@mail.gmail.com Whole thread Raw |
| In response to | Re: Fix crash during recovery when redo segment is missing (Nitin Jadhav <nitinjadhavpostgres@gmail.com>) |
| Responses |
Re: Fix crash during recovery when redo segment is missing
|
| List | pgsql-hackers |
On Thu, 4 Dec 2025 at 11:37, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote: > > Apologies, I missed attaching the patch earlier. Please find the v2 > version attached. > > Best Regards, > Nitin Jadhav > Azure Database for PostgreSQL > Microsoft > > On Thu, Dec 4, 2025 at 12:01 PM Nitin Jadhav > <nitinjadhavpostgres@gmail.com> wrote: > > > > The patch wasn’t applying cleanly on master, so I’ve rebased it and > > also added it to the PG19‑4 CommitFest: > > https://commitfest.postgresql.org/patch/6279/ > > Please review and share your feedback. > > > > Best Regards, > > Nitin Jadhav > > Azure Database for PostgreSQL > > Microsoft > > > > Best Regards, > > Nitin Jadhav > > Azure Database for PostgreSQL > > Microsoft > > > > > > On Fri, Feb 21, 2025 at 4:29 PM Nitin Jadhav > > <nitinjadhavpostgres@gmail.com> wrote: > > > > > > Hi, > > > > > > In [1], Andres reported a bug where PostgreSQL crashes during recovery > > > if the segment containing the redo pointer does not exist. I have > > > attempted to address this issue and I am sharing a patch for the same. > > > > > > The problem was that PostgreSQL did not PANIC when the redo LSN and > > > checkpoint LSN were in separate segments, and the file containing the > > > redo LSN was missing, leading to a crash. Andres has provided a > > > detailed analysis of the behavior across different settings and > > > versions. Please refer to [1] for more information. This issue arises > > > because PostgreSQL does not PANIC initially. > > > > > > The issue was resolved by ensuring that the REDO location exists once > > > we successfully read the checkpoint record in InitWalRecovery(). This > > > prevents control from reaching PerformWalRecovery() unless the WAL > > > file containing the redo record exists. A new test script, > > > 044_redo_segment_missing.pl, has been added to validate this. To > > > populate the WAL file with a redo record different from the WAL file > > > with the checkpoint record, I wait for the checkpoint start message > > > and then issue a pg_switch_wal(), which should occur before the > > > completion of the checkpoint. Then, I crash the server, and during the > > > restart, it should log an appropriate error indicating that it could > > > not find the redo location. Please let me know if there is a better > > > way to reproduce this behavior. I have tested and verified this with > > > the various scenarios Andres pointed out in [1]. Please note that this > > > patch does not address error checking in StartupXLOG(), > > > CreateCheckPoint(), etc., nor does it focus on cleaning up existing > > > code. > > > > > > Attaching the patch. Please review and share your feedback. Thanks to > > > Andres for spotting the bug and providing the detailed report [1]. > > > > > > [1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de > > > > > > Best Regards, > > > Nitin Jadhav > > > Azure Database for PostgreSQL > > > Microsoft Hi! 1) Please do not top-post on these lists 2) I did not get the exact reason this thread is different from [0] 3) Your tests are using a busy loop with usleep which nowadays is considered as bad practice. There 3 of such places, and I think the first two of them can be replaced with injection point wait. > +# Wait for the checkpoint to complete > +my $checkpoint_complete = 0; > +foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) { > + if ($node->log_contains("checkpoint complete", $logstart)) { > + $checkpoint_complete = 1; > + last; > + } > + usleep(100_000); > +} 4) There are comments from Robert, which are not covered [1]. [0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de [1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com -- Best regards, Kirill Reshke
pgsql-hackers by date: