Fix crash during recovery when redo segment is missing - Mailing list pgsql-hackers

From Nitin Jadhav
Subject Fix crash during recovery when redo segment is missing
Date
Msg-id CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
List pgsql-hackers
Hi,

In [1], Andres reported a bug where PostgreSQL crashes during recovery
if the segment containing the redo pointer does not exist. I have
attempted to address this issue and I am sharing a patch for the same.

The problem is that PostgreSQL does not PANIC up front when the redo
LSN and the checkpoint LSN are in separate segments and the file
containing the redo LSN is missing. Recovery proceeds instead and
crashes later. Andres has provided a detailed analysis of the behavior
across different settings and versions; please refer to [1] for more
information.

The patch fixes this by verifying that the REDO location exists once
we have successfully read the checkpoint record in InitWalRecovery().
This prevents control from reaching PerformWalRecovery() unless the
WAL file containing the redo record exists.

A new test script, 044_redo_segment_missing.pl, has been added to
validate this. To place the redo record in a different WAL file from
the checkpoint record, I wait for the checkpoint start message and
then issue a pg_switch_wal(), which should occur before the checkpoint
completes. Then I crash the server, and during the restart it should
log an appropriate error indicating that it could not find the redo
location. Please let me know if there is a better way to reproduce
this behavior.

I have tested and verified this with the various scenarios Andres
pointed out in [1]. Please note that this patch does not address error
checking in StartupXLOG(), CreateCheckPoint(), etc., nor does it focus
on cleaning up existing code.
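
For readers who want to see the condition concretely, here is a small
standalone C sketch of the idea. It is not the code from the patch. It
only illustrates how an LSN maps to a WAL segment file and how one
could test whether the redo LSN lives in a different, missing segment.
The helper names (lsn_to_segno, wal_file_name, redo_in_separate_segment,
redo_segment_exists) are hypothetical, the 16 MB segment size is the
default and is assumed here, and the sketch looks only at a local WAL
directory, ignoring archive restore.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumption: default 16 MB WAL segment size (it can be changed at initdb). */
#define WAL_SEGMENT_SIZE UINT64_C(0x1000000)

/* Segment number containing a given LSN (a byte position in the WAL). */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/* Build the 24-hex-digit WAL file name: timeline, then segno split in two. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}

/* True when the redo LSN and the checkpoint record are in different segments. */
static bool
redo_in_separate_segment(uint64_t redo_lsn, uint64_t checkpoint_lsn)
{
    return lsn_to_segno(redo_lsn) != lsn_to_segno(checkpoint_lsn);
}

/* Hypothetical helper: is the segment file holding redo_lsn present in wal_dir? */
static bool
redo_segment_exists(const char *wal_dir, uint32_t tli, uint64_t redo_lsn)
{
    char fname[32];
    char path[1024];

    wal_file_name(fname, sizeof(fname), tli, lsn_to_segno(redo_lsn));
    snprintf(path, sizeof(path), "%s/%s", wal_dir, fname);
    return access(path, F_OK) == 0;
}
```

In this sketch, the failing case described above corresponds to
redo_in_separate_segment() returning true while redo_segment_exists()
returns false. That is the situation in which recovery should stop
with a clear error rather than continue.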

Attaching the patch. Please review and share your feedback. Thanks to
Andres for spotting the bug and providing the detailed report [1].

[1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

Attachment
