Back-patch of: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers

From Robert Pang
Subject Back-patch of: avoid multiple hard links to same WAL file after a crash
Date
Msg-id CAJhEC04tBkYPF4q2uS_rCytauvNEVqdBAzasBEokfceFhF=KDQ@mail.gmail.com
In response to Re: avoid multiple hard links to same WAL file after a crash  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
List pgsql-hackers
Dear team,

We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.
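
For anyone who wants to check an existing cluster for this condition, the following is a minimal sketch, assuming a Linux filesystem where hard links share an inode number. It is not part of the fix in [1]; the function name find_shared_wal_files and the idea of pointing it at the WAL directory are illustrative assumptions only. It groups the regular files in a directory by (device, inode) and reports any group with more than one name.

```python
import os
from collections import defaultdict


def find_shared_wal_files(wal_dir):
    """Return groups of file names in wal_dir that share one physical file.

    Hypothetical helper for illustration: two names with the same
    (st_dev, st_ino) pair are hard links to the same underlying file.
    """
    by_inode = defaultdict(list)
    for entry in os.scandir(wal_dir):
        # Only regular files; skip subdirectories such as archive_status.
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        by_inode[(st.st_dev, st.st_ino)].append(entry.name)
    # Keep only inodes that are reachable under more than one name.
    return sorted(sorted(names) for names in by_inode.values() if len(names) > 1)
```

An empty result means no two names in the directory point at the same file. Any non-empty group would correspond to the state described above, where writes to one segment name also change the other.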

We found this fix [1], which has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and that this
failure, even if rare, can leave a cluster unrecoverable, we would like
to discuss the possibility of back-patching this fix.

Are there any technical reasons not to back-patch this fix to older
major versions?

Thank you for your consideration.

Sincerely,
Robert Pang

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3

On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> > I'd agree with removing all the callers at the end.  pgrename() is
> > quite robust on Windows, but I'd keep the two checks in
> > writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> > consider a past TLI history file as in-use even if we have a crash
> > just after the file got created in the same path by the same standby,
> > and the WAL segment init part.  Your patch does that.
>
> As v16 is now open for business, I have revisited this change and
> applied 0001 to change all the callers (aka removal of the assertion
> for the WAL receiver when it overwrites a TLI history file).  The
> commit log includes details about the reasoning of all the areas
> changed, for clarity, as of the WAL recycling part, the TLI history
> file part and basic_archive.
> --
> Michael


