Re: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: avoid multiple hard links to same WAL file after a crash
Date
Msg-id Ym6GZbqQdlalSKSG@paquier.xyz
Whole thread Raw
In response to Re: avoid multiple hard links to same WAL file after a crash  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: avoid multiple hard links to same WAL file after a crash  (Michael Paquier <michael@paquier.xyz>)
Re: avoid multiple hard links to same WAL file after a crash  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote:
> On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:
>> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
>>> I traced this back a while ago.  I believe the link() was first added in
>>> November 2000 as part of f0e37a8.  This even predates WAL recycling, which
>>> was added in July 2001 as part of 7d4d5c0.
>>
>> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
>> somwhere out of the ML.. This patch changed XLogFileInit to
>> supportusing existent files so that XLogWrite can use the new segment
>> provided by checkpoint and still allow XLogWrite to create a new
>> segment by itself.

Yes, I think that you are right here.  I also suspect that the
checkpoint command was facing a concurrency issue while working on
the feature and that Vadim saw that this part of the implementation
would be safer in the long run if we use link() followed by unlink().

> Yeah, I've been unable to find any discussion besides a brief reference to
> adding checkpointing [0].
>
> [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com

While looking at the history of this area, I have also noticed this
argument, telling also that this is a safety measure if this code were
to run in parallel, but that's without counting on the control file
lock hold while doing this operation anyway:
https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us

As mentioned already upthread, f0e37a8 is the origin of the
link()/unlink() business in the WAL segment initialization logic, and
also note 1f159e5 that has added a rename() as extra code path for
systems where link() was not working.

At the end, switching directly from durable_rename_excl() to
durable_rename() should be fine for the WAL segment initialization,
but we could do things a bit more carefully by adding a check on the
file existence before calling durable_rename() and issue a elog(LOG)
if a file is found, giving a mean for the WAL recycling to give up
peacefully as it does now.  Per my analysis, the TLI history file
created at the end of recovery ought to issue an elog(ERROR).

Now, I am surprised by the third code path of durable_rename_excl(),
as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
any issues, as link() should exit with EEXIST when the startup process
grabs the same history file concurrently.  It seems to me that in this
last case using durable_rename() could be an improvement and prevent
extra WAL receiver restarts as a TLI history fetched from the primary
via streaming or from some archives should be the same, but we could
be more careful, like the WAL init logic, by skipping the
durable_rename() and issuing an elog(LOG).  That would not be perfect,
still a bit better than the current state of HEAD.

As we are getting closer to the beta release, it looks safer to let
this change aside a bit longer and wait for v16 to be opened for
business on HEAD.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: testclient.exe installed under MSVC
Next
From: Michael Paquier
Date:
Subject: Re: testclient.exe installed under MSVC