Re: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: avoid multiple hard links to same WAL file after a crash |
Date | |
Msg-id | Ym6GZbqQdlalSKSG@paquier.xyz Whole thread Raw |
In response to | Re: avoid multiple hard links to same WAL file after a crash (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: avoid multiple hard links to same WAL file after a crash
Re: avoid multiple hard links to same WAL file after a crash |
List | pgsql-hackers |
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote: > On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: >> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >>> I traced this back a while ago. I believe the link() was first added in >>> November 2000 as part of f0e37a8. This even predates WAL recycling, which >>> was added in July 2001 as part of 7d4d5c0. >> >> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from >> somwhere out of the ML.. This patch changed XLogFileInit to >> supportusing existent files so that XLogWrite can use the new segment >> provided by checkpoint and still allow XLogWrite to create a new >> segment by itself. Yes, I think that you are right here. I also suspect that the checkpoint command was facing a concurrency issue while working on the feature and that Vadim saw that this part of the implementation would be safer in the long run if we use link() followed by unlink(). > Yeah, I've been unable to find any discussion besides a brief reference to > adding checkpointing [0]. > > [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com While looking at the history of this area, I have also noticed this argument, telling also that this is a safety measure if this code were to run in parallel, but that's without counting on the control file lock hold while doing this operation anyway: https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us As mentioned already upthread, f0e37a8 is the origin of the link()/unlink() business in the WAL segment initialization logic, and also note 1f159e5 that has added a rename() as extra code path for systems where link() was not working. At the end, switching directly from durable_rename_excl() to durable_rename() should be fine for the WAL segment initialization, but we could do things a bit more carefully by adding a check on the file existence before calling durable_rename() and issue a elog(LOG) if a file is found, giving a mean for the WAL recycling to give up peacefully as it does now. Per my analysis, the TLI history file created at the end of recovery ought to issue an elog(ERROR). Now, I am surprised by the third code path of durable_rename_excl(), as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause any issues, as link() should exit with EEXIST when the startup process grabs the same history file concurrently. It seems to me that in this last case using durable_rename() could be an improvement and prevent extra WAL receiver restarts as a TLI history fetched from the primary via streaming or from some archives should be the same, but we could be more careful, like the WAL init logic, by skipping the durable_rename() and issuing an elog(LOG). That would not be perfect, still a bit better than the current state of HEAD. As we are getting closer to the beta release, it looks safer to let this change aside a bit longer and wait for v16 to be opened for business on HEAD. -- Michael
Attachment
pgsql-hackers by date: