Re: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: avoid multiple hard links to same WAL file after a crash
Date
Msg-id YmoscdSNWdUoKR3A@paquier.xyz
Whole thread Raw
In response to Re: avoid multiple hard links to same WAL file after a crash  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote:
> Here is a new patch set with these assertions added.  I think at least the
> xlog.c change ought to be back-patched.  The problem may be unlikely, but
> AFAICT the possible consequences include WAL corruption.

Okay, so I have applied this stuff this morning to see what the
buildfarm had to say, and we have finished with a set of failures in
various buildfarm members:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26

All of them did not like the part where we assume that a TLI history
file written by a WAL receiver should not exist beforehand, but as
025_stuck_on_old_timeline.pl is showing, a standby may attempt to
retrieve a TLI history file after getting it from the archives.

I was analyzing the whole thing, and it looks like a race condition.
Per the the buildfarm logs, we have less than 5ms between the moment
the startup process retrieves the history file of TLI 2 from the
archives and the moment the WAL receiver decides to check if this TLI
file exists.  If it does not exist, it would then retrieve it from the
primary via streaming.  So I guess that the sequence of events is
that:
- In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the
existence of the history file for TLI 2, does not find it.
- The startup process retrieves the file from the archives.
- The WAL receiver goes through the internal loop of
WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the
primary's stream.

Switching from durable_rename_excl() to durable_rename() would mean
that we'd overwrite the TLI file received from the primary stream over
what's been retrieved from the archives.  That does not strike me as
an issue in itself and that should be safe, so the comment is
misleading, and we can live without the assertion in
writeTimeLineHistoryFile() called by the WAL receiver.  Now, I think
that we'd better keep some belts in writeTimeLineHistory() called by
the startup process at the end-of-recovery as I should never ever have
a TLI file generated when selecting a new timeline.  Perhaps this
should be a elog(ERROR) at least, with a check on the file existence
before calling durable_rename()?

Anyway, my time is constrained next week due to the upcoming Japanese
Golden Week and the buildfarm has to be stable, so I have reverted the
change for now.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: bogus: logical replication rows/cols combinations
Next
From: Shinya Kato
Date:
Subject: Re: Add --{no-,}bypassrls flags to createuser