Thread: avoid multiple hard links to same WAL file after a crash

avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
Hi hackers,

I am splitting this off of a previous thread aimed at reducing archiving
overhead [0], as I believe this fix might deserve back-patching.

Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1] demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling.  Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting.  This seems likely to lead to WAL corruption.

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling.  This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.
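
The crash window described above can be sketched in plain C. This is only an illustration of a link()-then-unlink() "rename", not the PostgreSQL implementation; the helper names begin_rename_excl(), finish_rename() and link_count() are hypothetical and exist only for this sketch.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the hard-link count of "path", or -1 if
 * stat() fails (for example because the name no longer exists).
 */
static long
link_count(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long) st.st_nlink;
}

/*
 * Step 1 of a link()-based rename.  link() fails with EEXIST if newpath
 * already exists.  A crash after this call and before finish_rename()
 * leaves both names pointing at the same file.
 */
static int
begin_rename_excl(const char *oldpath, const char *newpath)
{
    return link(oldpath, newpath);
}

/* Step 2: drop the old name, completing the "rename". */
static int
finish_rename(const char *oldpath)
{
    return unlink(oldpath);
}
```

Between the two steps, link_count() reports 2 for both names. That is the state a crash can leave behind, and recycling the old name again after restart would then hand out a second name for the same file.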

This patch also sets the stage for reducing archiving overhead (as
discussed in the other thread [0]).  The proposed change to reduce
archiving overhead will make it more likely that the server will attempt to
re-archive segments after a crash.  This might lead to archive corruption
if the server concurrently writes to the same file via the aforementioned
bug.

[0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13
[1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Robert Haas
Date:
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Presently, WAL recycling uses durable_rename_excl(), which notes that a
> crash at an unfortunate moment can result in two links to the same file.
> My testing [1] demonstrated that it was possible to end up with two links
> to the same file in pg_wal after a crash just before unlink() during WAL
> recycling.  Specifically, the test produced links to the same file for the
> current WAL file and the next one because the half-recycled WAL file was
> re-recycled upon restarting.  This seems likely to lead to WAL corruption.

Wow, that's bad.

> The attached patch prevents this problem by using durable_rename() instead
> of durable_rename_excl() for WAL recycling.  This removes the protection
> against accidentally overwriting an existing WAL file, but there shouldn't
> be one.

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd favor changing it to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I see that durable_rename_excl() has the following comment: "Similar
> to durable_rename(), except that this routine tries (but does not
> guarantee) not to overwrite the target file." If those are the desired
> semantics, we could achieve them more simply and more safely by just
> trying to stat() the target file and then, if it's not found, call
> durable_rename(). I think that would be a heck of a lot safer than
> what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
all.  Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().
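
For illustration, the guarantee on systems with a working link() comes from link() itself refusing to replace an existing name. A tiny probe, with the hypothetical name try_link(), shows this:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical probe: try to create newpath as a hard link to oldpath.
 * Returns 0 on success, otherwise the errno value reported by link().
 */
static int
try_link(const char *oldpath, const char *newpath)
{
    if (link(oldpath, newpath) == 0)
        return 0;
    return errno;
}
```

When newpath already exists, try_link() returns EEXIST and the existing file is left untouched, whereas rename() on the same pair would replace it.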

> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd favor changing it to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I think there might be another problem.  The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file.  This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths.  My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original.  Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

> Your proposed fix is OK if we don't want to do any of that stuff, but
> personally I'm much more inclined to blame durable_rename_excl() for
> being horrible than I am to blame the calling code for using it
> improvidently.

I do agree that it's worth examining this stuff a bit closer.  I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote:
> On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
>> I'd actually be in favor of nuking durable_rename_excl() from orbit
>> and putting the file-exists tests in the callers. Otherwise, someone
>> might assume that it actually has the semantics that its name
>> suggests, which could be pretty disastrous. If we don't want to do
>> that, then I'd favor changing it to do the stat-then-durable-rename thing
>> internally, so we don't leave hard links lying around in *any* code
>> path. Perhaps that's the right answer for the back-branches in any
>> case, since there could be third-party code calling this function.
> 
> I think there might be another problem.  The man page for rename() seems to
> indicate that overwriting an existing file also introduces a window where
> the old and new path are hard links to the same file.  This isn't a problem
> for the WAL files because we should never be overwriting an existing one,
> but I wonder if it's a problem for other code paths.  My guess is that many
> code paths that overwrite an existing file are first writing changes to a
> temporary file before atomically replacing the original.  Those paths are
> likely okay, too, as you can usually just discard any existing temporary
> files.

Ha, so there are only a few callers of durable_rename_excl() in the
PostgreSQL tree.  One is basic_archive.c, which is already doing a stat()
check.  IIRC I only used durable_rename_excl() here to handle the case
where multiple servers are writing archives to the same location.  If that
happened, the archiver process would begin failing.  If a crash left two
hard links to the same file around, we will silently succeed the next time
around thanks to the compare_files() check.  Besides the WAL installation
code, the only other callers are in timeline.c, and both note that the use
of durable_rename_excl() is for "paranoidly trying to avoid overwriting an
existing file (there shouldn't be one)."

So AFAICT basic_archive.c is the only caller with a strong reason for using
durable_rename_excl(), and even that might not be worth keeping it around.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd favor changing it to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I've attached a patch that simply removes durable_rename_excl() and
replaces existing calls with durable_rename().  I noticed that Andres
expressed similar misgivings about durable_rename_excl() last year [0] [1].
I can create a stat-then-durable-rename version of this for back-patching
if that is still the route we want to go.

[0] https://postgr.es/m/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de
[1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Robert Haas
Date:
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> > I see that durable_rename_excl() has the following comment: "Similar
> > to durable_rename(), except that this routine tries (but does not
> > guarantee) not to overwrite the target file." If those are the desired
> > semantics, we could achieve them more simply and more safely by just
> > trying to stat() the target file and then, if it's not found, call
> > durable_rename(). I think that would be a heck of a lot safer than
> > what this function is doing right now.
>
> IIUC it actually does guarantee that you won't overwrite the target file
> when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
> all.  Using stat() before rename() would therefore weaken this check for
> systems with working link(), but it'd probably strengthen it for systems
> without a working link().

Sure, but a guarantee that happens on only some systems isn't worth
much. And, if it comes at the price of potentially having multiple
hard links to the same file in obscure situations, that seems like it
could easily cause more problems than this whole scheme can ever hope
to solve.

> I think there might be another problem.  The man page for rename() seems to
> indicate that overwriting an existing file also introduces a window where
> the old and new path are hard links to the same file.  This isn't a problem
> for the WAL files because we should never be overwriting an existing one,
> but I wonder if it's a problem for other code paths.  My guess is that many
> code paths that overwrite an existing file are first writing changes to a
> temporary file before atomically replacing the original.  Those paths are
> likely okay, too, as you can usually just discard any existing temporary
> files.

I wonder if this is really true. I thought rename() was supposed to be atomic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: avoid multiple hard links to same WAL file after a crash

From
Justin Pryzby
Date:
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
> > I think there might be another problem.  The man page for rename() seems to
> > indicate that overwriting an existing file also introduces a window where
> > the old and new path are hard links to the same file.  This isn't a problem
> > for the WAL files because we should never be overwriting an existing one,
> > but I wonder if it's a problem for other code paths.  My guess is that many
> > code paths that overwrite an existing file are first writing changes to a
> > temporary file before atomically replacing the original.  Those paths are
> > likely okay, too, as you can usually just discard any existing temporary
> > files.
> 
> I wonder if this is really true. I thought rename() was supposed to be atomic.

Looks like it's atomic in that it's not like cp + rm, but not atomic the other
way you want.

|       If newpath already exists, it will be atomically replaced, so that
|       there is no point at which another process attempting to access
|       newpath will find it missing.  However, there will probably be a
|       window in which both oldpath and newpath refer to the file being
|       renamed.



Re: avoid multiple hard links to same WAL file after a crash

From
Kyotaro Horiguchi
Date:
At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> The attached patch prevents this problem by using durable_rename() instead
> of durable_rename_excl() for WAL recycling.  This removes the protection
> against accidentally overwriting an existing WAL file, but there shouldn't
> be one.

From another direction, if the new segment was the currently active
one, we just mustn't install it. Otherwise we don't care.

So, the only thing we need to care about is a segment switch. Without
one, the segment that InstallXLogFileSegment finds via its stat loop is
known to be safe to overwrite even if it exists.

When a segment switch finds an existing file, there is no problem since
the segment switch doesn't create a new segment.  Otherwise the segment
switch always calls InstallXLogFileSegment.  The section from searching
for an empty segment slot until calling durable_rename_excl() is
protected by ControlFileLock. Thus if a process is in that section, no
other process can switch to a newly-created segment.

If this diagnosis is correct, the comment is proved to be paranoid.


>     * Perform the rename using link if available, paranoidly trying to avoid
>     * overwriting an existing file (there shouldn't be one).

As a result, I think Nathan's fix is correct: we can safely use
durable_rename() instead.

I also propose using renameat2() on Linux so that the regression tests
can detect the contradicting case, even if only on Linux.
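
As a rough sketch of what a renameat2()-based check could look like on Linux: RENAME_NOREPLACE makes the kernel fail with EEXIST instead of overwriting the target. The wrapper name rename_noreplace() is hypothetical, the raw syscall is used so the sketch does not depend on a libc wrapper, and on systems without SYS_renameat2 it simply reports ENOSYS. Some file systems do not support the flag and fail with EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>          /* AT_FDCWD */
#include <stdio.h>          /* may define RENAME_NOREPLACE on glibc */
#include <sys/syscall.h>    /* SYS_renameat2, where available */
#include <unistd.h>         /* syscall() */

#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)   /* value used by the Linux kernel */
#endif

/*
 * Hypothetical wrapper: rename oldpath to newpath, failing rather than
 * overwriting if newpath exists.  Returns 0 on success, -1 with errno set
 * on failure.
 */
static int
rename_noreplace(const char *oldpath, const char *newpath)
{
#ifdef SYS_renameat2
    return (int) syscall(SYS_renameat2, AT_FDCWD, oldpath,
                         AT_FDCWD, newpath, RENAME_NOREPLACE);
#else
    (void) oldpath;
    (void) newpath;
    errno = ENOSYS;             /* no renameat2 on this platform */
    return -1;
#endif
}
```

Unlike link() followed by unlink(), this never leaves two names for the same file, which is what would make it usable as a test-only cross-check.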

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: avoid multiple hard links to same WAL file after a crash

From
Robert Haas
Date:
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> So, the only thing we need to care is segment switch. Without it, the
> segment that InstallXLogFileSegment found by the stat loop is known to
> be safe to overwrite even if exists.
>
> When segment switch finds an existing file, it's no problem since the
> segment switch doesn't create a new segment.  Otherwise segment switch
> always calls InstallXLogFileSegment.  The section from searching for
> an empty segment slot until calling durable_rename_excl() is protected
> by ControlFileLock. Thus if a process is in the section, no other
> process can switch to a newly-created segment.
>
> If this diagnosis is correct, the comment is proved to be paranoid.

It's sometimes difficult to understand what problems really old code
comments are worrying about. For example, could they have been
worrying about bugs in the code? Could they have been worrying about
manual interference with the pg_wal directory? It's hard to know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: avoid multiple hard links to same WAL file after a crash

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> If this diagnosis is correct, the comment is proved to be paranoid.

> It's sometimes difficult to understand what problems really old code
> comments are worrying about. For example, could they have been
> worrying about bugs in the code? Could they have been worrying about
> manual interference with the pg_wal directory? It's hard to know.

"git blame" can be helpful here, if you trace back to when the comment
was written and then try to find the associated mailing-list discussion.
(That leap can be difficult for commits pre-dating our current
convention of including links in the commit message, but it's usually
not *that* hard to locate contemporaneous discussion.)

            regards, tom lane



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
>> <horikyota.ntt@gmail.com> wrote:
>>> If this diagnosis is correct, the comment is proved to be paranoid.
> 
>> It's sometimes difficult to understand what problems really old code
>> comments are worrying about. For example, could they have been
>> worrying about bugs in the code? Could they have been worrying about
>> manual interference with the pg_wal directory? It's hard to know.
> 
> "git blame" can be helpful here, if you trace back to when the comment
> was written and then try to find the associated mailing-list discussion.
> (That leap can be difficult for commits pre-dating our current
> convention of including links in the commit message, but it's usually
> not *that* hard to locate contemporaneous discussion.)

I traced this back a while ago.  I believe the link() was first added in
November 2000 as part of f0e37a8.  This even predates WAL recycling, which
was added in July 2001 as part of 7d4d5c0.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Kyotaro Horiguchi
Date:
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi
> >> <horikyota.ntt@gmail.com> wrote:
> >>> If this diagnosis is correct, the comment is proved to be paranoid.
> > 
> >> It's sometimes difficult to understand what problems really old code
> >> comments are worrying about. For example, could they have been
> >> worrying about bugs in the code? Could they have been worrying about
> >> manual interference with the pg_wal directory? It's hard to know.
> > 
> > "git blame" can be helpful here, if you trace back to when the comment
> > was written and then try to find the associated mailing-list discussion.
> > (That leap can be difficult for commits pre-dating our current
> > convention of including links in the commit message, but it's usually
> > not *that* hard to locate contemporaneous discussion.)
> 
> I traced this back a while ago.  I believe the link() was first added in
> November 2000 as part of f0e37a8.  This even predates WAL recycling, which
> was added in July 2001 as part of 7d4d5c0.

f0e37a8 lacks discussion.  It introduced the CHECKPOINT command from
somewhere outside the ML.  That commit changed XLogFileInit to
support using existing files, so that XLogWrite can use the new segment
provided by checkpoint while still allowing XLogWrite to create a new
segment by itself.

Just before the commit, calls to XLogFileInit were protected (or
serialized) by logwr_lck.  As of the commit, calls to the same function
were still serialized, by ControlFileLockId.

I *guess* that Vadim faced/noticed a race condition when he added
checkpoint, and thus introduced the link+remove protocol, but that it
finally became useless once the call to XLogFileInit was moved inside
the ControlFileLockId section.  But, of course, this whole story is
merely a guess.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> I traced this back a while ago.  I believe the link() was first added in
>> November 2000 as part of f0e37a8.  This even predates WAL recycling, which
>> was added in July 2001 as part of 7d4d5c0.
> 
> f0e37a8 lacks discussion.  It introduced the CHECKPOINT command from
> somewhere outside the ML.  That commit changed XLogFileInit to
> support using existing files so that XLogWrite can use the new segment
> provided by checkpoint and still allow XLogWrite to create a new
> segment by itself.

Yeah, I've been unable to find any discussion besides a brief reference to
adding checkpointing [0].

[0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
> On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I think there might be another problem.  The man page for rename() seems to
>> indicate that overwriting an existing file also introduces a window where
>> the old and new path are hard links to the same file.  This isn't a problem
>> for the WAL files because we should never be overwriting an existing one,
>> but I wonder if it's a problem for other code paths.  My guess is that many
>> code paths that overwrite an existing file are first writing changes to a
>> temporary file before atomically replacing the original.  Those paths are
>> likely okay, too, as you can usually just discard any existing temporary
>> files.
>
> I wonder if this is really true. I thought rename() was supposed to be atomic.

Not always.  For example, some old versions of MacOS have a non-atomic
implementation of rename(), like prairiedog with 10.4.  Even 10.5 does
not handle atomicity as far as I recall.  In short, it looks like a bad
idea to me to rely on this idea at all.  Some FSes have their own way
of handling things, as well, but I am not much into this world.

Saying that, it would be nice to see durable_rename_excl() gone as it
has created quite a bit of pain for us in the past years.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote:
> Saying that, it would be nice to see durable_rename_excl() gone as it
> has created quite a bit of pain for us in the past years.

Yeah, I think this is the right thing to do.  Patch upthread [0].

For back-branches, I suspect we'll want to remove all uses of
durable_rename_excl() but leave the function around for any extensions that
are using it.  Of course, we'd also need a big comment imploring folks not
to add any more callers.  Another option would be to change the behavior of
durable_rename_excl() to something that we think is safer (e.g., stat then
rename), but that might just introduce a different set of problems.

[0] https://postgr.es/m/20220408194345.GA1541826%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
>> I wonder if this is really true. I thought rename() was supposed to be atomic.

> Not always.  For example, some old versions of MacOS have a non-atomic
> implementation of rename(), like prairiedog with 10.4.  Even 10.5 does
> not handle atomicity as far as I recall.

I think that's not talking about the same thing.  POSIX requires rename(2)
to replace an existing target link atomically:

    If the link named by the new argument exists, it shall be removed and
    old renamed to new. In this case, a link named new shall remain
    visible to other threads throughout the renaming operation and refer
    either to the file referred to by new or old before the operation
    began.

(It's that requirement that ancient macOS fails to meet.)

However, I do not see any text that addresses the question of whether
the old link disappears atomically with the appearance of the new link,
and it seems like that'd be pretty impractical to ensure in cases like
moving a link from one directory to another.  (What would it even mean
to say that, considering that a thread can't read the two directories
at the same instant?)  From a crash-safety standpoint, it'd surely be
better to make the new link before removing the old, so I imagine
that's what most file systems do.

            regards, tom lane



Re: avoid multiple hard links to same WAL file after a crash

From
Greg Stark
Date:
The readdir interface allows a process to be in the middle of reading
a directory.  Unless a kernel is willing either to materialize the
entire directory list when the readdir starts, or to lock the entire
directory against modification for as long as a process has a
readdir fd open, it is always going to be possible for a process to
have previously read the old directory entry and later see the new
directory entry.  Kernels don't play any MVCC or cmin type of games, so
they're not going to be able to prevent it.

What's worse of course is that it may only happen in very large
directories. Most directories fit on a single block and readdir may
buffer up all the entries a block at a time for efficiency. So it may
only be visible on very large directories that span multiple blocks.



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
Here is an attempt at creating something that can be back-patched.  0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched.  0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching.  I imagine
0002 will need to be held back for v16devel.

I think back-patching 0001 will encounter a couple of small obstacles.  For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13.  I don't mind producing a patch for each back-branch if needed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote:
> Here is an attempt at creating something that can be back-patched.  0001
> simply replaces calls to durable_rename_excl() with durable_rename() and is
> intended to be back-patched.  0002 removes the definition of
> durable_rename_excl() and is _not_ intended for back-patching.  I imagine
> 0002 will need to be held back for v16devel.

I would not mind applying 0002 on HEAD now to avoid more uses of this
API, and I can get behind 0001 after thinking more about it.

> I think back-patching 0001 will encounter a couple of small obstacles.  For
> example, the call in basic_archive won't exist on most of the
> back-branches, and durable_rename_excl() was named durable_link_or_rename()
> before v13.  I don't mind producing a patch for each back-branch if needed.

I am not sure that we have any need to backpatch this change based on the
unlikeliness of the problem, TBH.  One thing that is itching me a bit,
like Robert upthread, is that we don't check anymore that the newfile
does not exist in the code paths because we never expect one.  It is
possible to use stat() for that.  But access() within a simple
assertion would be simpler?  Say something like:
Assert(access(path, F_OK) != 0 && errno == ENOENT);
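
A minimal sketch of how that check could be wrapped, assuming a hypothetical helper named path_is_absent(). It returns true only when access() fails specifically with ENOENT, so other failures such as EACCES do not count as proof of absence.

```c
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: true only if access() fails because the path does
 * not exist.  Any other failure returns false, since absence cannot be
 * confirmed in that case.
 */
static bool
path_is_absent(const char *path)
{
    return access(path, F_OK) != 0 && errno == ENOENT;
}
```

A caller would then write something like Assert(path_is_absent(path)) just before the durable_rename() call. Like any existence check done ahead of the rename, it is a sanity check and not a guarantee.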

The case for basic_archive is limited as the comment of the patch
states, but that would be helpful for the two calls in timeline.c and
the one in xlog.c in the long-term.  And this has no need to be part
of fd.c, this can be added before the durable_rename() calls.  What do
you think?
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote:
> I am not sure that we have any need to backpatch this change based on the
> unlikeliness of the problem, TBH.  One thing that is itching me a bit,
> like Robert upthread, is that we don't check anymore that the newfile
> does not exist in the code paths because we never expect one.  It is
> possible to use stat() for that.  But access() within a simple
> assertion would be simpler?  Say something like:
> Assert(access(path, F_OK) != 0 && errno == ENOENT);
> 
> The case for basic_archive is limited as the comment of the patch
> states, but that would be helpful for the two calls in timeline.c and
> the one in xlog.c in the long-term.  And this has no need to be part
> of fd.c, this can be added before the durable_rename() calls.  What do
> you think?

Here is a new patch set with these assertions added.  I think at least the
xlog.c change ought to be back-patched.  The problem may be unlikely, but
AFAICT the possible consequences include WAL corruption.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote:
> Here is a new patch set with these assertions added.  I think at least the
> xlog.c change ought to be back-patched.  The problem may be unlikely, but
> AFAICT the possible consequences include WAL corruption.

Okay, so I have applied this stuff this morning to see what the
buildfarm had to say, and we have finished with a set of failures in
various buildfarm members:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26

None of them liked the part where we assume that a TLI history
file written by a WAL receiver should not exist beforehand, but as
025_stuck_on_old_timeline.pl is showing, a standby may attempt to
retrieve a TLI history file after getting it from the archives.

I was analyzing the whole thing, and it looks like a race condition.
Per the buildfarm logs, we have less than 5ms between the moment
the startup process retrieves the history file of TLI 2 from the
archives and the moment the WAL receiver decides to check if this TLI
file exists.  If it does not exist, it would then retrieve it from the
primary via streaming.  So I guess that the sequence of events is
that:
- In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the
existence of the history file for TLI 2, does not find it.
- The startup process retrieves the file from the archives.
- The WAL receiver goes through the internal loop of
WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the
primary's stream.
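
The sequence above is a check-then-act race. The following sketch replays the same interleaving deterministically in a single process; it is not PostgreSQL code, and replay_race(), history_file_exists() and write_file() are hypothetical names used only to make the ordering concrete.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Existence check, as the WAL receiver would do before fetching. */
static bool
history_file_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Write "contents" to "path", truncating any existing file. */
static int
write_file(const char *path, const char *contents)
{
    FILE       *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fputs(contents, f) == EOF)
    {
        fclose(f);
        return -1;
    }
    return fclose(f);
}

/*
 * Replays the three steps in order.  Returns 1 if the final rename
 * replaced a file that appeared after the initial check, 0 if not, and
 * -1 on I/O error.
 */
static int
replay_race(const char *final_path, const char *tmp_path)
{
    bool        seen_at_check;
    bool        present_at_rename;

    /* 1. WAL receiver checks for the history file and does not find it. */
    seen_at_check = history_file_exists(final_path);

    /* 2. Startup process restores the file from the archives. */
    if (write_file(final_path, "from archive\n") != 0)
        return -1;

    /* 3. WAL receiver fetches from the primary and installs its copy. */
    if (write_file(tmp_path, "from primary\n") != 0)
        return -1;
    present_at_rename = history_file_exists(final_path);
    if (rename(tmp_path, final_path) != 0)
        return -1;

    return (!seen_at_check && present_at_rename) ? 1 : 0;
}
```

With rename() the later copy silently wins, while an assertion that the target must not exist would fire at step 3.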

Switching from durable_rename_excl() to durable_rename() would mean
that we'd overwrite the TLI file received from the primary stream over
what's been retrieved from the archives.  That does not strike me as
an issue in itself and that should be safe, so the comment is
misleading, and we can live without the assertion in
writeTimeLineHistoryFile() called by the WAL receiver.  Now, I think
that we'd better keep some belts in writeTimeLineHistory() called by
the startup process at the end-of-recovery as we should never ever have
a TLI file generated when selecting a new timeline.  Perhaps this
should be an elog(ERROR) at least, with a check on the file existence
before calling durable_rename()?

Anyway, my time is constrained next week due to the upcoming Japanese
Golden Week and the buildfarm has to be stable, so I have reverted the
change for now.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote:
> On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote:
>> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
>>> I traced this back a while ago.  I believe the link() was first added in
>>> November 2000 as part of f0e37a8.  This even predates WAL recycling, which
>>> was added in July 2001 as part of 7d4d5c0.
>>
>> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from
>> somewhere out of the ML.. This patch changed XLogFileInit to
>> support using existent files so that XLogWrite can use the new segment
>> provided by checkpoint and still allow XLogWrite to create a new
>> segment by itself.

Yes, I think that you are right here.  I also suspect that the
checkpoint command was facing a concurrency issue while working on
the feature and that Vadim saw that this part of the implementation
would be safer in the long run if we use link() followed by unlink().

> Yeah, I've been unable to find any discussion besides a brief reference to
> adding checkpointing [0].
>
> [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com

While looking at the history of this area, I have also noticed this
argument, which also says that this is a safety measure if this code were
to run in parallel, but that's without counting on the control file
lock held while doing this operation anyway:
https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us

As mentioned already upthread, f0e37a8 is the origin of the
link()/unlink() business in the WAL segment initialization logic, and
also note 1f159e5 that has added a rename() as extra code path for
systems where link() was not working.

At the end, switching directly from durable_rename_excl() to
durable_rename() should be fine for the WAL segment initialization,
but we could do things a bit more carefully by adding a check on the
file existence before calling durable_rename() and issue an elog(LOG)
if a file is found, giving a means for the WAL recycling to give up
peacefully as it does now.  Per my analysis, the TLI history file
created at the end of recovery ought to issue an elog(ERROR).

Now, I am surprised by the third code path of durable_rename_excl(),
as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
any issues, as link() should exit with EEXIST when the startup process
grabs the same history file concurrently.  It seems to me that in this
last case using durable_rename() could be an improvement and prevent
extra WAL receiver restarts as a TLI history fetched from the primary
via streaming or from some archives should be the same, but we could
be more careful, like the WAL init logic, by skipping the
durable_rename() and issuing an elog(LOG).  That would not be perfect,
still a bit better than the current state of HEAD.

As we are getting closer to the beta release, it looks safer to leave
this change aside a bit longer and wait for v16 to be opened for
business on HEAD.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:
> Now, I am surprised by the third code path of durable_rename_excl(),
> as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
> any issues, as link() should exit with EEXIST when the startup process
> grabs the same history file concurrently.  It seems to me that in this
> last case using durable_rename() could be an improvement and prevent
> extra WAL receiver restarts as a TLI history fetched from the primary
> via streaming or from some archives should be the same, but we could
> be more careful, like the WAL init logic, by skipping the
> durable_rename() and issuing an elog(LOG).  That would not be perfect,
> still a bit better than the current state of HEAD.

Skimming through the buildfarm logs, it happens that the tests are
able to see this race from time to time.  Here is one such example on
rorqual:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check

And here are the relevant logs:
2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG:  restored log
file "00000002.history" from archive
2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG:  fetching
timeline history file for timeline 2 from primary server
2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL:  could
not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history":
File exists
[...]
2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG:  started
streaming WAL from primary at 0/3000000 on timeline 2

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after.  Using durable_rename() would not be an issue
here.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:
> At the end, switching directly from durable_rename_excl() to
> durable_rename() should be fine for the WAL segment initialization,
> but we could do things a bit more carefully by adding a check on the
> file existence before calling durable_rename() and issue an elog(LOG)
> if a file is found, giving a means for the WAL recycling to give up
> peacefully as it does now.  Per my analysis, the TLI history file
> created at the end of recovery ought to issue an elog(ERROR).

My only concern with this approach is that it inevitably introduces a race
condition.  In most cases, the file existence check will prevent
overwrites, but it might not always.  Furthermore, we believe that such
overwrites either 1) should not happen (e.g., WAL recycling) or 2) won't
cause problems if they happen (e.g., when the WAL receiver writes the TLI
history file).  Also, these races will be difficult to test, so we won't
know what breaks when they occur.

My instinct is to just let the overwrites happen.  That way, we are more
likely to catch breakage in tests, and we'll have one less race condition
to worry about.  I don't mind asserting that the file doesn't exist when we
don't expect it to, as that might help catch potential problems in
development without affecting behavior in production.  If we do want to
add file existence checks, I think we'd better add a comment about the
potential for race conditions.
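
To make the window concrete, here is a rough sketch of a check-then-rename helper.  It is only an assumption about what such a check could look like, not proposed code: rename_if_absent() and its before_rename hook are hypothetical, and the hook exists solely to simulate another process creating the target between the check and the rename.

```python
import os
import tempfile


def rename_if_absent(src, dst, before_rename=None):
    """Check-then-rename; returns False if dst was seen to exist.

    before_rename is a hypothetical hook that simulates another process
    acting between the existence check and the rename.
    """
    if os.path.exists(dst):
        return False  # analogous to giving up with a log message
    if before_rename is not None:
        before_rename()
    os.replace(src, dst)  # overwrites whatever appeared in the meantime
    return True


def demo_check_then_rename_window():
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "xlogtemp.1234")
        dst = os.path.join(d, "00000002.history")
        with open(src, "w") as f:
            f.write("ours\n")

        def concurrent_creator():
            # Another process installs the target after our check passed.
            with open(dst, "w") as f:
                f.write("theirs\n")

        renamed = rename_if_absent(src, dst, before_rename=concurrent_creator)
        with open(dst) as f:
            return renamed, f.read().strip()
```

The check passes, the other writer gets in, and the rename still overwrites its file, so the check narrows the window without closing it.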

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote:
> Skimming through the buildfarm logs, it happens that the tests are
> able to see this race from time to time.  Here is one such example on
> rorqual:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check
> 
> And here are the relevant logs:
> 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG:  restored log
> file "00000002.history" from archive
> 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG:  fetching
> timeline history file for timeline 2 from primary server
> 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL:  could
> not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history":
> File exists
> [...]
> 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG:  started
> streaming WAL from primary at 0/3000000 on timeline 2
> 
> The WAL receiver upgrades the ERROR to a FATAL, and restarts
> streaming shortly after.  Using durable_rename() would not be an issue
> here.

Thanks for investigating this one.  I think I agree that we should simply
switch to durable_rename() (without a file existence check beforehand).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Mon, May 02, 2022 at 10:39:07AM -0700, Nathan Bossart wrote:
> On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote:
>> The WAL receiver upgrades the ERROR to a FATAL, and restarts
>> streaming shortly after.  Using durable_rename() would not be an issue
>> here.
> 
> Thanks for investigating this one.  I think I agree that we should simply
> switch to durable_rename() (without a file existence check beforehand).

Here is a new patch set.  For now, I've only removed the file existence
check in writeTimeLineHistoryFile().  I don't know if I'm totally convinced
that there isn't a problem here (e.g., due to concurrent .ready file
creation), but since some platforms have been using rename() for some time,
I don't know how worried we should be.  I thought about adding some kind of
locking between the WAL receiver and startup processes, but that seems
excessive.  Alternatively, we could just fix xlog.c as proposed earlier
[0].  AFAICT that is the only caller that can experience problems due to
the multiple-hard-link issue.  All other callers are simply renaming a
temporary file into place, and the temporary file can be discarded if left
behind after a crash.
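
As a generic illustration of "renaming a temporary file into place" (a sketch of the usual POSIX pattern, not the actual durable_rename() implementation; the helper names and the "tmp." prefix are made up):

```python
import os
import tempfile


def fsync_path(path):
    """fsync a file or directory by path (POSIX; directories need this too)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_then_rename(directory, name, data: bytes):
    """Write data to a temporary file, then atomically move it into place."""
    tmp = os.path.join(directory, "tmp." + name)  # hypothetical temp name
    final = os.path.join(directory, name)
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make the contents durable before the rename
    os.replace(tmp, final)  # one name at a time: never two links to the data
    fsync_path(directory)  # persist the directory entry change
    return final
```

If a crash happens before os.replace(), only the temporary name exists and it can be thrown away; after it, only the final name exists.  There is no point at which both names refer to the same file.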

[0] https://postgr.es/m/20220407182954.GA1231544%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Mon, May 02, 2022 at 04:06:13PM -0700, Nathan Bossart wrote:
> Here is a new patch set.  For now, I've only removed the file existence
> check in writeTimeLineHistoryFile().  I don't know if I'm totally convinced
> that there isn't a problem here (e.g., due to concurrent .ready file
> creation), but since some platforms have been using rename() for some time,
> I don't know how worried we should be.

That's only about Windows these days, meaning that there is much less
coverage in this code path.

> I thought about adding some kind of
> locking between the WAL receiver and startup processes, but that seems
> excessive.

Agreed.

> Alternatively, we could just fix xlog.c as proposed earlier
> [0].  AFAICT that is the only caller that can experience problems due to
> the multiple-hard-link issue.  All other callers are simply renaming a
> temporary file into place, and the temporary file can be discarded if left
> behind after a crash.

I'd agree with removing all the callers at the end.  pgrename() is
quite robust on Windows, but I'd keep the two checks in
writeTimeLineHistory(), as the logic around findNewestTimeLine() would
consider a past TLI history file as in-use even if we have a crash
just after the file got created in the same path by the same standby,
and the WAL segment init part.  Your patch does that.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> I'd agree with removing all the callers at the end.  pgrename() is
> quite robust on Windows, but I'd keep the two checks in
> writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> consider a past TLI history file as in-use even if we have a crash
> just after the file got created in the same path by the same standby,
> and the WAL segment init part.  Your patch does that.

As v16 is now open for business, I have revisited this change and
applied 0001 to change all the callers (aka removal of the assertion
for the WAL receiver when it overwrites a TLI history file).  The
commit log includes details about the reasoning of all the areas
changed, for clarity, as of the WAL recycling part, the TLI history
file part and basic_archive.
--
Michael

Attachment

Re: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Tue, Jul 05, 2022 at 10:19:49AM +0900, Michael Paquier wrote:
> On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
>> I'd agree with removing all the callers at the end.  pgrename() is
>> quite robust on Windows, but I'd keep the two checks in
>> writeTimeLineHistory(), as the logic around findNewestTimeLine() would
>> consider a past TLI history file as in-use even if we have a crash
>> just after the file got created in the same path by the same standby,
>> and the WAL segment init part.  Your patch does that.
> 
> As v16 is now open for business, I have revisited this change and
> applied 0001 to change all the callers (aka removal of the assertion
> for the WAL receiver when it overwrites a TLI history file).  The
> commit log includes details about the reasoning of all the areas
> changed, for clarity, as of the WAL recycling part, the TLI history
> file part and basic_archive. 

Thanks!  I wonder if we should add a comment in writeTimeLineHistoryFile()
about possible concurrent use by a WAL receiver and the startup process and
why that is okay.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: avoid multiple hard links to same WAL file after a crash

From
Michael Paquier
Date:
On Tue, Jul 05, 2022 at 09:58:38AM -0700, Nathan Bossart wrote:
> Thanks!  I wonder if we should add a comment in writeTimeLineHistoryFile()
> about possible concurrent use by a WAL receiver and the startup process and
> why that is okay.

Agreed.  Adding an extra note at the top of the routine would help in
the future.
--
Michael

Attachment
Dear team,

We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.

We found this fix [1] that has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and the
potential for irrecoverability exists (even if rare), we would like to
discuss the possibility of back-patching this fix.

Are there any technical reasons not to back-patch this fix to older
major versions?

Thank you for your consideration.

Sincerely,
Robert Pang

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3

On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> > I'd agree with removing all the callers at the end.  pgrename() is
> > quite robust on Windows, but I'd keep the two checks in
> > writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> > consider a past TLI history file as in-use even if we have a crash
> > just after the file got created in the same path by the same standby,
> > and the WAL segment init part.  Your patch does that.
>
> As v16 is now open for business, I have revisited this change and
> applied 0001 to change all the callers (aka removal of the assertion
> for the WAL receiver when it overwrites a TLI history file).  The
> commit log includes details about the reasoning of all the areas
> changed, for clarity, as of the WAL recycling part, the TLI history
> file part and basic_archive.
> --
> Michael



Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> We recently observed a few cases where Postgres running on Linux
> encountered an issue with WAL segment files. Specifically, two WAL
> segments were linked to the same physical file after Postgres ran out
> of memory and the OOM killer terminated one of its processes. This
> resulted in the WAL segments overwriting each other and Postgres
> failing a later recovery.

Yikes!

> We found this fix [1] that has been applied to Postgres 16, but the
> cases we observed were running Postgres 15. Given that older major
> versions will be supported for a good number of years, and the
> potential for irrecoverability exists (even if rare), we would like to
> discuss the possibility of back-patching this fix.

IMHO this is a good time to reevaluate.  It looks like we originally didn't
back-patch out of an abundance of caution, but now that this one has had
time to bake, I think it's worth seriously considering, especially now that
we have a report from the field.

-- 
nathan



Hi,

On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:
> On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> > We recently observed a few cases where Postgres running on Linux
> > encountered an issue with WAL segment files. Specifically, two WAL
> > segments were linked to the same physical file after Postgres ran out
> > of memory and the OOM killer terminated one of its processes. This
> > resulted in the WAL segments overwriting each other and Postgres
> > failing a later recovery.
>
> Yikes!

Indeed.  As chance would have it, I was asked for input on a corrupted server
*today*. Eventually we found that recovery stopped early, after encountering a
segment with a *newer* pageaddr than we expected. Which made me think of this
issue, and indeed, the file recovery stopped at had two links.  Before that
the server had been crashing on a regular basis for unrelated reasons, which
presumably increased the chances sufficiently to eventually hit this problem.
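
For anyone wanting to look for the same symptom, a quick way to spot regular files that share an inode within one directory might look like the sketch below.  The function name is hypothetical and this is not an existing PostgreSQL tool; it only inspects link counts and inode numbers.

```python
import os
import stat
import tempfile
from collections import defaultdict


def find_multiply_linked(directory):
    """Return sorted groups of names in `directory` that share one inode."""
    by_inode = defaultdict(list)
    for entry in os.scandir(directory):
        st = entry.stat(follow_symlinks=False)
        # Only regular files with more than one hard link are interesting.
        if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
            by_inode[(st.st_dev, st.st_ino)].append(entry.name)
    return sorted(sorted(names) for names in by_inode.values())
```

Pointed at a copy of pg_wal, any group it returns is a set of segment names backed by the same physical file.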


It's a normal thing to discover the end of the WAL by finding a segment that
has an older pageaddr than its name suggests. But in this case we saw a newer
page address.  I wonder if we should treat that differently...


> > We found this fix [1] that has been applied to Postgres 16, but the
> > cases we observed were running Postgres 15. Given that older major
> > versions will be supported for a good number of years, and the
> > potential for irrecoverability exists (even if rare), we would like to
> > discuss the possibility of back-patching this fix.
>
> IMHO this is a good time to reevaluate.  It looks like we originally didn't
> back-patch out of an abundance of caution, but now that this one has had
> time to bake, I think it's worth seriously considering, especially now that
> we have a report from the field.

Strongly agreed.

I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message.  The crash has indeed to happen between the link() and
unlink() - but at the end of a checkpoint we do those operations hundreds of
times in a row on a busy server.  And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quotas.
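
A simplified model of the failure, assuming recycling picks the first free future segment name and installs it with link() followed by unlink().  The segment names and the crash flag are made up for the example, and none of the fsync calls are modelled:

```python
import os
import tempfile


def recycle_excl(pg_wal, old, candidates, crash_before_unlink=False):
    """link()+unlink() style recycling into the first free future name."""
    src = os.path.join(pg_wal, old)
    for name in candidates:
        try:
            os.link(src, os.path.join(pg_wal, name))
        except FileExistsError:
            continue  # slot already taken, try the next future segment
        if crash_before_unlink:
            return name  # simulated crash: the old name is left behind
        os.unlink(src)
        return name
    return None


def demo_double_link():
    with tempfile.TemporaryDirectory() as pg_wal:
        with open(os.path.join(pg_wal, "old"), "wb") as f:
            f.write(b"\0" * 16)
        future = ["seg5", "seg6"]
        # The first attempt crashes between link() and unlink().
        first = recycle_excl(pg_wal, "old", future, crash_before_unlink=True)
        # After restart, "old" still looks recyclable and is recycled again.
        second = recycle_excl(pg_wal, "old", future)

        def inode(name):
            return os.stat(os.path.join(pg_wal, name)).st_ino

        return first, second, inode(first) == inode(second)
```

After the second pass, two different future segment names refer to the same inode, so writes to one silently land in the other.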

Greetings,

Andres Freund



Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

From
Nathan Bossart
Date:
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:
> I've been double-checking the code to refresh myself with the problem,
> and I don't see a reason to not apply something like the attached set
> down to v13 for all these remaining branches (minus an edit of the
> commit message).

LGTM

-- 
nathan



On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
> On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:
> > I've been double-checking the code to refresh myself with the problem,
> > and I don't see a reason to not apply something like the attached set
> > down to v13 for all these remaining branches (minus an edit of the
> > commit message).
> 
> LGTM

Ditto.



Dear Michael,

Thank you for applying this back-patch. I also appreciate everyone's input on this issue.

Sincerely,
Robert Pang



On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
>> LGTM
>
> Ditto.

Thanks for double-checking.  Done this one.
--
Michael