Thread: avoid multiple hard links to same WAL file after a crash
Hi hackers, I am splitting this off of a previous thread aimed at reducing archiving overhead [0], as I believe this fix might deserve back-patching. Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at an unfortunate moment can result in two links to the same file. My testing [1] demonstrated that it was possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting. This seems likely to lead to WAL corruption. The attached patch prevents this problem by using durable_rename() instead of durable_rename_excl() for WAL recycling. This removes the protection against accidentally overwriting an existing WAL file, but there shouldn't be one. This patch also sets the stage for reducing archiving overhead (as discussed in the other thread [0]). The proposed change to reduce archiving overhead will make it more likely that the server will attempt to re-archive segments after a crash. This might lead to archive corruption if the server concurrently writes to the same file via the aforementioned bug. [0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13 [1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
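A minimal sketch of the hazard described above, assuming the pre-v16 shape of durable_rename_excl() (the fsync calls and most error handling are omitted, and the function name is altered to flag that this is an illustration rather than the actual fd.c code):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Simplified sketch of durable_rename_excl(): link() the old name to the
     * new one, then unlink() the old name.  A crash between the two calls
     * leaves oldfile and newfile as two hard links to one inode, which is the
     * state the WAL-recycling test in [1] reproduced.
     */
    static int
    durable_rename_excl_sketch(const char *oldfile, const char *newfile)
    {
    #ifdef HAVE_WORKING_LINK
        if (link(oldfile, newfile) < 0)
            return -1;          /* EEXIST here is the "excl" protection */
        /* a crash at this point leaves two links to the same file */
        (void) unlink(oldfile);
    #else
        if (rename(oldfile, newfile) < 0)
            return -1;          /* no overwrite protection on this path */
    #endif
        return 0;
    }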
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Presently, WAL recycling uses durable_rename_excl(), which notes that a > crash at an unfortunate moment can result in two links to the same file. > My testing [1] demonstrated that it was possible to end up with two links > to the same file in pg_wal after a crash just before unlink() during WAL > recycling. Specifically, the test produced links to the same file for the > current WAL file and the next one because the half-recycled WAL file was > re-recycled upon restarting. This seems likely to lead to WAL corruption. Wow, that's bad. > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. I see that durable_rename_excl() has the following comment: "Similar to durable_rename(), except that this routine tries (but does not guarantee) not to overwrite the target file." If those are the desired semantics, we could achieve them more simply and more safely by just trying to stat() the target file and then, if it's not found, call durable_rename(). I think that would be a heck of a lot safer than what this function is doing right now. I'd actually be in favor of nuking durable_rename_excl() from orbit and putting the file-exists tests in the callers. Otherwise, someone might assume that it actually has the semantics that its name suggests, which could be pretty disastrous. If we don't want to do that, then I'd change it to do the stat-then-durable-rename thing internally, so we don't leave hard links lying around in *any* code path. Perhaps that's the right answer for the back-branches in any case, since there could be third-party code calling this function. Your proposed fix is OK if we don't want to do any of that stuff, but personally I'm much more inclined to blame durable_rename_excl() for being horrible than I am to blame the calling code for using it improvidently. -- Robert Haas EDB: http://www.enterprisedb.com
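To make the stat-then-durable-rename idea above concrete, here is a rough sketch; the helper name and its messages are invented for illustration, durable_rename() and the ereport() machinery are the real APIs, and the stat() check is inherently racy (a point that comes up again later in the thread):

    #include <errno.h>
    #include <sys/stat.h>

    /*
     * Hypothetical helper: refuse to clobber an existing target, then fall
     * through to a plain durable_rename().
     */
    static int
    durable_rename_if_absent(const char *oldfile, const char *newfile, int elevel)
    {
        struct stat st;

        if (stat(newfile, &st) == 0)
        {
            ereport(elevel,
                    (errmsg("file \"%s\" already exists, not overwriting", newfile)));
            return -1;
        }
        if (errno != ENOENT)
        {
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not stat file \"%s\": %m", newfile)));
            return -1;
        }

        /* Racy: another process could create newfile right here. */
        return durable_rename(oldfile, newfile, elevel);
    }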
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I see that durable_rename_excl() has the following comment: "Similar > to durable_rename(), except that this routine tries (but does not > guarantee) not to overwrite the target file." If those are the desired > semantics, we could achieve them more simply and more safely by just > trying to stat() the target file and then, if it's not found, call > durable_rename(). I think that would be a heck of a lot safer than > what this function is doing right now. IIUC it actually does guarantee that you won't overwrite the target file when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at all. Using stat() before rename() would therefore weaken this check for systems with working link(), but it'd probably strengthen it for systems without a working link(). > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I think there might be another problem. The man page for rename() seems to indicate that overwriting an existing file also introduces a window where the old and new path are hard links to the same file. This isn't a problem for the WAL files because we should never be overwriting an existing one, but I wonder if it's a problem for other code paths. My guess is that many code paths that overwrite an existing file are first writing changes to a temporary file before atomically replacing the original. Those paths are likely okay, too, as you can usually just discard any existing temporary files. > Your proposed fix is OK if we don't want to do any of that stuff, but > personally I'm much more inclined to blame durable_rename_excl() for > being horrible than I am to blame the calling code for using it > improvidently. I do agree that it's worth examining this stuff a bit closer. I've frequently found myself trying to reason about all the different states that callers of these functions can produce, so any changes that help simplify matters are a win in my book. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
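The temporary-file pattern alluded to above looks roughly like the following; OpenTransientFile(), pg_fsync(), CloseTransientFile() and durable_rename() are the real fd.c interfaces, while the wrapper function and its messages are invented for the example:

    /*
     * Sketch of the write-temp-file-then-rename pattern.  If a crash leaves
     * the *.tmp file behind, it can simply be discarded; the final rename is
     * what publishes the new contents.
     */
    static void
    replace_file_atomically(const char *path, const char *data, size_t len)
    {
        char        tmppath[MAXPGPATH];
        int         fd;

        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

        fd = OpenTransientFile(tmppath, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
        if (fd < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not create file \"%s\": %m", tmppath)));
        if (write(fd, data, len) != (ssize_t) len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write file \"%s\": %m", tmppath)));
        if (pg_fsync(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", tmppath)));
        if (CloseTransientFile(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not close file \"%s\": %m", tmppath)));

        /* durable_rename() fsyncs the target file and its directory. */
        (void) durable_rename(tmppath, path, ERROR);
    }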
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: >> I'd actually be in favor of nuking durable_rename_excl() from orbit >> and putting the file-exists tests in the callers. Otherwise, someone >> might assume that it actually has the semantics that its name >> suggests, which could be pretty disastrous. If we don't want to do >> that, then I'd changing to do the stat-then-durable-rename thing >> internally, so we don't leave hard links lying around in *any* code >> path. Perhaps that's the right answer for the back-branches in any >> case, since there could be third-party code calling this function. > > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. Ha, so there are only a few callers of durable_rename_excl() in the PostgreSQL tree. One is basic_archive.c, which is already doing a stat() check. IIRC I only used durable_rename_excl() here to handle the case where multiple servers are writing archives to the same location. If that happened, the archiver process would begin failing. If a crash left two hard links to the same file around, we will silently succeed the next time around thanks to the compare_files() check. Besides the WAL installation code, the only other callers are in timeline.c, and both note that the use of durable_rename_excl() is for "paranoidly trying to avoid overwriting an existing file (there shouldn't be one)." So AFAICT basic_archive.c is the only caller with a strong reason for using durable_rename_excl(), and even that might not justify keeping it around. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I've attached a patch that simply removes durable_rename_excl() and replaces existing calls with durable_rename(). I noticed that Andres expressed similar misgivings about durable_rename_excl() last year [0] [1]. I can create a stat-then-durable-rename version of this for back-patching if that is still the route we want to go. [0] https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de [1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > > I see that durable_rename_excl() has the following comment: "Similar > > to durable_rename(), except that this routine tries (but does not > > guarantee) not to overwrite the target file." If those are the desired > > semantics, we could achieve them more simply and more safely by just > > trying to stat() the target file and then, if it's not found, call > > durable_rename(). I think that would be a heck of a lot safer than > > what this function is doing right now. > > IIUC it actually does guarantee that you won't overwrite the target file > when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at > all. Using stat() before rename() would therefore weaken this check for > systems with working link(), but it'd probably strengthen it for systems > without a working link(). Sure, but a guarantee that happens on only some systems isn't worth much. And, if it comes at the price of potentially having multiple hard links to the same file in obscure situations, that seems like it could easily cause more problems than this whole scheme can ever hope to solve. > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. I wonder if this is really true. I thought rename() was supposed to be atomic. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > > I think there might be another problem. The man page for rename() seems to > > indicate that overwriting an existing file also introduces a window where > > the old and new path are hard links to the same file. This isn't a problem > > for the WAL files because we should never be overwriting an existing one, > > but I wonder if it's a problem for other code paths. My guess is that many > > code paths that overwrite an existing file are first writing changes to a > > temporary file before atomically replacing the original. Those paths are > > likely okay, too, as you can usually just discard any existing temporary > > files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Looks like it's atomic in that it's not like cp + rm, but not atomic the other way you want. | If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. From another direction, if the new segment was the currently active one, we just mustn't install it. Otherwise we don't care. So, the only thing we need to care about is segment switch. Without it, the segment that InstallXLogFileSegment found by the stat loop is known to be safe to overwrite even if it exists. When segment switch finds an existing file, it's no problem since the segment switch doesn't create a new segment. Otherwise segment switch always calls InstallXLogFileSegment. The section from searching for an empty segment slot until calling durable_rename_excl() is protected by ControlFileLock. Thus if a process is in the section, no other process can switch to a newly-created segment. If this diagnosis is correct, the comment is proved to be paranoid. > * Perform the rename using link if available, paranoidly trying to avoid > * overwriting an existing file (there shouldn't be one). As a result, I think Nathan's fix is correct in that we can safely use durable_rename() instead. And I propose to use renameat2 on Linux so that we can detect the contradicting case in the regression tests, even if only on Linux. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
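For readers following the argument, the section being described can be paraphrased as below; this is an abridged sketch of InstallXLogFileSegment(), not a literal copy, so names and details may differ between branches:

    /*
     * Abridged paraphrase of InstallXLogFileSegment(): both the search for a
     * free segment slot and the rename into place happen while holding
     * ControlFileLock, which is what serializes concurrent installers.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    if (find_free)
    {
        /* probe successive segment numbers until stat() reports ENOENT */
        while (stat(path, &stat_buf) == 0)
        {
            if ((*segno) >= max_segno)
            {
                LWLockRelease(ControlFileLock);
                return false;   /* no free slot within the allowed range */
            }
            (*segno)++;
            XLogFilePath(path, tli, *segno, wal_segment_size);
        }
    }

    /* durable_rename_excl() before the fix, durable_rename() after it */
    if (durable_rename(tmppath, path, LOG) != 0)
    {
        LWLockRelease(ControlFileLock);
        return false;
    }

    LWLockRelease(ControlFileLock);
    return true;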
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > So, the only thing we need to care is segment switch. Without it, the > segment that InstallXLogFileSegment found by the stat loop is known to > be safe to overwrite even if exists. > > When segment switch finds an existing file, it's no problem since the > segment switch doesn't create a new segment. Otherwise segment switch > always calls InstallXLogFileSegment. The section from searching for > an empty segmetn slot until calling durable_rename_excl() is protected > by ControlFileLock. Thus if a process is in the section, no other > process can switch to a newly-created segment. > > If this diagnosis is correct, the comment is proved to be paranoid. It's sometimes difficult to understand what problems really old code comments are worrying about. For example, could they have been worrying about bugs in the code? Could they have been worrying about manual interference with the pg_wal directory? It's hard to know. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> If this diagnosis is correct, the comment is proved to be paranoid. > It's sometimes difficult to understand what problems really old code > comments are worrying about. For example, could they have been > worrying about bugs in the code? Could they have been worrying about > manual interference with the pg_wal directory? It's hard to know. "git blame" can be helpful here, if you trace back to when the comment was written and then try to find the associated mailing-list discussion. (That leap can be difficult for commits pre-dating our current convention of including links in the commit message, but it's usually not *that* hard to locate contemporaneous discussion.) regards, tom lane
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi >> <horikyota.ntt@gmail.com> wrote: >>> If this diagnosis is correct, the comment is proved to be paranoid. > >> It's sometimes difficult to understand what problems really old code >> comments are worrying about. For example, could they have been >> worrying about bugs in the code? Could they have been worrying about >> manual interference with the pg_wal directory? It's hard to know. > > "git blame" can be helpful here, if you trace back to when the comment > was written and then try to find the associated mailing-list discussion. > (That leap can be difficult for commits pre-dating our current > convention of including links in the commit message, but it's usually > not *that* hard to locate contemporaneous discussion.) I traced this back a while ago. I believe the link() was first added in November 2000 as part of f0e37a8. This even predates WAL recycling, which was added in July 2001 as part of 7d4d5c0. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > >> <horikyota.ntt@gmail.com> wrote: > >>> If this diagnosis is correct, the comment is proved to be paranoid. > > > >> It's sometimes difficult to understand what problems really old code > >> comments are worrying about. For example, could they have been > >> worrying about bugs in the code? Could they have been worrying about > >> manual interference with the pg_wal directory? It's hard to know. > > > > "git blame" can be helpful here, if you trace back to when the comment > > was written and then try to find the associated mailing-list discussion. > > (That leap can be difficult for commits pre-dating our current > > convention of including links in the commit message, but it's usually > > not *that* hard to locate contemporaneous discussion.) > > I traced this back a while ago. I believe the link() was first added in > November 2000 as part of f0e37a8. This even predates WAL recycling, which > was added in July 2001 as part of 7d4d5c0. f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from somewhere out of the ML.. This patch changed XLogFileInit to support using existent files so that XLogWrite can use the new segment provided by checkpoint and still allow XLogWrite to create a new segment by itself. Just before the commit, calls to XLogFileInit were protected (or serialized) by logwr_lck. At that commit, calls to the same function were still serialized by ControlFileLockId. I *guess* that Vadim faced/noticed a race condition when he added checkpoint. Thus he introduced the link+remove protocol, but finally it became useless once the call to XLogFileInit was moved within the ControlFileLockId section. But, of course, all of this story is merely a guess. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: > At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >> I traced this back a while ago. I believe the link() was first added in >> November 2000 as part of f0e37a8. This even predates WAL recycling, which >> was added in July 2001 as part of 7d4d5c0. > > f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from > somewhere out of the ML.. This patch changed XLogFileInit to > support using existent files so that XLogWrite can use the new segment > provided by checkpoint and still allow XLogWrite to create a new > segment by itself. Yeah, I've been unable to find any discussion besides a brief reference to adding checkpointing [0]. [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I think there might be another problem. The man page for rename() seems to >> indicate that overwriting an existing file also introduces a window where >> the old and new path are hard links to the same file. This isn't a problem >> for the WAL files because we should never be overwriting an existing one, >> but I wonder if it's a problem for other code paths. My guess is that many >> code paths that overwrite an existing file are first writing changes to a >> temporary file before atomically replacing the original. Those paths are >> likely okay, too, as you can usually just discard any existing temporary >> files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Not always. For example, some old versions of MacOS have a non-atomic implementation of rename(), like prairiedog with 10.4. Even 10.5 does not handle atomicity as far as I recall. In short, it looks like a bad idea to me to rely on this at all. Some FSes have their own way of handling things, as well, but I am not much into this world. Saying that, it would be nice to see durable_rename_excl() gone as it has created quite a bit of pain for us in the past years. -- Michael
On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote: > Saying that, it would be nice to see durable_rename_excl() gone as it > has created quite a bit of pain for us in the past years. Yeah, I think this is the right thing to do. Patch upthread [0]. For back-branches, I suspect we'll want to remove all uses of durable_rename_excl() but leave the function around for any extensions that are using it. Of course, we'd also need a big comment imploring folks not to add any more callers. Another option would be to change the behavior of durable_rename_excl() to something that we think is safer (e.g., stat then rename), but that might just introduce a different set of problems. [0] https://postgr.es/m/20220408194345.GA1541826%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: >> I wonder if this is really true. I thought rename() was supposed to be atomic. > Not always. For example, some old versions of MacOS have a non-atomic > implementation of rename(), like prairiedog with 10.4. Even 10.5 does > not handle atomicity as far as I call. I think that's not talking about the same thing. POSIX requires rename(2) to replace an existing target link atomically: If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other threads throughout the renaming operation and refer either to the file referred to by new or old before the operation began. (It's that requirement that ancient macOS fails to meet.) However, I do not see any text that addresses the question of whether the old link disappears atomically with the appearance of the new link, and it seems like that'd be pretty impractical to ensure in cases like moving a link from one directory to another. (What would it even mean to say that, considering that a thread can't read the two directories at the same instant?) From a crash-safety standpoint, it'd surely be better to make the new link before removing the old, so I imagine that's what most file systems do. regards, tom lane
The readdir interface allows processes to be in the middle of reading a directory, and unless a kernel was happy to either materialize the entire directory list when the readdir starts, or lock the entire directory against modification for the entire time a process has a readdir fd open, it's always going to be possible for a process to have previously read the old directory entry and later see the new directory entry. Kernels don't do any MVCC or cmin type of games so they're not going to be able to prevent it. What's worse of course is that it may only happen in very large directories. Most directories fit on a single block and readdir may buffer up all the entries a block at a time for efficiency. So it may only be visible on very large directories that span multiple blocks.
Here is an attempt at creating something that can be back-patched. 0001 simply replaces calls to durable_rename_excl() with durable_rename() and is intended to be back-patched. 0002 removes the definition of durable_rename_excl() and is _not_ intended for back-patching. I imagine 0002 will need to be held back for v16devel. I think back-patching 0001 will encounter a couple of small obstacles. For example, the call in basic_archive won't exist on most of the back-branches, and durable_rename_excl() was named durable_link_or_rename() before v13. I don't mind producing a patch for each back-branch if needed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote: > Here is an attempt at creating something that can be back-patched. 0001 > simply replaces calls to durable_rename_excl() with durable_rename() and is > intended to be back-patched. 0002 removes the definition of > durable_rename_excl() and is _not_ intended for back-patching. I imagine > 0002 will need to be held back for v16devel. I would not mind applying 0002 on HEAD now to avoid more uses of this API, and I can get behind 0001 after thinking more about it. > I think back-patching 0001 will encounter a couple of small obstacles. For > example, the call in basic_archive won't exist on most of the > back-branches, and durable_rename_excl() was named durable_link_or_rename() > before v13. I don't mind producing a patch for each back-branch if needed. I am not sure that we have any need to backpatch this change based on the unlikeliness of the problem, TBH. One thing that is itching me a bit, like Robert upthread, is that we don't check anymore that the newfile does not exist in the code paths because we never expect one. It is possible to use stat() for that. But access() within a simple assertion would be simpler? Say something like: Assert(access(path, F_OK) != 0 && errno == ENOENT); The case for basic_archive is limited as the comment of the patch states, but that would be helpful for the two calls in timeline.c and the one in xlog.c in the long-term. And this has no need to be part of fd.c; it can be added before the durable_rename() calls. What do you think? -- Michael
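For instance, in an InstallXLogFileSegment()-style caller the check could sit immediately before the rename, along these lines (illustrative placement only, not a committed hunk):

    /*
     * We never expect the target segment to exist at this point, and with
     * assertions disabled this check costs nothing.
     */
    Assert(access(path, F_OK) != 0 && errno == ENOENT);

    if (durable_rename(tmppath, path, LOG) != 0)
    {
        LWLockRelease(ControlFileLock);
        return false;           /* durable_rename() already logged the error */
    }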
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote: > I am not sure that have any need to backpatch this change based on the > unlikeliness of the problem, TBH. One thing that is itching me a bit, > like Robert upthread, is that we don't check anymore that the newfile > does not exist in the code paths because we never expect one. It is > possible to use stat() for that. But access() within a simple > assertion would be simpler? Say something like: > Assert(access(path, F_OK) != 0 && errno == ENOENT); > > The case for basic_archive is limited as the comment of the patch > states, but that would be helpful for the two calls in timeline.c and > the one in xlog.c in the long-term. And this has no need to be part > of fd.c, this can be added before the durable_rename() calls. What do > you think? Here is a new patch set with these assertions added. I think at least the xlog.c change ought to be back-patched. The problem may be unlikely, but AFAICT the possible consequences include WAL corruption. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote: > Here is a new patch set with these assertions added. I think at least the > xlog.c change ought to be back-patched. The problem may be unlikely, but > AFAICT the possible consequences include WAL corruption. Okay, so I have applied this stuff this morning to see what the buildfarm had to say, and we have finished with a set of failures in various buildfarm members: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26 None of them liked the part where we assume that a TLI history file written by a WAL receiver should not exist beforehand, but as 025_stuck_on_old_timeline.pl is showing, a standby may attempt to retrieve a TLI history file after getting it from the archives. I was analyzing the whole thing, and it looks like a race condition. Per the buildfarm logs, we have less than 5ms between the moment the startup process retrieves the history file of TLI 2 from the archives and the moment the WAL receiver decides to check if this TLI file exists. If it does not exist, it would then retrieve it from the primary via streaming. So I guess that the sequence of events is that: - In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the existence of the history file for TLI 2, does not find it. - The startup process retrieves the file from the archives. - The WAL receiver goes through the internal loop of WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the primary's stream. Switching from durable_rename_excl() to durable_rename() would mean that we'd overwrite the TLI file received from the primary stream over what's been retrieved from the archives. That does not strike me as an issue in itself and that should be safe, so the comment is misleading, and we can live without the assertion in writeTimeLineHistoryFile() called by the WAL receiver. Now, I think that we'd better keep some belts in writeTimeLineHistory() called by the startup process at the end-of-recovery, as we should never ever have a TLI file generated when selecting a new timeline. Perhaps this should be an elog(ERROR) at least, with a check on the file existence before calling durable_rename()? Anyway, my time is constrained next week due to the upcoming Japanese Golden Week and the buildfarm has to be stable, so I have reverted the change for now. -- Michael
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote: > On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: >> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >>> I traced this back a while ago. I believe the link() was first added in >>> November 2000 as part of f0e37a8. This even predates WAL recycling, which >>> was added in July 2001 as part of 7d4d5c0. >> >> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from >> somewhere out of the ML.. This patch changed XLogFileInit to >> support using existent files so that XLogWrite can use the new segment >> provided by checkpoint and still allow XLogWrite to create a new >> segment by itself. Yes, I think that you are right here. I also suspect that the checkpoint command was facing a concurrency issue while working on the feature and that Vadim saw that this part of the implementation would be safer in the long run if we use link() followed by unlink(). > Yeah, I've been unable to find any discussion besides a brief reference to > adding checkpointing [0]. > > [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com While looking at the history of this area, I have also noticed this argument, telling also that this is a safety measure if this code were to run in parallel, but that's without counting on the control file lock held while doing this operation anyway: https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us As mentioned already upthread, f0e37a8 is the origin of the link()/unlink() business in the WAL segment initialization logic, and also note 1f159e5 that has added a rename() as extra code path for systems where link() was not working. At the end, switching directly from durable_rename_excl() to durable_rename() should be fine for the WAL segment initialization, but we could do things a bit more carefully by adding a check on the file existence before calling durable_rename() and issue a elog(LOG) if a file is found, giving a mean for the WAL recycling to give up peacefully as it does now. Per my analysis, the TLI history file created at the end of recovery ought to issue an elog(ERROR). Now, I am surprised by the third code path of durable_rename_excl(), as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause any issues, as link() should exit with EEXIST when the startup process grabs the same history file concurrently. It seems to me that in this last case using durable_rename() could be an improvement and prevent extra WAL receiver restarts as a TLI history fetched from the primary via streaming or from some archives should be the same, but we could be more careful, like the WAL init logic, by skipping the durable_rename() and issuing an elog(LOG). That would not be perfect, still a bit better than the current state of HEAD. As we are getting closer to the beta release, it looks safer to set this change aside a bit longer and wait for v16 to be opened for business on HEAD. -- Michael
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > Now, I am surprised by the third code path of durable_rename_excl(), > as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause > any issues, as link() should exit with EEXIST when the startup process > grabs the same history file concurrently. It seems to me that in this > last case using durable_rename() could be an improvement and prevent > extra WAL receiver restarts as a TLI history fetched from the primary > via streaming or from some archives should be the same, but we could > be more careful, like the WAL init logic, by skipping the > durable_rename() and issuing an elog(LOG). That would not be perfect, > still a bit better than the current state of HEAD. Skimming through at the buildfarm logs, it happens that the tests are able to see this race from time to time. Here is one such example on rorqual: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check And here are the relevant logs: 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log file "00000002.history" from archive 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching timeline history file for timeline 2 from primary server 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": File exists [...] 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started streaming WAL from primary at 0/3000000 on timeline 2 The WAL receiver upgrades the ERROR to a FATAL, and restarts streaming shortly after. Using durable_rename() would not be an issue here. -- Michael
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > At the end, switching directly from durable_rename_excl() to > durable_rename() should be fine for the WAL segment initialization, > but we could do things a bit more carefully by adding a check on the > file existence before calling durable_rename() and issue a elog(LOG) > if a file is found, giving a mean for the WAL recycling to give up > peacefully as it does now. Per my analysis, the TLI history file > created at the end of recovery ought to issue an elog(ERROR). My only concern with this approach is that it inevitably introduces a race condition. In most cases, the file existence check will prevent overwrites, but it might not always. Furthermore, we believe that such overwrites either 1) should not happen (e.g., WAL recycling) or 2) won't cause problems if they happen (e.g., when the WAL receiver writes the TLI history file). Also, these races will be difficult to test, so we won't know what breaks when they occur. My instinct is to just let the overwrites happen. That way, we are more likely to catch breakage in tests, and we'll have one less race condition to worry about. I don't mind asserting that the file doesn't exist when we don't expect it to, as that might help catch potential problems in development without affecting behavior in production. If we do want to add file existence checks, I think we'd better add a comment about the potential for race conditions. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: > Skimming through at the buildfarm logs, it happens that the tests are > able to see this race from time to time. Here is one such example on > rorqual: > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check > > And here are the relevant logs: > 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log > file "00000002.history" from archive > 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching > timeline history file for timeline 2 from primary server > 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could > not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": > File exists > [...] > 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started > streaming WAL from primary at 0/3000000 on timeline 2 > > The WAL receiver upgrades the ERROR to a FATAL, and restarts > streaming shortly after. Using durable_rename() would not be an issue > here. Thanks for investigating this one. I think I agree that we should simply switch to durable_rename() (without a file existence check beforehand). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 10:39:07AM -0700, Nathan Bossart wrote: > On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: >> The WAL receiver upgrades the ERROR to a FATAL, and restarts >> streaming shortly after. Using durable_rename() would not be an issue >> here. > > Thanks for investigating this one. I think I agree that we should simply > switch to durable_rename() (without a file existence check beforehand). Here is a new patch set. For now, I've only removed the file existence check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced that there isn't a problem here (e.g., due to concurrent .ready file creation), but since some platforms have been using rename() for some time, I don't know how worried we should be. I thought about adding some kind of locking between the WAL receiver and startup processes, but that seems excessive. Alternatively, we could just fix xlog.c as proposed earlier [0]. AFAICT that is the only caller that can experience problems due to the multiple-hard-link issue. All other callers are simply renaming a temporary file into place, and the temporary file can be discarded if left behind after a crash. [0] https://postgr.es/m/20220407182954.GA1231544%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 04:06:13PM -0700, Nathan Bossart wrote: > Here is a new patch set. For now, I've only removed the file existence > check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced > that there isn't a problem here (e.g., due to concurrent .ready file > creation), but since some platforms have been using rename() for some time, > I don't know how worried we should be. That's only about Windows these days, meaning that there is much less coverage in this code path. > I thought about adding some kind of > locking between the WAL receiver and startup processes, but that seems > excessive. Agreed. > Alternatively, we could just fix xlog.c as proposed earlier > [0]. AFAICT that is the only caller that can experience problems due to > the multiple-hard-link issue. All other callers are simply renaming a > temporary file into place, and the temporary file can be discarded if left > behind after a crash. I'd agree with removing all the callers at the end. pgrename() is quite robust on Windows, but I'd keep the two checks: the one in writeTimeLineHistory(), as the logic around findNewestTimeLine() would consider a past TLI history file as in-use even if we have a crash just after the file got created in the same path by the same standby, and the one in the WAL segment init part. Your patch does that. -- Michael
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > I'd agree with removing all the callers at the end. pgrename() is > quite robust on Windows, but I'd keep the two checks in > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > consider a past TLI history file as in-use even if we have a crash > just after the file got created in the same path by the same standby, > and the WAL segment init part. Your patch does that. As v16 is now open for business, I have revisited this change and applied 0001 to change all the callers (aka removal of the assertion for the WAL receiver when it overwrites a TLI history file). The commit log includes details about the reasoning of all the areas changed, for clarity, as of the WAL recycling part, the TLI history file part and basic_archive. -- Michael
On Tue, Jul 05, 2022 at 10:19:49AM +0900, Michael Paquier wrote: > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: >> I'd agree with removing all the callers at the end. pgrename() is >> quite robust on Windows, but I'd keep the two checks in >> writeTimeLineHistory(), as the logic around findNewestTimeLine() would >> consider a past TLI history file as in-use even if we have a crash >> just after the file got created in the same path by the same standby, >> and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() about possible concurrent use by a WAL receiver and the startup process and why that is okay. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jul 05, 2022 at 09:58:38AM -0700, Nathan Bossart wrote: > Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() > about possible concurrent use by a WAL receiver and the startup process and > why that is okay. Agreed. Adding an extra note at the top of the routine would help in the future. -- Michael
Dear team, We recently observed a few cases where Postgres running on Linux encountered an issue with WAL segment files. Specifically, two WAL segments were linked to the same physical file after Postgres ran out of memory and the OOM killer terminated one of its processes. This resulted in the WAL segments overwriting each other and Postgres failing a later recovery. We found this fix [1] that has been applied to Postgres 16, but the cases we observed were running Postgres 15. Given that older major versions will be supported for a good number of years, and the potential for irrecoverability exists (even if rare), we would like to discuss the possibility of back-patching this fix. Are there any technical reasons not to back-patch this fix to older major versions? Thank you for your consideration. Sincerely, Robert Pang [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3 On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > > I'd agree with removing all the callers at the end. pgrename() is > > quite robust on Windows, but I'd keep the two checks in > > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > > consider a past TLI history file as in-use even if we have a crash > > just after the file got created in the same path by the same standby, > > and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. > -- > Michael
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > We recently observed a few cases where Postgres running on Linux > encountered an issue with WAL segment files. Specifically, two WAL > segments were linked to the same physical file after Postgres ran out > of memory and the OOM killer terminated one of its processes. This > resulted in the WAL segments overwriting each other and Postgres > failing a later recovery. Yikes! > We found this fix [1] that has been applied to Postgres 16, but the > cases we observed were running Postgres 15. Given that older major > versions will be supported for a good number of years, and the > potential for irrecoverability exists (even if rare), we would like to > discuss the possibility of back-patching this fix. IMHO this is a good time to reevaluate. It looks like we originally didn't back-patch out of an abundance of caution, but now that this one has had time to bake, I think it's worth seriously considering, especially now that we have a report from the field. -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Andres Freund
Hi, On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote: > On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > > We recently observed a few cases where Postgres running on Linux > > encountered an issue with WAL segment files. Specifically, two WAL > > segments were linked to the same physical file after Postgres ran out > > of memory and the OOM killer terminated one of its processes. This > > resulted in the WAL segments overwriting each other and Postgres > > failing a later recovery. > > Yikes! Indeed. As chance would have it, I was asked for input on a corrupted server *today*. Eventually we found that recovery stopped early, after encountering a segment with a *newer* pageaddr than we expected. Which made me think of this issue, and indeed, the file recovery stopped at had two links. Before that the server had been crashing on a regular basis for unrelated reasons, which presumably increased the chances sufficiently to eventually hit this problem. It's a normal thing to discover the end of the WAL by finding a segment that has an older pageaddr than its name suggests. But in this case we saw a newer page address. I wonder if we should treat that differently... > > We found this fix [1] that has been applied to Postgres 16, but the > > cases we observed were running Postgres 15. Given that older major > > versions will be supported for a good number of years, and the > > potential for irrecoverability exists (even if rare), we would like to > > discuss the possibility of back-patching this fix. > > IMHO this is a good time to reevaluate. It looks like we originally didn't > back-patch out of an abundance of caution, but now that this one has had > time to bake, I think it's worth seriously considering, especially now that > we have a report from the field. Strongly agreed. I don't think the issue is actually quite as unlikely to be hit as reasoned in the commit message. The crash does indeed have to happen between the link() and unlink() - but at the end of a checkpoint we do that operation hundreds of times in a row on a busy server. And that's just after potentially doing lots of write IO during a checkpoint, filling up drive write caches / eating up IOPS/bandwidth disk quotas. Greetings, Andres Freund
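The check in question lives in the xlogreader page-header validation; a rough sketch of the distinction being suggested might look like the following, with the surrounding variable names and the second message text invented purely for illustration (this is not actual PostgreSQL code):

    /*
     * Sketch only: distinguish an older-than-expected pageaddr (the normal
     * end-of-WAL case) from a newer one (a hint the segment may have been
     * clobbered, e.g. via a stray hard link left behind by this bug).
     */
    if (hdr->xlp_pageaddr != recaddr)
    {
        if (hdr->xlp_pageaddr < recaddr)
            report_invalid_record(state,
                                  "unexpected pageaddr %X/%X in log segment %s, offset %u",
                                  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                                  fname, offset);
        else
            report_invalid_record(state,
                                  "pageaddr %X/%X is newer than expected in log segment %s, offset %u",
                                  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                                  fname, offset);
        return false;
    }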
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > I've been double-checking the code to refresh myself with the problem, > and I don't see a reason to not apply something like the attached set > down to v13 for all these remaining branches (minus an edit of the > commit message). LGTM -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Andres Freund
On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote: > On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > > I've been double-checking the code to refresh myself with the problem, > > and I don't see a reason to not apply something like the attached set > > down to v13 for all these remaining branches (minus an edit of the > > commit message). > > LGTM Dito.
Dear Michael,
Thank you for applying this back-patch. I also appreciate everyone's input on this issue.
Sincerely,
Robert Pang
On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> > On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
> >> LGTM
> >
> > Dito.
>
> Thanks for double-checking. Done this one.
> --
> Michael
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote: > > I don't think the issue is actually quite as unlikely to be hit as reasoned in > > the commit message. The crash has indeed to happen between the link() and > > unlink() - but at the end of a checkpoint we do that operations hundreds of > > times in a row on a busy server. And that's just after potentially doing lots > > of write IO during a checkpoint, filling up drive write caches / eating up > > IOPS/bandwidth disk quots. > > Looks so, yep. Your timing and the report's timing are interesting. > > I've been double-checking the code to refresh myself with the problem, > and I don't see a reason to not apply something like the attached set > down to v13 for all these remaining branches (minus an edit of the > commit message). This became v14 commit 1f95181b44c843729caaa688f74babe9403b5850. I think it is causing timing-dependent failures in archive recovery, at restartpoints. v14 and earlier were relying on the "excl" part of durable_rename_excl() to keep WAL preallocation and recycling from stomping on archive-restored WAL files. If that's right, we need to either back-patch v15's suppression of those features during archive recovery (commit cc2c7d6 and five others), or we need to revert and fix the multiple hard links differently. v15 commit cc2c7d6 "Skip WAL recycling and preallocation during archive recovery" wrote that it fixed "a spurious durable_rename_excl() LOG message". Replacing durable_rename_excl() with durable_rename() increased consequences beyond spurious messages. I regret that I didn't make that connection during post-commit review of commit 1f95181b44c843729caaa688f74babe9403b5850. Trouble emerges during archive recovery, not during streaming or crash recovery. The symptom is "invalid magic number 0000 in log segment X, offset 0" when PreallocXlogFiles() stomps. The symptom is "unexpected pageaddr X in log segment Y, offset 0" [X < Y] when RemoveOldXlogFiles() recycling stomps. wal_recycle=off probably avoids the latter but not the former. Options I see: 1. Make v14 and v13 skip WAL recycling and preallocation during archive recovery, like newer branches do. I think that means back-patching the six commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef. 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart. Avoid multiple hard links by making durable_rename_excl() first rename oldfile to a temporary name. (I haven't thought this through in detail. It may not suffice.) I'm leaning toward (1) at least enough to see how messy the back-patch would be, since I don't like risks of designing old-branch-specific solutions when v15/v16/v17 have a proven solution. What else should we be thinking about before deciding? Arun Thirupathi collected the symptoms, traced them to commit 1f95181b44c843729caaa688f74babe9403b5850, and reported the diagnosis to me. These symptoms have not emerged in v15+. Thanks, nm
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote: > Options I see: > > 1. Make v14 and v13 skip WAL recycling and preallocation during archive > recovery, like newer branches do. I think that means back-patching the six > commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef. > > 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart. > Avoid multiple hard links by making durable_rename_excl() first rename > oldfile to a temporary name. (I haven't thought this through in detail. > It may not suffice.) > > I'm leaning toward (1) at least enough to see how messy the back-patch would > be, since I don't like risks of designing old-branch-specific solutions when > v15/v16/v17 have a proven solution. What else should we be thinking about > before deciding? I agree with first attempting a back-patch. All of this stuff is from 2021-2022, so it's had time to bake and is IMHO lower risk. -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Michael Paquier
On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:
> On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:
>> 1. Make v14 and v13 skip WAL recycling and preallocation during archive
>> recovery, like newer branches do. I think that means back-patching the six
>> commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.
>>
>> 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
>> Avoid multiple hard links by making durable_rename_excl() first rename
>> oldfile to a temporary name. (I haven't thought this through in detail.
>> It may not suffice.)
>>
>> I'm leaning toward (1) at least enough to see how messy the back-patch would
>> be, since I don't like risks of designing old-branch-specific solutions when
>> v15/v16/v17 have a proven solution. What else should we be thinking about
>> before deciding?
>
> I agree with first attempting a back-patch. All of this stuff is from
> 2021-2022, so it's had time to bake and is IMHO lower risk.

Sorry for the late reply here (life things and the timing of Noah's
report).

Discarding the option where we would design a new solution specific to
v14 and v13, I don't really see a third option here. Choice 2 means
that we'd just leave v13 and v14 exposed to corruption, and that the
message to our user base would be that they should upgrade to v15 to
get that fixed, even though these branches still have roughly two more
years of official community support, during which users would remain
exposed to the risk of multiple links pointing to a single WAL segment,
creating corruption. As reported upthread, that state can be reached.

Choice 1 would be the logical path, even if it may not be the practical
one in terms of back-patch risks vs benefits. It would be much better
than telling people that they have to upgrade to at least v15 to make
sure that their version of Postgres is safe to use.

Saying that, I've looked at the whole thing for a good part of the day.
One thing that may make a back-patch messier is that Robert has
improved and refactored the recovery area quite a bit in 15~, so the
code relies on slightly different assumptions in 13/14 compared to 15~.
There is a piece around ThisTimeLineID, for example, which is something
that InstallXLogFileSegment() uses and also something that's touched by
1f95181b44c8. The impact of these pieces of refactoring specific to v15
is surprisingly lower than I thought it would be.

Some notes about the six commits you are mentioning, looking at them
one by one on the 13 and 14 stable branches.

First, for v14:
cc2c7d6~4: some bumps with the removal of use_lock with one flag.
Looks OK otherwise.
cc2c7d6~3: Looks OK. Based on what this addresses, this is low-risk.
cc2c7d6~2: XLogWalRcvWrite() has a conflict as an effect of
XLogWalRcvWrite() being able to be given a "tli". It still seems
correct to me to use "recvFileTLI = ThisTimeLineID".
cc2c7d6~1: 8ad6c5dbbe5a needs to be reverted first. Commit 8ad6c5dbbe5a
was introduced as a workaround for what we would back-patch here. One
conflict in walreceiver.c.
cc2c7d6: Does not conflict, which was a bit surprising, TBH.
043_no_contrecord_switch.pl gets unhappy here, on an assertion failure
in WaitForWALToBecomeAvailable() about InstallXLogFileSegmentActive.
e36cbef: Previous failure fixed by this one, with the same symptoms as
reported in the original thread that led to this commit.

Then, for v13:
cc2c7d6~4: One conflict with InstallXLogFileSegment() in xlog.c,
nothing huge.
cc2c7d6~3: Looks OK.
cc2c7d6~2: Looks OK.
cc2c7d6~1: With 8ad6c5dbbe5a reverted, same as the v14 part.
cc2c7d6: Does not conflict. 043_no_contrecord_switch.pl gets unhappy
here, same as above.
e36cbef: Same as above.

Noah, all these improvements are something you have directly worked on
in 15~. I'm hoping not to have missed something, even if I've looked at
the paths manipulating the WAL segments changed here on v13 and v14.
Could it be possible for you to double-check and also run more tests if
your environments help? Perhaps this needs specific prerequisites for
recovery? We need more confidence in the solution here: even if it's
proven to work for v15, we may still have gaps specific to v14 and/or
v13.

I am attaching a full set of patches for v14 and v13 that can be used
for these tests. WDYT?
--
Michael
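For anyone wanting to stage the same sequence locally, a rough sketch
of the v14 pass follows; the hashes are the ones cited above (their
back-patched equivalents on the stable branch may differ), and the
per-commit conflict resolution described in the notes is left out:

# Sketch only: stage the back-patch on a local REL_14_STABLE branch.
git checkout REL_14_STABLE
# cc2c7d6~4 .. cc2c7d6~2 first; minor conflicts as noted above.
git cherry-pick -x cc2c7d6~4 cc2c7d6~3 cc2c7d6~2
# 8ad6c5dbbe5a was a workaround for the problem being back-patched,
# so it goes away before the remaining commits.
git revert 8ad6c5dbbe5a
git cherry-pick -x cc2c7d6~1 cc2c7d6 e36cbef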
Attachment
- v14-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patch
- v14-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patch
- v14-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patch
- v14-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patch
- v14-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patch
- v14-0006-Skip-WAL-recycling-and-preallocation-during-arch.patch
- v14-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patch
- v13-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patch
- v13-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patch
- v13-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patch
- v13-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patch
- v13-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patch
- v13-0006-Skip-WAL-recycling-and-preallocation-during-arch.patch
- v13-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patch
- signature.asc
On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:
> I am attaching a full set of patches for v14 and v13 that can be used
> for these tests. WDYT?

And I forgot to mention that I've checked both v14 and v13, where I
have noticed the two patterns "invalid magic number 0000 in log
segment X, offset 0" and "unexpected pageaddr X in log segment Y" as
LOG and DEBUG entries. These are gone once the patches are applied on
both branches.

Just set max_wal_size and min_wal_size (128MB here) low enough with a
standby doing archive recovery, and bulk-load WAL on the primary to
accelerate the reproduction of the problem. It takes a couple of
minutes for these to show up with the unpatched branches.
--
Michael
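Spelled out, that reproduction setup might look roughly like the
following; the ports, archive location, and pgbench workload are
illustrative, and the log patterns above show up only intermittently,
so expect to let the load run for a few minutes:

initdb -D primary
mkdir archive
cat >> primary/postgresql.conf <<EOF
archive_mode = on
archive_command = 'cp "%p" "$PWD/archive/%f"'
min_wal_size = 128MB
max_wal_size = 128MB
EOF
pg_ctl -D primary -o '-p 5432' -l primary.log start

# Standby doing archive recovery from the same archive directory.
pg_basebackup -d 'port=5432' -D standby
cat >> standby/postgresql.conf <<EOF
restore_command = 'cp "$PWD/archive/%f" "%p"'
EOF
touch standby/standby.signal
pg_ctl -D standby -o '-p 5433' -l standby.log start

# Bulk-load WAL on the primary, then watch standby.log for the
# "invalid magic number" / "unexpected pageaddr" entries.
pgbench -i -s 100 -p 5432 postgres
pgbench -T 300 -c 4 -p 5432 postgres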
On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:
> On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:
> > On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:
> >> 1. Make v14 and v13 skip WAL recycling and preallocation during archive
> >> recovery, like newer branches do. I think that means back-patching the six
> >> commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.
> >>
> >> 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
> >> Avoid multiple hard links by making durable_rename_excl() first rename
> >> oldfile to a temporary name. (I haven't thought this through in detail.
> >> It may not suffice.)
> >>
> >> I'm leaning toward (1) at least enough to see how messy the back-patch would
> >> be, since I don't like risks of designing old-branch-specific solutions when
> >> v15/v16/v17 have a proven solution. What else should we be thinking about
> >> before deciding?
> >
> > I agree with first attempting a back-patch. All of this stuff is from
> > 2021-2022, so it's had time to bake and is IMHO lower risk.
>
> Could it be possible for you to double-check and also run
> more tests if your environments help?

Thanks for crafting back-branch versions. I've queued a task to confirm
I get the same result. There's a test case I'll polish, too.
On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> Thanks for crafting back-branch versions. I've queued a task to confirm I get
> the same result.

Thanks for that. That helps a lot.

> There's a test case I'll polish, too.

Are you considering the addition of a TAP test in 17~ based on a wait
injection point in the checkpointer coupled with a check of the server
logs to see if we see the error patterns you've spotted?
--
Michael
On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:
> On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> > Thanks for crafting back-branch versions. I've queued a task to confirm I get
> > the same result.
>
> Thanks for that. That helps a lot.

I'll let you know when I get there.

> > There's a test case I'll polish, too.
>
> Are you considering the addition of a TAP test in 17~ based on a wait
> injection point in the checkpointer coupled with a check of the server
> logs to see if we see the error patterns you've spotted?

No, nothing involving injection points or otherwise version-specific.
On Tue, Mar 11, 2025 at 06:23:15PM -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:
> > On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> > > Thanks for crafting back-branch versions. I've queued a task to confirm I get
> > > the same result.
> >
> > Thanks for that. That helps a lot.
>
> I'll let you know when I get there.

Your back-patches are correct. Thanks.

> > > There's a test case I'll polish, too.
> >
> > Are you considering the addition of a TAP test in 17~ based on a wait
> > injection point in the checkpointer coupled with a check of the server
> > logs to see if we see the error patterns you've spotted?
>
> No, nothing involving injection points or otherwise version-specific.

Here it is. Making it fail three times took looping 1383s, 5841s, and
2594s. Hence, it couldn't be expected to catch the regression before
commit, but it would have made sufficient buildfarm and CI noise in the
day after commit.
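To repeat that kind of looping, something along these lines works once
the attached test file is installed in the recovery suite; the
045_archive_restartpoint.pl name is the one the test ended up with
later in this thread, so adjust it if the attachment uses a different
name:

cd src/test/recovery
i=0
# Re-run the single test until it fails, counting successful runs.
while make check PROVE_TESTS=t/045_archive_restartpoint.pl; do
    i=$((i + 1))
    echo "passed $i runs so far"
done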
On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
> Your back-patches are correct. Thanks.

Thanks for double-checking. I'll move on with what I have after a
second look, as it's been a few weeks since I've looked at all these
conflicts. I am also planning to add a few notes in the commits to
mention this thread.

> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
> Hence, it couldn't be expected to catch the regression before commit, but it
> would have made sufficient buildfarm and CI noise in the day after commit.

Hmm. Not much of a fan of the addition of a test that has less than
1% of reproducibility for the problem, even if it's good to see that
this can be made portable to run down to v13.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
>> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
>> Hence, it couldn't be expected to catch the regression before commit, but it
>> would have made sufficient buildfarm and CI noise in the day after commit.

> Hmm. Not much of a fan of the addition of a test that has less than
> 1% of reproducibility for the problem, even if it's good to see that
> this can be made portable to run down to v13.

Yeah, it's good to have a test but I doubt we should commit it.
Too many buildfarm cycles will be expended for too little result.

			regards, tom lane
On Sat, Apr 05, 2025 at 11:07:13AM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
> >> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
> >> Hence, it couldn't be expected to catch the regression before commit, but it
> >> would have made sufficient buildfarm and CI noise in the day after commit.
>
> > Hmm. Not much of a fan of the addition of a test that has less than
> > 1% of reproducibility for the problem, even if it's good to see that
> > this can be made portable to run down to v13.
>
> Yeah, it's good to have a test but I doubt we should commit it.
> Too many buildfarm cycles will be expended for too little result.

Current extent of our archive recovery restartpoint test coverage:

$ grep -c 'restartpoint starting' $(grep -rl 'restored log file' **/log) | grep -v :0
src/bin/pg_combinebackup/tmp_check/log/002_compare_backups_pitr1.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby2.log:1
src/test/recovery/tmp_check/log/002_archiving_standby.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby.log:1
src/test/recovery/tmp_check/log/035_standby_logical_decoding_standby.log:2

Since the 2025-02 releases made non-toy-size archive recoveries fail
easily, that's not enough. If the proposed 3-second test is the wrong
thing, what instead?
On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> that's not enough. If the proposed 3-second test is the wrong thing, what
> instead?

I don't have a good idea about that in ~16, TBH, but I am sure to not
be a fan of the low reproducibility rate of this test as proposed.

It's not perfect, but as the design to fix the original race condition
has been introduced in v15, why not begin with a test in 17~ using
some injection points? This should be good enough while having a good
reproduction rate, as the order of the actions in the restartpoints
would be controlled.
--
Michael
On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:
> On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> > Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> > that's not enough. If the proposed 3-second test is the wrong thing, what
> > instead?
>
> I don't have a good idea about that in ~16, TBH, but I am sure to not
> be a fan of the low reproducibility rate of this test as proposed.
> It's not perfect, but as the design to fix the original race condition
> has been introduced in v15, why not begin with a test in 17~ using
> some injection points?

Two reasons:

a) The fix ended calls to the whole range of relevant code. Hence, the
injection point placement that would have been relevant before the fix
isn't reached. In other words, there's no right place for the injection
point. (The place for the injection point would be in durable_rename(),
in the checkpointer. After the fix, the checkpointer just doesn't call
durable_rename().)

b) Stochastic tests catch defects beyond the specific one the test
author targeted. An injection point test is less likely to do that.
(That said, with reason (a), there's no known injection point test
design to compete with the stochastic design.)
On Sat, Apr 05, 2025 at 07:14:27PM +0900, Michael Paquier wrote:
> On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
>> Your back-patches are correct. Thanks.
>
> Thanks for double-checking. I'll move on with what I have after a
> second look as it's been a few weeks since I've looked at all these
> conflicts. I am also planning to add a few notes in the commits to
> mention this thread.

And applied this series on REL_14_STABLE and REL_13_STABLE, editing all
the commits to mention the thread they were previously linked to.
--
Michael
On Sat, Apr 05, 2025 at 07:09:58PM -0700, Noah Misch wrote:
> On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:
> > On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> > > Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> > > that's not enough. If the proposed 3-second test is the wrong thing, what
> > > instead?
> >
> > I don't have a good idea about that in ~16, TBH, but I am sure to not
> > be a fan of the low reproducibility rate of this test as proposed.
> > It's not perfect, but as the design to fix the original race condition
> > has been introduced in v15, why not begin with a test in 17~ using
> > some injection points?
>
> Two reasons:
>
> a) The fix ended calls to the whole range of relevant code. Hence, the
> injection point placement that would have been relevant before the fix
> isn't reached. In other words, there's no right place for the injection
> point. (The place for the injection point would be in durable_rename(), in
> the checkpointer. After the fix, the checkpointer just doesn't call
> durable_rename().)
>
> b) Stochastic tests catch defects beyond the specific one the test author
> targeted. An injection point test is less likely to do that. (That said,
> with reason (a), there's no known injection point test design to compete
> with the stochastic design.)

Tom and Michael, do you still object to the test addition, or not? If
there are no new or renewed objections by 2025-04-20, I'll proceed to
add the test.

As another data point, raising the runtime from 3s to 17s makes it
reproduce the problem 25% of the time. You can imagine a plot with axes
of runtime and percent detection; one can pick any point on that plot's
curve. Given how little wall time it takes for the buildfarm and CI to
reach a few hundred runs, I like the trade-off of 3s runtime and 1%
detection. In particular, I like it better than 17s runtime for 25%
detection. How do you see it?
Noah Misch <noah@leadboat.com> writes:
> Tom and Michael, do you still object to the test addition, or not? If there
> are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

While I still don't love it, I don't have a better proposal.

			regards, tom lane
On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Tom and Michael, do you still object to the test addition, or not? If there
> > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> While I still don't love it, I don't have a better proposal.

No objections from here to using your proposal for the time being on
all the branches. I am planning to spend some cycles thinking about a
better alternative, but I doubt that this will be portable down to v13.
--
Michael
On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > Tom and Michael, do you still object to the test addition, or not? If there
> > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20%
done (based on file name 00000001000000000000006D) at the 360s
PGCTLTIMEOUT. I can reproduce this if I test with valgrind
--trace-children=yes. With my normal valgrind settings, the whole test
file takes only 18s. I recommend one of these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't
  instrument "sh" and "cp" commands.
- Remove --trace-children=yes

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving
catches up. However, that would take skink about 30min, and I expect
little value from 30min of valgrind instrumenting the "cp" command.
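As an illustration of the first option, an animal that launches the
server under valgrind through a wrapper script could add the skip
pattern there; the wrapper layout, paths, and the extra flags below are
assumptions rather than skink's actual configuration:

#!/bin/sh
# Wrapper installed in place of the postgres binary: run the real
# binary under valgrind, but skip instrumenting children exec'd from
# /bin or /usr/bin (the "sh" and "cp" processes run by archive_command).
exec valgrind --quiet --error-exitcode=55 \
    --suppressions=/path/to/postgresql/src/tools/valgrind.supp \
    --trace-children=yes \
    --trace-children-skip='/bin/*,/usr/bin/*' \
    /path/to/install/bin/postgres.orig "$@"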
On Sun, Apr 20, 2025 at 02:53:39PM -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not? If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> Pushed as commit 714bd9e. The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
>
> pg_ctl: server does not shut down
>
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
>
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT. I can
> reproduce this if I test with valgrind --trace-children=yes. With my normal
> valgrind settings, the whole test file takes only 18s. I recommend one of
> these changes to skink:
>
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

I gave that more thought. One can be more surgical than that, via
--trace-children-skip-by-arg='*cp "*' or similar. My previous message's
two options stop valgrind instrumentation at boundaries like pg_dumpall
calling system(pg_dump ...), since that execs /bin/sh to run pg_dump.
If we wanted to make it even more explicit and surgical, skink could
use --trace-children-skip-by-arg='*valgrind-ignore-child*' combined
with:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1404 +1404 @@ sub enable_restoring
-	  : qq{cp "$path/%f" "%p"};
+	  : qq{cp "$path/%f" "%p" # valgrind-ignore-child};
@@ -1474 +1474 @@ sub enable_archiving
-	  : qq{cp "%p" "$path/%f"};
+	  : qq{cp "%p" "$path/%f" # valgrind-ignore-child};

What's your preference?

> Andres, what do you think about making one of those skink configuration
> changes? Alternatively, I could make the test poll until archiving catches
> up. However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.
Hi,

On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not? If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> Pushed as commit 714bd9e. The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
>
> pg_ctl: server does not shut down
>
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
>
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.

Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
example for why the single-threaded archiving approach sucks so badly :)

> I can reproduce this if I test with valgrind --trace-children=yes. With my
> normal valgrind settings, the whole test file takes only 18s. I recommend
> one of these changes to skink:
>
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

Hm. I think I used --trace-children=yes because I was thinking it was
required to track forks. But a newer version of valgrind's man page has
an important clarification:

  --trace-children=<yes|no> [default: no]
    When enabled, Valgrind will trace into sub-processes initiated via
    the exec system call. This is necessary for multi-process programs.

    Note that Valgrind does trace into the child of a fork (it would be
    difficult not to, since fork makes an identical copy of a process),
    so this option is arguably badly named. However, most children of
    fork calls immediately call exec anyway.

So there doesn't seem to be much point in using --trace-children=yes.

> Andres, what do you think about making one of those skink configuration
> changes? Alternatively, I could make the test poll until archiving catches
> up. However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.

I just changed the config to --trace-children=no. There already is a
valgrind run in progress, so it won't be in effect for the next run.

Greetings,

Andres Freund
On Fri, Apr 25, 2025 at 03:35:06PM -0400, Andres Freund wrote:
> On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> > The checkpoints and WAL creation took 30s, but archiving was only 20% done
> > (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.
>
> Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
> example for why the single-threaded archiving approach sucks so badly :)

Yes! I also didn't expect that v14/v13 would run much faster.

> I just changed the config to --trace-children=no. There already is a valgrind
> run in progress, so it won't be in effect for the next run.

Works for me. I see that resolved things.