Re: avoid multiple hard links to same WAL file after a crash - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: avoid multiple hard links to same WAL file after a crash
Msg-id 20220408165312.GB1409034@nathanxps13
In response to Re: avoid multiple hard links to same WAL file after a crash  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: avoid multiple hard links to same WAL file after a crash  (Nathan Bossart <nathandbossart@gmail.com>)
Re: avoid multiple hard links to same WAL file after a crash  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I see that durable_rename_excl() has the following comment: "Similar
> to durable_rename(), except that this routine tries (but does not
> guarantee) not to overwrite the target file." If those are the desired
> semantics, we could achieve them more simply and more safely by just
> trying to stat() the target file and then, if it's not found, call
> durable_rename(). I think that would be a heck of a lot safer than
> what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
all.  Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().
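
To make the trade-off concrete, here is a rough sketch of the two
approaches side by side.  This is not the fd.c code: the function names
are made up for illustration, and the fsync() calls that the durable_*
routines perform are left out so that only the rename semantics are visible.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): link()-then-unlink().
 * link() fails with EEXIST if newpath already exists, so the target is
 * never overwritten.  Between the two calls, however, both names refer
 * to the same file, and a crash there leaves two hard links behind.
 */
static int
rename_excl_link(const char *oldpath, const char *newpath)
{
    if (link(oldpath, newpath) < 0)
        return -1;              /* errno is EEXIST if newpath exists */

    /* a crash at this point leaves oldpath and newpath as hard links */

    if (unlink(oldpath) < 0)
        return -1;
    return 0;
}

/*
 * Hypothetical helper (not PostgreSQL code): stat()-then-rename().
 * No extra hard link is created deliberately, but the existence check
 * is only advisory: another process could create newpath between
 * stat() and rename(), and rename() would then overwrite it.
 */
static int
rename_excl_stat(const char *oldpath, const char *newpath)
{
    struct stat st;

    if (stat(newpath, &st) == 0)
    {
        errno = EEXIST;         /* target already exists, refuse */
        return -1;
    }
    if (errno != ENOENT)
        return -1;              /* stat() failed for some other reason */

    /* race window: newpath could appear here and be overwritten */

    return rename(oldpath, newpath);
}
```

Both variants refuse to proceed when the target is already present at
the time of the check.  They differ only in what can go wrong in the
window between the two system calls.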

> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd change it to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I think there might be another problem.  The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file.  This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths.  My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original.  Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.
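
For reference, the write-to-temp-then-rename pattern I have in mind
looks roughly like the following.  It is a minimal sketch under my own
assumptions: the function name and the ".tmp" suffix are invented, a
short write is simply treated as a failure, and the fsync() of the
containing directory that a durable version would need is omitted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): replace the contents of
 * "path" by writing a temporary file and renaming it over the original.
 * Readers see either the old or the new contents, never a partial file.
 */
static int
replace_file_atomically(const char *path, const char *data, size_t len)
{
    char        tmppath[4096];
    int         fd;

    if (snprintf(tmppath, sizeof(tmppath), "%s.tmp", path) >= (int) sizeof(tmppath))
        return -1;              /* path too long for the buffer */

    /* O_TRUNC discards a stale temporary file left by an earlier crash */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* flush the data to disk before the new name becomes visible */
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) < 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) < 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* atomically swap the temporary file into place */
    return rename(tmppath, path);
}
```

If a crash interrupts the rename(), the worst case is a leftover
temporary file (or a second link to the new contents), which the next
attempt can simply overwrite or discard.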

> Your proposed fix is OK if we don't want to do any of that stuff, but
> personally I'm much more inclined to blame durable_rename_excl() for
> being horrible than I am to blame the calling code for using it
> improvidently.

I do agree that it's worth examining this stuff a bit closer.  I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com