Re: Permission failures with WAL files in 13~ on Windows - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Permission failures with WAL files in 13~ on Windows
Msg-id YFLClG7KfETQ+xFG@paquier.xyz
In response to Re: Permission failures with WAL files in 13~ on Windows  (Andres Freund <andres@anarazel.de>)
Responses Re: Permission failures with WAL files in 13~ on Windows  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens there's processes that
> XLogWrite()s some WAL and then sleeps. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep.  That's basically the kind of scenario I have been testing to
stress the recycling and removal of segments, with pgbench putting
some load on the server.  This has worked for me.  Once.  But I have
little idea why it is easier to reproduce in other people's
environments, so there may be an OS-version dependency in the
equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right?  Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about
renaming.
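
To make the first step concrete, here is a minimal sketch of a
link-then-unlink style "exclusive rename" that reports which of its two
steps failed.  It is written against POSIX link()/unlink() and is not the
actual durable_rename_excl() code; the function name, the result enum and
the messages are hypothetical, and the fsync calls of the real routine are
left out.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch, not PostgreSQL's. */
typedef enum
{
    RENAME_EXCL_OK = 0,
    RENAME_EXCL_LINK_FAILED,    /* new name not created, old file untouched */
    RENAME_EXCL_UNLINK_FAILED   /* new name created, old name still present */
} RenameExclResult;

/*
 * Give oldfile the name newfile without overwriting an existing newfile.
 * Step 1: link() fails with EEXIST if newfile already exists.
 * Step 2: unlink() drops the old name; a failure here leaves the same
 * file reachable under two names, so it is reported explicitly.
 */
static RenameExclResult
rename_excl_sketch(const char *oldfile, const char *newfile)
{
    if (link(oldfile, newfile) < 0)
    {
        fprintf(stderr, "could not link \"%s\" to \"%s\": %s\n",
                oldfile, newfile, strerror(errno));
        return RENAME_EXCL_LINK_FAILED;
    }

    if (unlink(oldfile) < 0)
    {
        fprintf(stderr, "could not remove old name \"%s\": %s\n",
                oldfile, strerror(errno));
        return RENAME_EXCL_UNLINK_FAILED;
    }

    return RENAME_EXCL_OK;
}
```

The point of the separate RENAME_EXCL_UNLINK_FAILED result is that a
caller can tell "nothing happened" apart from "the segment now has two
names", which is the case suspected upthread.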

> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yep.  I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of the
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity,
plus some extra LOG messages.  Still no luck unfortunately :(
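
For reference, the kind of link-count check discussed above could look
roughly like the following.  This is only a sketch under assumptions: it
uses POSIX stat() and st_nlink, the helper names are made up for
illustration, and on Windows the count would more likely have to come from
something like GetFileInformationByHandle() (nNumberOfLinks) rather than
from stat().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return the number of hard links of the file at path, or -1 if it
 * cannot be stat()'d.  Hypothetical helper for this sketch.
 */
static long
file_link_count(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0)
    {
        fprintf(stderr, "could not stat \"%s\": %s\n",
                path, strerror(errno));
        return -1;
    }
    return (long) st.st_nlink;
}

/*
 * Return 1 if the segment has exactly one link, 0 otherwise.  A count
 * above 1 means the same file is reachable under several names.
 */
static int
segment_has_single_link(const char *path)
{
    long        nlinks = file_link_count(path);

    if (nlinks != 1)
    {
        fprintf(stderr, "unexpected link count %ld for \"%s\"\n",
                nlinks, path);
        return 0;
    }
    return 1;
}
```

Such a check could be called at the points mentioned upthread (recycling,
opening, closing a segment) to catch a segment that ended up with more
than one name.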
--
Michael