On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens there's processes that
> XLogWrite()s some WAL and then sleeps. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.
Yep. That's basically the kind of scenarios I have been testing to
stress the recycling/removing, with pgbench putting some load into the
server. This has worked for me. Once. But I have little idea why it
gets easier to reproduce in the environments of others, so there may
be an OS-version dependency in the equation here.
> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.
You actually mean durable_rename_excl() as of 13~, right? Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complain about
renaming.
> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.
Yeap. I was thinking about 3) being the actual problem while going
through those docs two days ago.
> If you manage to reproduce, could you check what the link count of the
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.
Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs. Still no luck unfortunately :(
--
Michael