Hi,
On 2021-03-18 09:55:46 +0900, Michael Paquier wrote:
> Let's see how it goes from this point, but, FWIW, I have not been able
> to reproduce again my similar problem with the archive command :/ --
I suspect it might be easier to reproduce the issue with smaller WAL
segments, a short checkpoint_timeout, and multiple jobs generating WAL
and then sleeping for random amounts of time. Not sure if that's the
sole ingredient, but consider what happens when there's a process that
XLogWrite()s some WAL and then sleeps. Typically such a process'
openLogFile will still point to the WAL segment. And it may still do
that when the next checkpoint finishes and we recycle the WAL file.
I wonder if we actually fail to unlink() the file in
durable_link_or_rename(), and then end up recycling the same old file
into multiple "future" positions in the WAL stream.
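
To make that suspicion a bit more concrete, here's a rough POSIX sketch of
the failure mode. It is not PostgreSQL code: recycle_segment(),
demo_double_recycle() and the file names are made up for illustration, and
the "failed" unlink() is simulated by simply skipping it. It only shows
that if the old name survives a link-based recycle, the next recycle links
the same underlying file to a second future name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: "recycle" a segment by hard-linking it to a new
 * name and then removing the old name. With do_unlink = 0 we simulate
 * the unlink() of the old name failing silently.
 */
static int
recycle_segment(const char *oldname, const char *newname, int do_unlink)
{
    if (link(oldname, newname) != 0)
        return -1;
    if (do_unlink && unlink(oldname) != 0)
        return -1;
    return 0;
}

/*
 * Recycle the same old file twice without removing the old name, write
 * through the first future name, read back through the second one.
 * Returns the link count seen via the second name, or -1 on error.
 */
static long
demo_double_recycle(const char *dir, char *buf, size_t buflen)
{
    char        oldp[512], fut1[512], fut2[512];
    struct stat st;
    ssize_t     n;
    int         fd;

    snprintf(oldp, sizeof(oldp), "%s/gr_seg_old", dir);
    snprintf(fut1, sizeof(fut1), "%s/gr_seg_future1", dir);
    snprintf(fut2, sizeof(fut2), "%s/gr_seg_future2", dir);
    unlink(oldp); unlink(fut1); unlink(fut2);   /* leftovers, errors ignored */

    fd = open(oldp, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* two checkpoints in a row pick up the same old file */
    if (recycle_segment(oldp, fut1, 0) != 0 ||
        recycle_segment(oldp, fut2, 0) != 0)
        return -1;

    /* WAL written at the first future position ... */
    fd = open(fut1, O_WRONLY);
    if (fd < 0)
        return -1;
    if (write(fd, "wal-1", 5) != 5)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* ... is also visible at the second future position */
    fd = open(fut2, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, buflen - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';

    if (stat(fut2, &st) != 0)
        return -1;
    return (long) st.st_nlink;
}
```

Calling demo_double_recycle() on a scratch directory should report three
links (the old name plus both future names), with the data written via
the first future name readable via the second. Whether that's what
actually happens on Windows is exactly the open question.
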
There's also these interesting notes at
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createhardlinka
1)
> The security descriptor belongs to the file to which a hard link
> points. The link itself is only a directory entry, and does not have a
> security descriptor. Therefore, when you change the security
> descriptor of a hard link, you a change the security descriptor of the
> underlying file, and all hard links that point to the file allow the
> newly specified access. You cannot give a file different security
> descriptors on a per-hard-link basis.
2)
> Flags, attributes, access, and sharing that are specified in
> CreateFile operate on a per-file basis. That is, if you open a file
> that does not allow sharing, another application cannot share the file
> by creating a new hard link to the file.
3)
> The maximum number of hard links that can be created with this
> function is 1023 per file. If more than 1023 links are created for a
> file, an error results.
1) and 2) seem problematic for restore_command use. I wonder if there's
a chance that some of the reports ended up hitting 3), and that Windows
doesn't handle that well.
If you manage to reproduce, could you check what the link count of
all the segments is? Apparently Sysinternals' FindLinks can do that.
Or perhaps even better, add an error check that the number of links of
WAL segments is 1 in a bunch of places (recycling, opening them, closing
them, maybe?).
Plus error reporting for unlink failures, of course.
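
Roughly what I have in mind for those two checks, as a POSIX-flavoured
sketch rather than a patch. segment_has_single_link() and
unlink_reporting() are hypothetical names, and real code would use
elog()/ereport() instead of fprintf(). On Windows the link count would
presumably have to come from GetFileInformationByHandle()
(nNumberOfLinks in BY_HANDLE_FILE_INFORMATION) rather than fstat().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical check: returns 1 if the open segment has exactly one
 * link, 0 if it has any other count (and complains), -1 if fstat() fails.
 */
static int
segment_has_single_link(int fd, const char *name)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
    {
        fprintf(stderr, "could not stat \"%s\": %s\n",
                name, strerror(errno));
        return -1;
    }
    if (st.st_nlink != 1)
    {
        fprintf(stderr, "unexpected link count %lu for \"%s\"\n",
                (unsigned long) st.st_nlink, name);
        return 0;
    }
    return 1;
}

/*
 * unlink() wrapper that reports failures instead of ignoring them.
 * Returns 0 on success, otherwise the errno value from unlink().
 */
static int
unlink_reporting(const char *path)
{
    if (unlink(path) != 0)
    {
        int         save_errno = errno;

        fprintf(stderr, "could not remove \"%s\": %s\n",
                path, strerror(save_errno));
        return save_errno;
    }
    return 0;
}
```

The idea would be to call the link count check wherever a segment is
recycled, opened or closed, and to route every unlink() of a WAL file
through something like the wrapper, so a silent failure can't hide.
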
Greetings,
Andres Freund