Thread: Re: Permission failures with WAL files in 13~ on Windows

Re: Permission failures with WAL files in 13~ on Windows

From: Michael Paquier
On Tue, Mar 16, 2021 at 11:40:12AM +0100, Magnus Hagander wrote:
> If we can provide a new .EXE built with exactly the same flags as the
> EDB downloads that they can just drop into a directory, I think it's a
> lot easier to get that done.

Yeah, multiple people have been complaining about that bug, so I have
just produced two builds that people with those sensitive environments
can use, and sent some private links to get the builds.  Let's see how
it goes from this point, but, FWIW, I have not been able to reproduce
my similar problem with the archive command again :/
--
Michael

Re: Permission failures with WAL files in 13~ on Windows

From: Andres Freund
Hi,

On 2021-03-18 09:55:46 +0900, Michael Paquier wrote:
> Let's see how it goes from this point, but, FWIW, I have not been able
> to reproduce again my similar problem with the archive command :/

I suspect it might be easier to reproduce the issue with smaller WAL
segments, a short checkpoint_timeout, and multiple jobs generating WAL
and then sleeping for random amounts of time. Not sure if that's the
sole ingredient, but consider what happens when there are processes
that XLogWrite() some WAL and then sleep. Typically such a process's
openLogFile will still point to the WAL segment. And it may still do
so when the next checkpoint finishes and we recycle the WAL file.
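
To make the "sleeping writer" scenario concrete, here is a minimal
sketch, assuming a POSIX system and hypothetical file names (none of
these functions exist in PostgreSQL). It shows that a descriptor opened
before a rename keeps referring to the same underlying file afterwards,
which is the position a backend's openLogFile would be in once a
segment has been recycled. On Windows the outcome additionally depends
on how the handle was opened, so this illustrates the timing only, not
the reported failure itself.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if descriptor fd and the file currently named by path refer
 * to the same underlying file (same device and inode), 0 if they differ,
 * and -1 if either lookup fails.
 */
int
fd_refers_to_path(int fd, const char *path)
{
    struct stat fd_st;
    struct stat path_st;

    assert(path != NULL);

    if (fstat(fd, &fd_st) != 0 || stat(path, &path_st) != 0)
        return -1;
    return fd_st.st_dev == path_st.st_dev && fd_st.st_ino == path_st.st_ino;
}

/*
 * Hypothetical stand-in for a backend that wrote some WAL and went to
 * sleep: open "segment", let a "checkpointer" rename it to "recycled"
 * while the descriptor stays open, then report whether the descriptor
 * now refers to the recycled name.  Returns the result of
 * fd_refers_to_path(), or -1 on error.
 */
int
sleeping_writer_follows_recycle(const char *segment, const char *recycled)
{
    int fd;
    int result;

    assert(segment != NULL && recycled != NULL);

    fd = open(segment, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* The "checkpointer" recycles the segment under a future name. */
    if (rename(segment, recycled) != 0)
    {
        close(fd);
        return -1;
    }

    result = fd_refers_to_path(fd, recycled);
    close(fd);
    return result;
}
```

Calling sleeping_writer_follows_recycle() with two paths in a scratch
directory is enough to try it; the old name is gone afterwards and only
the recycled name remains.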

I wonder if we actually fail to unlink() the file in
durable_link_or_rename(), and then end up recycling the same old file
into multiple "future" positions in the WAL stream.
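
As an illustration of that failure mode, here is a simplified sketch of
a link-then-unlink "rename without overwriting", assuming POSIX link()
and unlink(). The function name and return codes are hypothetical and
this is not the PostgreSQL implementation. The point is only that if
the unlink step fails and nobody notices, the old name stays live next
to the new one, so the same file can be picked up for recycling again.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical link-then-unlink rename that refuses to overwrite.
 * Returns 0 on success, -1 if link() fails (for example because newfile
 * already exists), and -2 if the new name was created but the old one
 * could not be removed, which leaves the file reachable under two names.
 */
int
rename_excl_checked(const char *oldfile, const char *newfile)
{
    assert(oldfile != NULL && newfile != NULL);

    if (link(oldfile, newfile) != 0)
    {
        fprintf(stderr, "could not link \"%s\" to \"%s\": %s\n",
                oldfile, newfile, strerror(errno));
        return -1;
    }

    if (unlink(oldfile) != 0)
    {
        /* Report instead of ignoring: the old name is still live. */
        fprintf(stderr, "could not remove \"%s\": %s\n",
                oldfile, strerror(errno));
        return -2;
    }

    return 0;
}
```

A caller that sees -2 knows it must not treat the old name as free for
reuse, which is the case a silent failure would hide.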

There's also these interesting notes at
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createhardlinka

1)
> The security descriptor belongs to the file to which a hard link
> points. The link itself is only a directory entry, and does not have a
> security descriptor. Therefore, when you change the security
> descriptor of a hard link, you change the security descriptor of the
> underlying file, and all hard links that point to the file allow the
> newly specified access. You cannot give a file different security
> descriptors on a per-hard-link basis.

2)
> Flags, attributes, access, and sharing that are specified in
> CreateFile operate on a per-file basis. That is, if you open a file
> that does not allow sharing, another application cannot share the file
> by creating a new hard link to the file.

3)
> The maximum number of hard links that can be created with this
> function is 1023 per file. If more than 1023 links are created for a
> file, an error results.


1) and 2) seem problematic for restore_command use. I wonder if there's
a chance that some of the reports ended up hitting 3), and that Windows
doesn't handle that well.


If you manage to reproduce, could you check what the link count of
all the segments is? Apparently Sysinternals' FindLinks can do that.

Or perhaps even better, add an error check that the number of links of
WAL segments is 1 in a bunch of places (recycling, opening them, closing
them, maybe?).

Plus error reporting for unlink failures, of course.
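
A link-count check of that kind could look roughly like the sketch
below. It assumes POSIX stat() and uses hypothetical helper names; it
is not taken from the PostgreSQL tree. On Windows, st_nlink from the C
runtime may not be reliable, and the documented source for the count is
nNumberOfLinks as filled in by GetFileInformationByHandle(), so a real
check there would probably need that call instead.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns the number of hard links of path, or -1 if stat() fails. */
long
file_link_count(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return -1;
    return (long) st.st_nlink;
}

/*
 * Returns 1 if path has exactly one link.  Otherwise prints a warning
 * with the observed count (-1 meaning stat() failed) and returns 0.
 */
int
has_single_link(const char *path)
{
    long nlinks = file_link_count(path);

    if (nlinks == 1)
        return 1;

    fprintf(stderr, "unexpected link count %ld for \"%s\"\n", nlinks, path);
    return 0;
}
```

Such a check would be called at the points mentioned above (recycling,
opening, closing), with the warning turned into whatever error level
the caller considers appropriate.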

Greetings,

Andres Freund



Re: Permission failures with WAL files in 13~ on Windows

From: Michael Paquier
On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens when there are processes
> that XLogWrite() some WAL and then sleep. Typically such a process's
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep.  That's basically the kind of scenario I have been testing to
stress the recycling/removing, with pgbench putting some load into the
server.  This has worked for me.  Once.  But I have little idea why it
gets easier to reproduce in the environments of others, so there may
be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right?  Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about
renaming.

> 1) and 2) seem problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that Windows
> doesn't handle that well.

Yeap.  I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of
> all the segments is? Apparently Sysinternals' FindLinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs.  Still no luck unfortunately :(
--
Michael

Re: Permission failures with WAL files in 13~ on Windows

From: Michael Paquier
On Thu, Mar 18, 2021 at 12:01:40PM +0900, Michael Paquier wrote:
> Yep, that's actually something I wrote for my own setups, with
> log_checkpoints enabled to catch all concurrent checkpoint activity
> and some LOGs.  Still no luck unfortunately :(

The various reporters had more luck than me in reproducing the
issue, so I have applied 909b449e to address the issue.  I am pretty
sure that we should review this business more in the future, but I'd
rather not touch the stable branches.
--
Michael
