Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) - Mailing list pgsql-bugs

From Alexander Lakhin
Subject Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Date
Msg-id c4c0e925-7b00-a555-9c56-80573af5ff4a@gmail.com
Whole thread Raw
In response to Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
List pgsql-bugs
Hello Tom,
16.12.2019 0:26, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>>> The regression tests on Windows sometimes fail with 'Permission denied'
>>> errors. For example:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2019-12-11%2007%3A45%3A33
>> I see two ways to fix the issue:
>> 1) Unlink postmaster.pid using rename operation (adopt the solution from
>> https://stackoverflow.com/questions/3764072/).
>> 2) Ignore such 'Permission denied' error and just try to open the file
>> once again (attached fix_open_for_unlink.patch implements this).
>> I'm inclined to the second approach as pgwin32_open() already handles
>> two transient states (Windows-only), and it could be useful not only for
>> postmaster.pid, but for some other files.
> Agreed that (2) seems like the way to go.  However, I'm not too pleased
> with the patch as given, because it is gratuitously different in almost
> every possible way from the adjacent code that's doing just about the same
> thing for those other transient failures.  Why is the timeout duration
> different?  Why is the looping logic not identical?  Why did you make a
> different decision about whether logging might be a good idea?  Actually,
> why didn't you just extend the existing if-block to also cover this case?
> Maybe there's good reasons to be different, but you didn't explain them.
Thanks for your questions. I'll try to add more details.
I didn't want to use the existing code for three reasons.
First, there are too many differences in the parameters: other error
code, other error reason, no need for logging.
Second, I wanted to decrease a delay, as whole unlink operation
(comprising of three kernel calls: CreateFile,
SetDispositionInformationFile, CloseFile) performed for me in 0.0002
seconds (as ProcessMonitor showed for several runs). Assuming even
hundredfold duration seems sufficient. On the other side,
ERROR_ACCESS_DENIED can be caused by a real access restriction and
waiting for 30 seconds before erroring out is too long.
Third, the logic of the existing loop confused me — I haven't seen a
reason to sleep for a last time and then return an error without
rechecking a condition.
And I think that the logging is irrelevant as we want to handle here
just the DELETE_PENDING state. If the "access denied" error state is
disappeared within a second, then we can log only "the file probably was
in a deletion state and now it's not", and if this state is persisting,
we can log "access is really denied", but I believe that the existing
calling code does the same anyway.
> I'm also not that excited about memorializing a stackoverflow discussion
> as the reason to do something.  Can we point to something in Microsoft's
> official docs?
I'm not excited about this too, but I couldn't find something more
appropriate.
For example. there is a similar question on their forums:

https://social.technet.microsoft.com/Forums/office/en-US/58e5c670-f024-44ff-9919-36c44ab11d9c/file-delete-pending-problem?forum=w7itprogeneral
but it left without a reference to their docs.
WebArchive contains their knowledge base article where
STATUS_DELETE_PENDING is mapped to ERROR_ACCESS_DENIED, but today this
information cannot be found on microsoft.com:
https://web.archive.org/web/20150317121919/http://support.microsoft.com/en-us/kb/113996
Maybe they consider this as too low-level implementation details...
I think, they have some internal documentation where this is described
in details, but we can't reference it anyway.
On the other hand, finding a couple of references to stackoverflow in
src/ made me feel that it is not widely accepted, but still accepted (as
an exception).

Best regards.
Alexander



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Next
From: avinash varma
Date:
Subject: Planning time is high in Postgres 11.5 Compared with Postgres 10.11