Thread: Intermittent pg_ctl failures on Windows

Intermittent pg_ctl failures on Windows

From

Tom Lane

Date:

11 March 2018, 04:48:28

The buildfarm's Windows members occasionally show weird pg_ctl failures,
for instance this recent case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2018-03-10%2020%3A30%3A20

### Restarting node "master"
# Running: pg_ctl -D
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata-l
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.logrestart
waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed

or this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2017-12-29%2023%3A30%3A24

### Stopping node "subscriber" using mode fast
# Running: pg_ctl -D
c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata-m fast stop
waiting for server to shut down....pg_ctl: could not open PID file
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
Permissiondenied
Bail out! system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus
interference, but it suddenly occurred to me that there might be a
consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
for server start/stop, it is trying to read postmaster.pid more-or-less
concurrently with the postmaster writing to that file. On Unix that's not
much of a problem, but I believe that on Windows you have to specifically
open the file with sharing enabled, or you get error messages like these.
The postmaster should be enabling sharing, because port.h redirects
open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing flags.
But it only does that #ifndef FRONTEND. So pg_ctl is just using naked
open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the
problem if you modify the postmaster to hold postmaster.pid open longer
when rewriting it, e.g. stick fractional-second sleeps into CreateLockFile
and AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix,
but I think somebody should.

regards, tom lane

RE: Intermittent pg_ctl failures on Windows

From

Badrul Chowdhury

Date:

12 March 2018, 17:55:45

Hi Tom,

This is a great catch. I am looking into it: I will start by reproducing the error as you suggested.

Thanks,
Badrul

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, March 10, 2018 2:48 PM
To: pgsql-hackers@lists.postgresql.org
Subject: Intermittent pg_ctl failures on Windows

The buildfarm's Windows members occasionally show weird pg_ctl failures, for instance this recent case:


https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0

### Restarting node "master"
# Running: pg_ctl -D
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata-l
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.logrestart waiting for
serverto shut down.... done server stopped waiting for server to start....The process cannot access the file because it
isbeing used by another process. 
 stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out!  system pg_ctl failed

or this one:


https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0

### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata-m fast stop
waitingfor server to shut down....pg_ctl: could not open PID file
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
Permissiondenied Bail out!  system pg_ctl failed 

I'd been writing these off as Microsoft randomness and/or antivirus interference, but it suddenly occurred to me that
theremight be a consistent explanation: since commit f13ea95f9, when pg_ctl is waiting for server start/stop, it is
tryingto read postmaster.pid more-or-less concurrently with the postmaster writing to that file.  On Unix that's not
muchof a problem, but I believe that on Windows you have to specifically open the file with sharing enabled, or you get
errormessages like these. 
The postmaster should be enabling sharing, because port.h redirects open/fopen to pgwin32_open/pgwin32_fopen which
enablethe sharing flags. 
But it only does that #ifndef FRONTEND.  So pg_ctl is just using naked open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the problem if you modify the postmaster to hold
postmaster.pidopen longer when rewriting it, e.g. stick fractional-second sleeps into CreateLockFile and
AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix, but I think somebody should.

            regards, tom lane

Re: Intermittent pg_ctl failures on Windows

From

r.zharkov@postgrespro.ru

Date:

17 July 2019, 13:40:36

Hello!

We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 
times.

There are two different problems with pg_ctl:
1 - share access to postmaster.pid;

2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new 
cmd.exe.
> waiting for server to shut down.... done
> server stopped
> waiting for server to start....The process cannot access the file 
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.

To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function 
to correctly handle share locks.
On Windows systems we cannot handle ERROR_DELETE_PENDING because 
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

To avoid the second error we added the wait for cmd.exe finish in the 
do_stop() and do_restart() functions.

Git patch is in the attachment.

On 2018-03-12 18:55, Badrul Chowdhury wrote:
> Hi Tom,
> 
> This is a great catch. I am looking into it: I will start by
> reproducing the error as you suggested.
> 
> Thanks,
> Badrul
> 
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Saturday, March 10, 2018 2:48 PM
> To: pgsql-hackers@lists.postgresql.org
> Subject: Intermittent pg_ctl failures on Windows
> 
> The buildfarm's Windows members occasionally show weird pg_ctl
> failures, for instance this recent case:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0
> 
> ### Restarting node "master"
> # Running: pg_ctl -D
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -l
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
> restart waiting for server to shut down.... done server stopped
> waiting for server to start....The process cannot access the file
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.
> Bail out!  system pg_ctl failed
> 
> or this one:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0
> 
> ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
> c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
> -m fast stop waiting for server to shut down....pg_ctl: could not open
> PID file
>
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
> Permission denied Bail out!  system pg_ctl failed
> 
> I'd been writing these off as Microsoft randomness and/or antivirus
> interference, but it suddenly occurred to me that there might be a
> consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
> for server start/stop, it is trying to read postmaster.pid
> more-or-less concurrently with the postmaster writing to that file.
> On Unix that's not much of a problem, but I believe that on Windows
> you have to specifically open the file with sharing enabled, or you
> get error messages like these.
> The postmaster should be enabling sharing, because port.h redirects
> open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
> flags.
> But it only does that #ifndef FRONTEND.  So pg_ctl is just using naked
> open(), which could explain these failures.
> 
> If this theory is accurate, it should be pretty easy to replicate the
> problem if you modify the postmaster to hold postmaster.pid open
> longer when rewriting it, e.g. stick fractional-second sleeps into
> CreateLockFile and AddToDataDirLockFile.
> 
> I'm not in a position to investigate this in detail nor test a fix,
> but I think somebody should.
> 
>             regards, tom lane

-- 

regars, Roman Zharkov

Attachment

pg_ctl.patch

Re: Intermittent pg_ctl failures on Windows

From

Tom Lane

Date:

17 July 2019, 16:51:48

r.zharkov@postgrespro.ru writes:
> pg_ctl now opens the postmaster.pid file using pgwin32_open() function 
> to correctly handle share locks.

HEAD already does that, no?  See f02259fe9.

> On Windows systems we cannot handle ERROR_DELETE_PENDING because 
> GetLastError() returns ERROR_ACCESS_DENIED instead.
> So we rename the lock files before delete them.

This seems improbably broken/stupid.  Also, I've not seen any buildfarm
failures that would match this; it's always pg_ctl complaining not the
postmaster.

> To avoid the second error we added the wait for cmd.exe finish in the 
> do_stop() and do_restart() functions.

Hmmm ... there seems the germ of a good idea here, but can't we do
it with less (and less ugly) code?

Alternatively, perhaps we could fix things so that the parent cmd.exe
shell isn't involved in logfile access?  That'd require teaching the
postmaster to open/redirect its stdout/stderr, which is kind of
annoying, but it might beat hacking things as you have done here.

            regards, tom lane

Re: Intermittent pg_ctl failures on Windows

From

r.zharkov@postgrespro.ru

Date:

17 July 2019, 17:54:16

On 2019-07-17 20:51, Tom Lane wrote:
> r.zharkov@postgrespro.ru writes:
>> pg_ctl now opens the postmaster.pid file using pgwin32_open() function
>> to correctly handle share locks.
> 
> HEAD already does that, no?  See f02259fe9.
> 

You are right. I tested branch REL_11_STABLE and it is my mistake.

>> On Windows systems we cannot handle ERROR_DELETE_PENDING because
>> GetLastError() returns ERROR_ACCESS_DENIED instead.
>> So we rename the lock files before delete them.
> 
> This seems improbably broken/stupid.  Also, I've not seen any buildfarm
> failures that would match this; it's always pg_ctl complaining not the
> postmaster.
> 

The probability is very small. In one of our tests pg_ctl fails with 
that error sometime.
In a test with multiple frequent restarts the probability is 5-6%. On 
other machines probability differs.
To find out the real error code we used the Process Monitor utility.

>> To avoid the second error we added the wait for cmd.exe finish in the
>> do_stop() and do_restart() functions.
> 
> Hmmm ... there seems the germ of a good idea here, but can't we do
> it with less (and less ugly) code?
> 
> Alternatively, perhaps we could fix things so that the parent cmd.exe
> shell isn't involved in logfile access?  That'd require teaching the
> postmaster to open/redirect its stdout/stderr, which is kind of
> annoying, but it might beat hacking things as you have done here.
> 

If we need parent cmd.exe only for log writing, can we start the 
postmaster without it?
Is cmd.exe necessary?


-- 
regards, Roman Zharkov

Re: Intermittent pg_ctl failures on Windows

From

Michael Paquier

Date:

18 July 2019, 08:38:34

On Wed, Jul 17, 2019 at 09:51:48AM -0400, Tom Lane wrote:
> r.zharkov@postgrespro.ru writes:
>> On Windows systems we cannot handle ERROR_DELETE_PENDING because
>> GetLastError() returns ERROR_ACCESS_DENIED instead.
>> So we rename the lock files before delete them.
>
> This seems improbably broken/stupid.  Also, I've not seen any buildfarm
> failures that would match this; it's always pg_ctl complaining not the
> postmaster.

Oh, it is.  A lot.  And this has been discussed a couple of times
across multiple threads on -bugs and/or -hackers.  I think that
Windows just lacks a way to detect reliably if a file is pending for
deletion or not, and there is no way that we are going to enforce
blindly a ERROR_DELETE_PENDING when we see EACCES.  One code path
which is impacted by that is our wrapper for stat() for the win32
port...  For now I'd rather believe that what the OS tells us is true,
in which case here it is plain wrong, but well.
--
Michael

Attachment

signature.asc

Re: Intermittent pg_ctl failures on Windows

From

Жарков Роман

Date:

18 July 2019, 10:04:34

Hello,

Therefore, we suggest replace the deletion of a lock file by renaming it.
unlink in windows is not an atomic operation.
When we try to open the file between
SetDispositionInformationFile and CloseFile we get ERROR_DELETE_PENDING ( see screenshot )


regards, Roman

> 18 июля 2019 г., в 12:38, Michael Paquier <michael@paquier.xyz>
>
> Oh, it is.  A lot.  And this has been discussed a couple of times
> across multiple threads on -bugs and/or -hackers.  I think that
> Windows just lacks a way to detect reliably if a file is pending for
> deletion or not, and there is no way that we are going to enforce
> blindly a ERROR_DELETE_PENDING when we see EACCES.  One code path
> which is impacted by that is our wrapper for stat() for the win32
> port...  For now I'd rather believe that what the OS tells us is true,
> in which case here it is plain wrong, but well.
> --
> Michael

Attachment

image1.png

Re: Intermittent pg_ctl failures on Windows

From

Michael Paquier

Date:

18 July 2019, 11:51:48

On Wed, Jul 17, 2019 at 09:54:16PM +0700, r.zharkov@postgrespro.ru wrote:
> You are right. I tested branch REL_11_STABLE and it is my mistake.
>
> [...]
>
> The probability is very small. In one of our tests pg_ctl fails with that
> error sometime.
> In a test with multiple frequent restarts the probability is 5-6%. On other
> machines probability differs.
> To find out the real error code we used the Process Monitor utility.

I would like to be sure of something here.  Are those failures based
on what you have tested on REL_11_STABLE or do you still see pg_ctl
complain about such things on HEAD (post f02259fe as mentioned by Tom
upthread)?  I would like to think that you mean the former and the
latter, but this thread mentions only the former.
--
Michael

Attachment

signature.asc

Re: Intermittent pg_ctl failures on Windows

From

Kyotaro Horiguchi

Date:

18 July 2019, 12:00:48

At Thu, 18 Jul 2019 14:04:34 +0700, Жарков Роман <r.zharkov@postgrespro.ru> wrote in
<ECD07611-BBA8-4E9B-975C-50E59F7154DA@postgrespro.ru>
> Hello,
>
> Therefore, we suggest replace the deletion of a lock file by renaming it.
> unlink in windows is not an atomic operation.
> When we try to open the file between
> SetDispositionInformationFile and CloseFile we get ERROR_DELETE_PENDING ( see screenshot )

I found a mail in the arvhive.

https://www.postgresql.org/message-id/CABUevExwz9ncAk90Hb3Sb4MHeaOEyS59eTb_AKW9qFEjAidg5Q%40mail.gmail.com

> > > > One approach would be to rename the file into something indicating it's
> > > > being deleted, before actually deleting it.
> > > >
> > > >
> > > That would be an option (IIRC if you open with FILE_SHARE_DELETE, it can
> > > also be renamed). But if we can track the delete when we try to open it
> > and
> > > just treat it as "file does not exist", that seems cleaner.
> >
> > I'm not sure what exactly you're suggesting. But isn't there the issue
> > that such an approach will not interoperate with external tools?
> >
>
>
>
> I'll have to admit I wasn't thinking of external tools. The external tools
> would have to learn about pending-delete files themselves. So yeah, if we
> care about those then renaming them to something predictable and then
> instruct the third party tools to exclude such files would be a more
> complete fix.

STATUS_PENDING_DELETE is concealed under win32 API (seems like a kind of
misdesign if it is true that ERROR_PENDING_DELETE is defined but not actually
used as I saw in someone's blog.). AFAICS native interfaces like NtOpenFile
returns the native status code. That is a bit cumbersome but not too difficult
(as heard).

https://resources.infosecinstitute.com/calling-ntdll-functions-directly/

That said, I haven't succeeded even to open a file with it. (I replaced the path
with NT native path (\??\c:\blah..) since NtOpenFile doesn't accept the
user-friendly(?)  paths).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Intermittent pg_ctl failures on Windows

From

Жарков Роман

Date:

18 July 2019, 12:14:34

I have tested clean REL_11_STABLE.
Commit f02259fe was reverted by df8b5f3e in this branch.
So pg_ctl uses “old” open() function.

regards, Roman

> 18 июля 2019 г., в 15:51, Michael Paquier <michael@paquier.xyz> написал(а):
>
>> On Wed, Jul 17, 2019 at 09:54:16PM +0700, r.zharkov@postgrespro.ru wrote:
>> You are right. I tested branch REL_11_STABLE and it is my mistake.
>>
>> [...]
>>
>> The probability is very small. In one of our tests pg_ctl fails with that
>> error sometime.
>> In a test with multiple frequent restarts the probability is 5-6%. On other
>> machines probability differs.
>> To find out the real error code we used the Process Monitor utility.
>
> I would like to be sure of something here.  Are those failures based
> on what you have tested on REL_11_STABLE or do you still see pg_ctl
> complain about such things on HEAD (post f02259fe as mentioned by Tom
> upthread)?  I would like to think that you mean the former and the
> latter, but this thread mentions only the former.
> --
> Michael

Re: Intermittent pg_ctl failures on Windows

From

Michael Paquier

Date:

19 July 2019, 06:02:08

On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:
> I have tested clean REL_11_STABLE.
> Commit f02259fe was reverted by df8b5f3e in this branch.
> So pg_ctl uses “old” open() function.

Yeah, that was a failure from me, so I tend to be rather very careful
about anything related to Windows.  However, after that we have added
40cfe86 about which nobody has complained yet, and the number of
buildfarm failures about pg_ctl concurrency on HEAD has gone down to
zero since (perhaps I am missing something?).

So, instead of trying to invent a new solution only for stable
branches (which may have its own bugs we would need to deal with only
on stable branches, and only for Windows), why don't we just try to
move forward into back-patching those pieces?  Or it happens that we
still have some potential failures on HEAD and REL_12_STABLE which
would justify some extra handling?  In this case, I would recommend
that we focus on HEAD as a first step, and put things in order there.
--
Michael

Attachment

signature.asc

Re: Intermittent pg_ctl failures on Windows

From

Tom Lane

Date:

19 July 2019, 07:59:43

Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:
>> I have tested clean REL_11_STABLE.
>> Commit f02259fe was reverted by df8b5f3e in this branch.
>> So pg_ctl uses “old” open() function.

> Yeah, that was a failure from me, so I tend to be rather very careful
> about anything related to Windows.  However, after that we have added 
> 40cfe86 about which nobody has complained yet, and the number of
> buildfarm failures about pg_ctl concurrency on HEAD has gone down to
> zero since (perhaps I am missing something?).

Hm, I think 0ba06e0 is actually the relevant change here?  Though
40cfe86 was a necessary cleanup fix.

I'm too tired to dig in the buildfarm database to be sure, but my
impression is that the failure rate is much-better-but-not-zero.
So I'd support back-patching those two commits, but I'm not sure
if that's the end of the conversation.

            regards, tom lane

Re: Intermittent pg_ctl failures on Windows

From

Жарков Роман

Date:

19 July 2019, 08:00:37

First time we found pg_ctl errors while testing our fork.
I reproduced them on REL_11_STABLE.
I found three problems with pg_ctl do_stop/do_restart:
1 - "old" fopen() function;
2 - "delete pending" problem rarely happens with "new" fopen() function when pg_ctl tries to open postmaster.pid file;
3 - cmd.exe shell may block the log file when the server restarts;

Now i try to reproduce it on REL_12_STABLE

regard, Roman

> 19 июля 2019 г., в 10:02, Michael Paquier <michael@paquier.xyz> написал(а):
>
>> On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:
>> I have tested clean REL_11_STABLE.
>> Commit f02259fe was reverted by df8b5f3e in this branch.
>> So pg_ctl uses “old” open() function.
>
> Yeah, that was a failure from me, so I tend to be rather very careful
> about anything related to Windows.  However, after that we have added
> 40cfe86 about which nobody has complained yet, and the number of
> buildfarm failures about pg_ctl concurrency on HEAD has gone down to
> zero since (perhaps I am missing something?).
>
> So, instead of trying to invent a new solution only for stable
> branches (which may have its own bugs we would need to deal with only
> on stable branches, and only for Windows), why don't we just try to
> move forward into back-patching those pieces?  Or it happens that we
> still have some potential failures on HEAD and REL_12_STABLE which
> would justify some extra handling?  In this case, I would recommend
> that we focus on HEAD as a first step, and put things in order there.
> --
> Michael

Re: Intermittent pg_ctl failures on Windows

From

Michael Paquier

Date:

19 July 2019, 08:23:36

On Fri, Jul 19, 2019 at 12:59:43AM -0400, Tom Lane wrote:
> Hm, I think 0ba06e0 is actually the relevant change here?  Though
> 40cfe86 was a necessary cleanup fix.

Oops.  Yes, I meant that.

> I'm too tired to dig in the buildfarm database to be sure, but my
> impression is that the failure rate is much-better-but-not-zero.
> So I'd support back-patching those two commits, but I'm not sure
> if that's the end of the conversation.

Yeah, I am a bit afraid to miss something under the hood...  Let's
see...  Looking at the last 90 days of failures, I actually see one
from 63 days ago from bowerbird which seems to indicate that we are not yet
completely done with it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-05-16%2010%3A30%3A52
# Running: pg_ctl stop -D
H:\prog\bf\root\HEAD\pgsql.build\src\bin\pg_ctl\tmp_check\tmp_test_qB5F/data
 waiting for server to shut down....pg_ctl: could not open PID file
"H:/prog/bf/root/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/tmp_test_qB5F/data/postmaster.pid":
Permission denied
not ok 16 - pg_ctl stop

And there is this one from jacana with subscription tests from 27 days
ago:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-06-12%2003%3A42%3A55

This makes it at least two, still that's much less than before :(
--
Michael

Attachment

signature.asc

Re: Intermittent pg_ctl failures on Windows

From

r.zharkov@postgrespro.ru

Date:

19 July 2019, 11:24:25

Hello,

I have reproduced the DELETE_PENDING error on the REL_12_STABLE.
I changed the WAITS_PER_SEC to 500000 in the pg_ctl.c file.
Then I launched dummy TAP test ( see attachment ).
The picture in the attachment illustrates the hidden DELETE_PENDING 
error.

-------------------------------------------
Log output:
# Postmaster PID for node "master" is 17796
# restarted 233
# 14:31:05
### Restarting node "master"
# Running: pg_ctl -D 
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata 
-l C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/log/test_master.log 
restart
waiting for server to shut down....pg_ctl: could not open PID file 
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid": 
Permission denied
Bail out!  system pg_ctl failed
### Stopping node "master" using mode immediate
# Running: pg_ctl -D 
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata 
-m immediate stop
pg_ctl: PID file 
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid" 
does not exist
Is server running?
Bail out!  system pg_ctl failed
-------------------------------------------


On 2019-07-19 11:59, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> On Thu, Jul 18, 2019 at 04:14:34PM +0700, Жарков Роман wrote:
>>> I have tested clean REL_11_STABLE.
>>> Commit f02259fe was reverted by df8b5f3e in this branch.
>>> So pg_ctl uses “old” open() function.
> 
>> Yeah, that was a failure from me, so I tend to be rather very careful
>> about anything related to Windows.  However, after that we have added
>> 40cfe86 about which nobody has complained yet, and the number of
>> buildfarm failures about pg_ctl concurrency on HEAD has gone down to
>> zero since (perhaps I am missing something?).
> 
> Hm, I think 0ba06e0 is actually the relevant change here?  Though
> 40cfe86 was a necessary cleanup fix.
> 
> I'm too tired to dig in the buildfarm database to be sure, but my
> impression is that the failure rate is much-better-but-not-zero.
> So I'd support back-patching those two commits, but I'm not sure
> if that's the end of the conversation.
> 
>             regards, tom lane

-- 
regards, Roman

Attachment

Re: Intermittent pg_ctl failures on Windows

From

r.zharkov@postgrespro.ru

Date:

19 July 2019, 11:35:42

Hello,

I have reproduced the CMD.EXE error "The process cannot access the file 
because it is being used by another process" the REL_12_STABLE.
I simply launched my dummy TAP test and I got «Процесс не может получить 
доступ к файлу, так как этот файл занят другим процессом.»
( “The process cannot access the file because it is being used by 
another process” in english Windows version )

The pictures in the attachment illustrate the SHARING_VIOLATION error.

-------------------------------------------
# Postmaster PID for node "master" is 10456
# restarted 32
# 13:28:58
### Restarting node "master"
# Running: pg_ctl -D 
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata 
-l C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/log/test_master.log 
restart
waiting for server to shut down.... done
server stopped
waiting for server to start....Процесс не может получить доступ к файлу, 
так как этот файл занят другим процессом.
  stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out!  system pg_ctl failed
### Stopping node "master" using mode immediate
# Running: pg_ctl -D 
C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata 
-m immediate stop
pg_ctl: PID file 
"C:/HOME/Git/postgrespro/src/test/pg_ctl/tmp_check/t_test_master_data/pgdata/postmaster.pid" 
does not exist
Is server running?
Bail out!  system pg_ctl failed
------------------------------------------


-- 
regards, Roman