Thread: Problem during Windows service start

Problem during Windows service start

From
"Sakai, Teppei"
Date:
Hi,

This is my first posting to the mailing list.

Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start.
The Windows service status of the instance is different from actual status.

We got the following situation.
1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w'
2. Start the instance from the Windows service screen.
3. After 60 seconds, the startup process fails with a timeout.
   Because crash recovery takes a lot of times.

Then, the service status of the instance become "STOPPED",
but the instance was running.
It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl).

PostgreSQL version : 9.5.12
Operating system : Windows Server 2012 R2

I think this is a bug.
I think it has not been fixed in the latest version, is my understanding correct?
If it is correct, I will fix it.

Regards,
SAKAI Teppei



Re: Problem during Windows service start

From
Laurenz Albe
Date:
Sakai, Teppei wrote:
> This is my first posting to the mailing list.
> 
> Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start.
> The Windows service status of the instance is different from actual status.
> 
> We got the following situation.
> 1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w'
> 2. Start the instance from the Windows service screen.
> 3. After 60 seconds, the startup process fails with a timeout.
>    Because crash recovery takes a lot of times.
> 
> Then, the service status of the instance become "STOPPED",
> but the instance was running.
> It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl).
> 
> PostgreSQL version : 9.5.12
> Operating system : Windows Server 2012 R2
> 
> I think this is a bug.
> I think it has not been fixed in the latest version, is my understanding correct?
> If it is correct, I will fix it.

I agree that this is not nice.

How do you propose to fix it?

Yours,
Laurenz Albe


RE: Problem during Windows service start

From
"Higuchi, Daisuke"
Date:
Hi, 

This thread is inactive, but I want to solve this problem. 
I think this problem rarely occurs in 10 or later version because of commit [1]. Because "pg_ctl start -w" wait for
onlyPID file creation.  It means that timeout is not occurred even if crash recovery takes a lot of times. 
 
However, 9.6 or earlier still wait for long time recovery complete. 

> How do you propose to fix it?

I think there are two solutions. 

One solution is that status of Windows Service should be changed to "SERVICE_RUNNING" even if timeout is occurred
becauseof long time recovery. I attached the patch of this solution. 
 

Another solution is to backport commit [1] to 9.6 or earlier version. However this fix change the content of PID file,
soincompatible change, I think. 
 

I would appreciate it if you give any comments. 

[1] https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c

Regards, 
Daisuke, Higuchi


Attachment

Re: Problem during Windows service start

From
Michael Paquier
Date:
On Wed, Jan 09, 2019 at 05:28:29AM +0000, Higuchi, Daisuke wrote:
> One solution is that status of Windows Service should be changed to
> "SERVICE_RUNNING" even if timeout is occurred because of long time
> recovery. I attached the patch of this solution.

You should register this patch to the next commit fest in the section
for bug fixes to not lose sight of it;
https://commitfest.postgresql.org/22/

+    case PQPING_NO_ATTEMPT:
+        write_eventlog(EVENTLOG_ERROR_TYPE,
_("Server startup failed because of wrong parameter or something\n"));
+        pgwin32_SetServiceStatus(SERVICE_STOPPED);

I haven't put much thoughts into what you propose here, but this status
message is not really helpful for the user.

> Another solution is to backport commit [1] to 9.6 or earlier
> version. However this fix change the content of PID file, so
> incompatible change, I think.

Yeah, let's not go down that road.
--
Michael

Attachment

RE: Problem during Windows service start

From
"Higuchi, Daisuke"
Date:
Michael Paquier wrote:
> You should register this patch to the next commit fest in the section for bug fixes to not lose sight of it; 

Thank you for picking up my post. I registered to the next CF. 

> I haven't put much thoughts into what you propose here, but this status message is not really helpful for the user.

Thank you for review. I attached the updated patch. There are some reasons for server startup failing, so I changed the
messagelike following: 
 

+            case PQPING_NO_RESPONSE:
+            case PQPING_NO_ATTEMPT:
+                write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));
+                pgwin32_SetServiceStatus(SERVICE_STOPPED);


Regards, 
Daisuke, Higuchi


Attachment

RE: Problem during Windows service start

From
"Tsunakawa, Takayuki"
Date:
Hi Higuchi-san,


(1)
What made you think this problem rarely occurs in PG 10 or later?  Looking at the following code, this seems to happen
inPG 10+ too.
 

    if (do_wait)
    {
        write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
        if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
        {
            write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
            pgwin32_SetServiceStatus(SERVICE_STOPPED);
            return;
        }
        write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
    }


(2)
What state should we consider SERVICE_RUNNING as?  Isn't it the state where the server has completed startup processing
andaccepts connections?  If no, how is it different from SERVICE_STARTING?
 

(I know that when -w (wait) is not specified, the status becomes SERVICE_RUNNING whether or not the server completes
startupprocessing...)
 


(3)
+                write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the
background\n"));

This message is new, isn't it?  I think the existing message "Time out..." is enough.


(4)
+                write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));

The first sentence (phenomenon) and the second line (detail or action) should be separated with a newline.  Below are
someexamples in pg_ctl.c.  Note that write_stderr() writes to the eventlog when running under a Windows service.
 

                write_stderr(_("%s: could not start server\n"
                               "Examine the log output.\n"),

            write_stderr(_("The program \"%s\" was found by \"%s\"\n"
                           "but was not the same version as %s.\n"
                           "Check your installation.\n"),


Regards
Takayuki Tsunakawa



Re: Problem during Windows service start

From
Ramanarayana
Date:
Hi,

If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it be correct to set the status of windows service to SERVICE_START_PENDING ?

I would like to take this up if no one is working on this.

Regards,
Ram.

RE: Problem during Windows service start

From
"Higuchi, Daisuke"
Date:
Hi, 

Thank you for picking up this and I'm sorry for delay reply. 

>> If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it 
>> be correct to set the status of windows service to SERVICE_START_PENDING ?

Yes, I think this is the best. Currently, I do not have good solution to change the status from SERVICE_START_PENDING
toSERVICE_RUNNING after recovery is completed. 
 

>> I would like to take this up if no one is working on this.

Thank you for your proposing! I would like to ask if possible. 
I do not have much time to think about this topic now...

Regards, 
Higuchi 


Re: Problem during Windows service start

From
Ramanarayana
Date:
Hi,

Please find the proposed patch for review. I will attach it to commitfest as well

Regards,
Ram.
Attachment

RE: Problem during Windows service start

From
"Higuchi, Daisuke"
Date:
Hi, 

+           case POSTMASTER_STILL_STARTING:
+               write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
+               pgwin32_SetServiceStatus(SERVICE_START_PENDING);
+               return;

Could this patch solve first post's problem [1] ?
I think no one could change the service status to SERVICE_RUNNING even if the server has been started properly after
timeoutis occurred. 
 

In 9.6 or earlier, the main use case where the problem is occurred is when timeout is occured because of long time
recovery.Even if recovery takes a lot of time and timeout is ocurred, recovery continues in background. In this case, I
wantto set the status to SERVICE_RUNNING after recovery is completed. 
 

In 10 or later, I understand wait_for_postmaster does not wait until recovery is completed, so I think this problem
rarelyoccurs in PG 10 or later.
 

[1] https://www.postgresql.org/message-id/99C4246E73ED1B478BBB9671718426203E37F485@G01JPEXMBKW03

Regards, 
Daisuke, Higuchi


Re: Problem during Windows service start

From
Kyotaro Horiguchi
Date:
Sorry in advance for link-breaking message, but the original mail was
too old and gmail doesn't allow me to craft required headers to link
to it.

https://www.postgresql.org/message-id/CAKm4Xs71Ma8bV1fY6Gfz9Mg3AKmiHuoJNpxeDVF_KTVOKoy1WQ%40mail.gmail.com

> Please find the proposed patch for review. I will attach it to
> commitfest as well

Pacemaker suffers the same thing. We suggest our customers that "start
server alone to perform recovery then start pacemaker if it is
expected to take a long time for recovery so that reaches time out".

I don't think it is good think to let status SERVICE_RUNNING although
it actually is not (yet). I think the right direction here is that, if
pg_ctl returns by timeout, pgwin32_ServiceMain kills the starting
server then report something like "timedout and server was stopped,
please make sure the server not to take a long time to perform
recovery.".

Thougts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Problem during Windows service start

From
Alvaro Herrera from 2ndQuadrant
Date:
On 2019-Jul-24, Kyotaro Horiguchi wrote:

> > Please find the proposed patch for review. I will attach it to
> > commitfest as well
> 
> Pacemaker suffers the same thing. We suggest our customers that "start
> server alone to perform recovery then start pacemaker if it is
> expected to take a long time for recovery so that reaches time out".
> 
> I don't think it is good think to let status SERVICE_RUNNING although
> it actually is not (yet). I think the right direction here is that, if
> pg_ctl returns by timeout, pgwin32_ServiceMain kills the starting
> server then report something like "timedout and server was stopped,
> please make sure the server not to take a long time to perform
> recovery.".

I'm not sure that's a great reaction; it makes total recovery time
even longer.  How would the user ensure that recovery takes a shorter
time?  We'd be forcing them to start the service over and over, until
recovery completes.

Can't we have pg_ctl just continue to wait indefinitely?  So we'd set
SERVICE_START_PENDING when wait_for_postmaster is out of patience, then
loop again -- until recovery completes.  Exiting pg_ctl on timeout seems
reasonable for interactive use, but maybe for service use it's not
reasonable.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problem during Windows service start

From
Michael Paquier
Date:
On Thu, Sep 05, 2019 at 07:09:45PM -0400, Alvaro Herrera from 2ndQuadrant wrote:
> Can't we have pg_ctl just continue to wait indefinitely?  So we'd set
> SERVICE_START_PENDING when wait_for_postmaster is out of patience, then
> loop again -- until recovery completes.  Exiting pg_ctl on timeout seems
> reasonable for interactive use, but maybe for service use it's not
> reasonable.

The root of the problem here is that the time recovery takes is not
something that can be guessed, and that service registering happens in
the background.  It depends on the time the last checkpoint occurred,
the load on the machine involved and the WAL operations done.  So it
seems to me that Alvaro's idea is something which we could work on for
at least HEAD.  There is also the path of providing a longer timeout,
still that's just a workaround..

My understanding is that this could be qualified as a bug because of
the fact that we require using again pg_ctl after starting the service
from the windows service control center.

So, are there plans to move on with this patch?  It is waiting on
author for some time now.
--
Michael

Attachment

Re: Problem during Windows service start

From
Michael Paquier
Date:
On Thu, Nov 07, 2019 at 12:55:13PM +0900, Michael Paquier wrote:
> So, are there plans to move on with this patch?  It is waiting on
> author for some time now.

Seeing no activity from the author or even the reviewer, I have marked
the patch as returned with feedback for now.  I am not actually fully
convinced that this should be backpatched either, so it could be done
as a future improvement.
--
Michael

Attachment