Thread: Problem during Windows service start
Hi, This is my first posting to the mailing list. Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start. The Windows service status of the instance is different from actual status. We got the following situation. 1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w' 2. Start the instance from the Windows service screen. 3. After 60 seconds, the startup process fails with a timeout. Because crash recovery takes a lot of times. Then, the service status of the instance become "STOPPED", but the instance was running. It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl). PostgreSQL version : 9.5.12 Operating system : Windows Server 2012 R2 I think this is a bug. I think it has not been fixed in the latest version, is my understanding correct? If it is correct, I will fix it. Regards, SAKAI Teppei
Sakai, Teppei wrote: > This is my first posting to the mailing list. > > Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start. > The Windows service status of the instance is different from actual status. > > We got the following situation. > 1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w' > 2. Start the instance from the Windows service screen. > 3. After 60 seconds, the startup process fails with a timeout. > Because crash recovery takes a lot of times. > > Then, the service status of the instance become "STOPPED", > but the instance was running. > It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl). > > PostgreSQL version : 9.5.12 > Operating system : Windows Server 2012 R2 > > I think this is a bug. > I think it has not been fixed in the latest version, is my understanding correct? > If it is correct, I will fix it. I agree that this is not nice. How do you propose to fix it? Yours, Laurenz Albe
Hi, This thread is inactive, but I want to solve this problem. I think this problem rarely occurs in 10 or later version because of commit [1]. Because "pg_ctl start -w" wait for onlyPID file creation. It means that timeout is not occurred even if crash recovery takes a lot of times. However, 9.6 or earlier still wait for long time recovery complete. > How do you propose to fix it? I think there are two solutions. One solution is that status of Windows Service should be changed to "SERVICE_RUNNING" even if timeout is occurred becauseof long time recovery. I attached the patch of this solution. Another solution is to backport commit [1] to 9.6 or earlier version. However this fix change the content of PID file, soincompatible change, I think. I would appreciate it if you give any comments. [1] https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c Regards, Daisuke, Higuchi
Attachment
On Wed, Jan 09, 2019 at 05:28:29AM +0000, Higuchi, Daisuke wrote: > One solution is that status of Windows Service should be changed to > "SERVICE_RUNNING" even if timeout is occurred because of long time > recovery. I attached the patch of this solution. You should register this patch to the next commit fest in the section for bug fixes to not lose sight of it; https://commitfest.postgresql.org/22/ + case PQPING_NO_ATTEMPT: + write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed because of wrong parameter or something\n")); + pgwin32_SetServiceStatus(SERVICE_STOPPED); I haven't put much thoughts into what you propose here, but this status message is not really helpful for the user. > Another solution is to backport commit [1] to 9.6 or earlier > version. However this fix change the content of PID file, so > incompatible change, I think. Yeah, let's not go down that road. -- Michael
Attachment
Michael Paquier wrote: > You should register this patch to the next commit fest in the section for bug fixes to not lose sight of it; Thank you for picking up my post. I registered to the next CF. > I haven't put much thoughts into what you propose here, but this status message is not really helpful for the user. Thank you for review. I attached the updated patch. There are some reasons for server startup failing, so I changed the messagelike following: + case PQPING_NO_RESPONSE: + case PQPING_NO_ATTEMPT: + write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n")); + pgwin32_SetServiceStatus(SERVICE_STOPPED); Regards, Daisuke, Higuchi
Attachment
Hi Higuchi-san, (1) What made you think this problem rarely occurs in PG 10 or later? Looking at the following code, this seems to happen inPG 10+ too. if (do_wait) { write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n")); if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY) { write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); pgwin32_SetServiceStatus(SERVICE_STOPPED); return; } write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n")); } (2) What state should we consider SERVICE_RUNNING as? Isn't it the state where the server has completed startup processing andaccepts connections? If no, how is it different from SERVICE_STARTING? (I know that when -w (wait) is not specified, the status becomes SERVICE_RUNNING whether or not the server completes startupprocessing...) (3) + write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n")); This message is new, isn't it? I think the existing message "Time out..." is enough. (4) + write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n")); The first sentence (phenomenon) and the second line (detail or action) should be separated with a newline. Below are someexamples in pg_ctl.c. Note that write_stderr() writes to the eventlog when running under a Windows service. write_stderr(_("%s: could not start server\n" "Examine the log output.\n"), write_stderr(_("The program \"%s\" was found by \"%s\"\n" "but was not the same version as %s.\n" "Check your installation.\n"), Regards Takayuki Tsunakawa
Hi,
If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it be correct to set the status of windows service to SERVICE_START_PENDING ?
I would like to take this up if no one is working on this.
Regards,
Ram.
Hi, Thank you for picking up this and I'm sorry for delay reply. >> If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it >> be correct to set the status of windows service to SERVICE_START_PENDING ? Yes, I think this is the best. Currently, I do not have good solution to change the status from SERVICE_START_PENDING toSERVICE_RUNNING after recovery is completed. >> I would like to take this up if no one is working on this. Thank you for your proposing! I would like to ask if possible. I do not have much time to think about this topic now... Regards, Higuchi
Hi,
Please find the proposed patch for review. I will attach it to commitfest as well
Regards,
Ram.
Attachment
Hi, + case POSTMASTER_STILL_STARTING: + write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); + pgwin32_SetServiceStatus(SERVICE_START_PENDING); + return; Could this patch solve first post's problem [1] ? I think no one could change the service status to SERVICE_RUNNING even if the server has been started properly after timeoutis occurred. In 9.6 or earlier, the main use case where the problem is occurred is when timeout is occured because of long time recovery.Even if recovery takes a lot of time and timeout is ocurred, recovery continues in background. In this case, I wantto set the status to SERVICE_RUNNING after recovery is completed. In 10 or later, I understand wait_for_postmaster does not wait until recovery is completed, so I think this problem rarelyoccurs in PG 10 or later. [1] https://www.postgresql.org/message-id/99C4246E73ED1B478BBB9671718426203E37F485@G01JPEXMBKW03 Regards, Daisuke, Higuchi
Sorry in advance for link-breaking message, but the original mail was too old and gmail doesn't allow me to craft required headers to link to it. https://www.postgresql.org/message-id/CAKm4Xs71Ma8bV1fY6Gfz9Mg3AKmiHuoJNpxeDVF_KTVOKoy1WQ%40mail.gmail.com > Please find the proposed patch for review. I will attach it to > commitfest as well Pacemaker suffers the same thing. We suggest our customers that "start server alone to perform recovery then start pacemaker if it is expected to take a long time for recovery so that reaches time out". I don't think it is good think to let status SERVICE_RUNNING although it actually is not (yet). I think the right direction here is that, if pg_ctl returns by timeout, pgwin32_ServiceMain kills the starting server then report something like "timedout and server was stopped, please make sure the server not to take a long time to perform recovery.". Thougts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2019-Jul-24, Kyotaro Horiguchi wrote: > > Please find the proposed patch for review. I will attach it to > > commitfest as well > > Pacemaker suffers the same thing. We suggest our customers that "start > server alone to perform recovery then start pacemaker if it is > expected to take a long time for recovery so that reaches time out". > > I don't think it is good think to let status SERVICE_RUNNING although > it actually is not (yet). I think the right direction here is that, if > pg_ctl returns by timeout, pgwin32_ServiceMain kills the starting > server then report something like "timedout and server was stopped, > please make sure the server not to take a long time to perform > recovery.". I'm not sure that's a great reaction; it makes total recovery time even longer. How would the user ensure that recovery takes a shorter time? We'd be forcing them to start the service over and over, until recovery completes. Can't we have pg_ctl just continue to wait indefinitely? So we'd set SERVICE_START_PENDING when wait_for_postmaster is out of patience, then loop again -- until recovery completes. Exiting pg_ctl on timeout seems reasonable for interactive use, but maybe for service use it's not reasonable. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 05, 2019 at 07:09:45PM -0400, Alvaro Herrera from 2ndQuadrant wrote: > Can't we have pg_ctl just continue to wait indefinitely? So we'd set > SERVICE_START_PENDING when wait_for_postmaster is out of patience, then > loop again -- until recovery completes. Exiting pg_ctl on timeout seems > reasonable for interactive use, but maybe for service use it's not > reasonable. The root of the problem here is that the time recovery takes is not something that can be guessed, and that service registering happens in the background. It depends on the time the last checkpoint occurred, the load on the machine involved and the WAL operations done. So it seems to me that Alvaro's idea is something which we could work on for at least HEAD. There is also the path of providing a longer timeout, still that's just a workaround.. My understanding is that this could be qualified as a bug because of the fact that we require using again pg_ctl after starting the service from the windows service control center. So, are there plans to move on with this patch? It is waiting on author for some time now. -- Michael
Attachment
On Thu, Nov 07, 2019 at 12:55:13PM +0900, Michael Paquier wrote: > So, are there plans to move on with this patch? It is waiting on > author for some time now. Seeing no activity from the author or even the reviewer, I have marked the patch as returned with feedback for now. I am not actually fully convinced that this should be backpatched either, so it could be done as a future improvement. -- Michael