Thread: [GENERAL] Postgres 10.1 fails to start: server did not start in time

[GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Adam Brusselback
Date:
Hey all,
I am in the process of upgrading to Postgres 10, and am having trouble
getting my streaming replica working.

OS: Debian 9.2
Version: 10.1

I have my primary backed up using pgbackrest, and I restore that to my
replica.  It generates a recovery.conf which has a restore command for
the WAL to pull them from the pgbackrest server.

The recovery succeeds, but when I go to start the cluster on the
standby, it begins to replay the WAL, and does so for about 30
seconds.  Then I get a line in my log saying:

> pg_ctl: server did not start in time

Followed by:

> 2017-11-10 20:27:35.907 UTC [7132] LOG:  received smart shutdown request
> ERROR [063]: : terminated on signal [SIGTERM]
> 2017-11-10 20:27:35.911 P00   INFO: archive-get command end: terminated on signal [SIGTERM]
ERROR [063]: : terminated on signal [SIGTERM]
> 2017-11-10 20:27:35.912 P00   INFO: archive-get command end: terminated on signal [SIGTERM]
> 2017-11-10 20:27:35.978 UTC [7142] LOG:  shutting down
> 2017-11-10 20:27:36.151 UTC [7132] LOG:  database system is shut down

This happens weather I have the server configured as a standby or not.

Any help would be very appreciated.

Thanks,
-Adam


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Tom Lane
Date:
Adam Brusselback <adambrusselback@gmail.com> writes:
> I am in the process of upgrading to Postgres 10, and am having trouble
> getting my streaming replica working.
> OS: Debian 9.2
> Version: 10.1
> I have my primary backed up using pgbackrest, and I restore that to my
> replica.  It generates a recovery.conf which has a restore command for
> the WAL to pull them from the pgbackrest server.
> The recovery succeeds, but when I go to start the cluster on the
> standby, it begins to replay the WAL, and does so for about 30
> seconds.  Then I get a line in my log saying:

>> pg_ctl: server did not start in time

You might want to increase pg_ctl's wait timeout for this situation,
since the default's evidently too little.  However ...

> Followed by:
>> 2017-11-10 20:27:35.907 UTC [7132] LOG:  received smart shutdown request
>> ERROR [063]: : terminated on signal [SIGTERM]

... pg_ctl itself wouldn't decide to forcibly shut down the server
if the timeout expired.  It merely stops waiting and tells you so.
It seems like this must represent misdesign of whatever start script
you're using.  I think you need to complain to the Debian packagers
about that.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Adam Brusselback
Date:
> You might want to increase pg_ctl's wait timeout for this situation,
> since the default's evidently too little.  However ...
Got it, thanks.

> ... pg_ctl itself wouldn't decide to forcibly shut down the server
> if the timeout expired.  It merely stops waiting and tells you so.
> It seems like this must represent misdesign of whatever start script
> you're using.  I think you need to complain to the Debian packagers
> about that.

Hmm, interesting.  I installed from this repo:
http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main

I was assuming someone in the Postgres project was involved in
packaging it.  Do you know who I should reach out to in that case?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Joe Conway
Date:
On 11/10/2017 01:01 PM, Adam Brusselback wrote:
>> You might want to increase pg_ctl's wait timeout for this situation,
>> since the default's evidently too little.  However ...
> Got it, thanks.
>
>> ... pg_ctl itself wouldn't decide to forcibly shut down the server
>> if the timeout expired.  It merely stops waiting and tells you so.
>> It seems like this must represent misdesign of whatever start script
>> you're using.  I think you need to complain to the Debian packagers
>> about that.
>
> Hmm, interesting.  I installed from this repo:
> http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main
>
> I was assuming someone in the Postgres project was involved in
> packaging it.  Do you know who I should reach out to in that case?

I'm pretty sure Christoph Berg is at least involved. I cc'd him here in
case he misses this thread.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Tom Lane
Date:
Adam Brusselback <adambrusselback@gmail.com> writes:
>> You might want to increase pg_ctl's wait timeout for this situation,
>> since the default's evidently too little.  However ...

> Got it, thanks.

>> ... pg_ctl itself wouldn't decide to forcibly shut down the server
>> if the timeout expired.  It merely stops waiting and tells you so.
>> It seems like this must represent misdesign of whatever start script
>> you're using.  I think you need to complain to the Debian packagers
>> about that.

> Hmm, interesting.  I installed from this repo:
> http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main
> I was assuming someone in the Postgres project was involved in
> packaging it.  Do you know who I should reach out to in that case?

Christoph's probably a good place to start.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Christoph Berg
Date:
Re: Tom Lane 2017-11-10 <8027.1510347112@sss.pgh.pa.us>
> > The recovery succeeds, but when I go to start the cluster on the
> > standby, it begins to replay the WAL, and does so for about 30
> > seconds.  Then I get a line in my log saying:
> 
> >> pg_ctl: server did not start in time

Hi Adam,

how did you start the server? Via pg_ctlcluster, the init system, or
directly via pg_ctl?

> > Followed by:
> >> 2017-11-10 20:27:35.907 UTC [7132] LOG:  received smart shutdown request
> >> ERROR [063]: : terminated on signal [SIGTERM]
> 
> ... pg_ctl itself wouldn't decide to forcibly shut down the server
> if the timeout expired.  It merely stops waiting and tells you so.
> It seems like this must represent misdesign of whatever start script
> you're using.  I think you need to complain to the Debian packagers
> about that.

pg_ctlcluster doesn't shut down if startup fails, but to be sure, we'd
need to see the full log of whatever initiated the startup. If you are
using systemd, what does `systemctl status postgresql@10-main` report?
If that doesn't have anything, also check journalctl.

Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Adam Brusselback
Date:
Hey Christoph, I tried starting it with init (service postgresql
start), and pg_ctlcluster.

I modified the pg_ctl.conf and set the timeout higher so I could just
get my cluster back up and running properly, so I can't give you the
info on what systemctl status says at the moment.


On Sat, Nov 11, 2017 at 8:23 AM, Christoph Berg <myon@debian.org> wrote:
> Re: Tom Lane 2017-11-10 <8027.1510347112@sss.pgh.pa.us>
>> > The recovery succeeds, but when I go to start the cluster on the
>> > standby, it begins to replay the WAL, and does so for about 30
>> > seconds.  Then I get a line in my log saying:
>>
>> >> pg_ctl: server did not start in time
>
> Hi Adam,
>
> how did you start the server? Via pg_ctlcluster, the init system, or
> directly via pg_ctl?
>
>> > Followed by:
>> >> 2017-11-10 20:27:35.907 UTC [7132] LOG:  received smart shutdown request
>> >> ERROR [063]: : terminated on signal [SIGTERM]
>>
>> ... pg_ctl itself wouldn't decide to forcibly shut down the server
>> if the timeout expired.  It merely stops waiting and tells you so.
>> It seems like this must represent misdesign of whatever start script
>> you're using.  I think you need to complain to the Debian packagers
>> about that.
>
> pg_ctlcluster doesn't shut down if startup fails, but to be sure, we'd
> need to see the full log of whatever initiated the startup. If you are
> using systemd, what does `systemctl status postgresql@10-main` report?
> If that doesn't have anything, also check journalctl.
>
> Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Christoph Berg
Date:
Re: Adam Brusselback 2017-11-11 <CAMjNa7c5shJPR9sDz=VwNZz1LoTTi=dhO1CUCDJL=u8+WLL5pg@mail.gmail.com>
> Hey Christoph, I tried starting it with init (service postgresql
> start), and pg_ctlcluster.
> 
> I modified the pg_ctl.conf and set the timeout higher so I could just
> get my cluster back up and running properly, so I can't give you the
> info on what systemctl status says at the moment.

It looks like it's interference from systemd here. The problem is
easily reproduced by putting '-t 0' into pg_ctl.conf:

● postgresql@10-main.service - PostgreSQL Cluster 10-main  Loaded: loaded (/lib/systemd/system/postgresql@.service;
enabled;vendor preset: enabled)  Active: failed (Result: exit-code) since Sat 2017-11-11 21:10:56 CET; 9ms ago Process:
17946ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 10-main stop (code=exited, status=1/FAILURE)
Process:18000 ExecStart=postgresql@10-main --skip-systemctl-redirect 10-main start (code=exited, status=1/FAILURE)Main
PID:17878 (code=exited, status=0/SUCCESS)
 

Nov 11 21:10:55 lehmann systemd[1]: Starting PostgreSQL Cluster 10-main...
Nov 11 21:10:56 lehmann postgresql@10-main[18000]: Error: /usr/lib/postgresql/10/bin/pg_ctl
/usr/lib/postgresql/10/bin/pg_ctlstart -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -t
0-s -o  -c config_file="/etc/postgresql/10/main/postgresql.conf"  exited with status 1:
 
Nov 11 21:10:56 lehmann postgresql@10-main[18000]: pg_ctl: server did not start in time
Nov 11 21:10:56 lehmann systemd[1]: postgresql@10-main.service: Control process exited, code=exited status=1
Nov 11 21:10:56 lehmann systemd[1]: Failed to start PostgreSQL Cluster 10-main.
Nov 11 21:10:56 lehmann systemd[1]: postgresql@10-main.service: Unit entered failed state.
Nov 11 21:10:56 lehmann systemd[1]: postgresql@10-main.service: Failed with result 'exit-code'.

In other words, systemd will by default stop a service that fails to
start.

I'm investigating if it's a good idea to tell systemd to ignore the
exit code of pg_ctl(cluster). Possibly moving to Type=notify is the
best solution, but not all majors support that yet.

Will report back once I have a solution.

Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Christoph Berg
Date:
Re: To Adam Brusselback 2017-11-11 <20171111205316.u56lkmkakdmcx6zm@msg.df7cb.de>
> I'm investigating if it's a good idea to tell systemd to ignore the
> exit code of pg_ctl(cluster).

Telling systemd to ignore ExecStart errors seems to be the correct
solution. The service will still be active, with the startup error
being shown:

● postgresql@10-main.service - PostgreSQL Cluster 10-main  Loaded: loaded (/lib/systemd/system/postgresql@.service;
enabled;vendor preset: enabled)  Active: active (running) since Sun 2017-11-12 13:24:21 CET; 210ms ago Process: 31892
ExecStop=/usr/bin/pg_ctlcluster--skip-systemctl-redirect -m fast 10-main stop (code=exited, status=0/SUCCESS) Process:
31922ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 10-main start (code=exited, status=1/FAILURE)Main PID:
31928(postgres)   Tasks: 8 (limit: 4915)  CGroup: /system.slice/system-postgresql.slice/postgresql@10-main.service
   ├─31928 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main -c
config_file=/etc/postgresql/10/main/postgresql.conf         ├─31931 postgres: 10/main: checkpointer process
├─31932postgres: 10/main: writer process          ├─31933 postgres: 10/main: wal writer process          ├─31934
postgres:10/main: autovacuum launcher process          ├─31935 postgres: 10/main: archiver process          ├─31936
postgres:10/main: stats collector process          └─31937 postgres: 10/main: bgworker: logical replication launcher
 

Nov 12 13:24:20 lehmann systemd[1]: Starting PostgreSQL Cluster 10-main...
Nov 12 13:24:21 lehmann postgresql@10-main[31922]: Error: /usr/lib/postgresql/10/bin/pg_ctl
/usr/lib/postgresql/10/bin/pg_ctlstart -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -t
0-s -o  -c config_file="/etc/postgresql/10/main/postgresql.conf"  exited with status 1:
 
Nov 12 13:24:21 lehmann postgresql@10-main[31922]: pg_ctl: server did not start in time
Nov 12 13:24:21 lehmann systemd[1]: postgresql@10-main.service: PID file /var/run/postgresql/10-main.pid not readable
(yet?)after start: No such file or directory
 
Nov 12 13:24:21 lehmann systemd[1]: Started PostgreSQL Cluster 10-main.

Fixed in
https://anonscm.debian.org/cgit/pkg-postgresql/postgresql-common.git/commit/?id=fc57e655c71e8f6bcb3010b054f5adbf32a224d7
,thanks for the report!
 

Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
"Peter J. Holzer"
Date:
On 2017-11-12 13:26:58 +0100, Christoph Berg wrote:
> Re: To Adam Brusselback 2017-11-11 <20171111205316.u56lkmkakdmcx6zm@msg.df7cb.de>
> > I'm investigating if it's a good idea to tell systemd to ignore the
> > exit code of pg_ctl(cluster).
>
> Telling systemd to ignore ExecStart errors seems to be the correct
> solution. The service will still be active, with the startup error
> being shown:

Wouldn't it be better to remove the timeout? If some other service
depends on PostgreSQL it probably shouldn't be startet until PostgreSQL
is really up and services which don't need PostgreSQL (e.g. SSH or X11
login or a web- or mail server) shouldn't depend on it.

One of the purported advantages of systemd over SystemV init is that it
starts up services in parallel, so a service which takes a long (or
infinite) time to start doesn't block other services.
       hp

--   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Christoph Berg
Date:
Re: Peter J. Holzer 2017-11-12 <20171112173559.m6chmbyf4vz6fu3c@hjp.at>
> Wouldn't it be better to remove the timeout? If some other service
> depends on PostgreSQL it probably shouldn't be startet until PostgreSQL
> is really up and services which don't need PostgreSQL (e.g. SSH or X11
> login or a web- or mail server) shouldn't depend on it.
> 
> One of the purported advantages of systemd over SystemV init is that it
> starts up services in parallel, so a service which takes a long (or
> infinite) time to start doesn't block other services.

If you don't want to block, don't depend on the database service. That
question is independent from the timeout.

Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Tom Lane
Date:
Christoph Berg <myon@debian.org> writes:
> Re: Peter J. Holzer 2017-11-12 <20171112173559.m6chmbyf4vz6fu3c@hjp.at>
>> Wouldn't it be better to remove the timeout?

> If you don't want to block, don't depend on the database service. That
> question is independent from the timeout.

Agreed, but I think Peter has a point: why is there a timeout at all,
let alone one as short as 30 seconds?  Since systemd doesn't serialize
service starts unnecessarily, there seems little value in giving up
quickly.  And we know that cases such as crash recovery may take more
than that.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Christoph Berg
Date:
Re: Tom Lane 2017-11-12 <20802.1510513605@sss.pgh.pa.us>
> Agreed, but I think Peter has a point: why is there a timeout at all,
> let alone one as short as 30 seconds?  Since systemd doesn't serialize
> service starts unnecessarily, there seems little value in giving up
> quickly.  And we know that cases such as crash recovery may take more
> than that.

The default systemd timeout seems to be 90s. I have already changed
the systemd timeout to infinity (start) and 1h (stop), so only the
default pg_ctl timeout remains (60s), which I'd rather not override
unilaterally.

https://anonscm.debian.org/cgit/pkg-postgresql/postgresql-common.git/tree/systemd/postgresql@.service#n18

That said, isn't 60s way too small for shutting down larger clusters?
And likewise for starting?

Christoph


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Tom Lane
Date:
Christoph Berg <myon@debian.org> writes:
> The default systemd timeout seems to be 90s. I have already changed
> the systemd timeout to infinity (start) and 1h (stop), so only the
> default pg_ctl timeout remains (60s), which I'd rather not override
> unilaterally.

> That said, isn't 60s way too small for shutting down larger clusters?
> And likewise for starting?

Well, that's tied into the fact that pg_ctl doesn't disturb the server's
state if it gives up waiting.  If it did, we would certainly use a larger
timeout or none at all.  I don't feel a big need to change that default,
but if you have a surrounding script that is going to take adverse action
after a timeout then you need to use a larger value ...
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start intime

From
Andres Freund
Date:
On 2017-11-12 14:26:42 -0500, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > The default systemd timeout seems to be 90s. I have already changed
> > the systemd timeout to infinity (start) and 1h (stop), so only the
> > default pg_ctl timeout remains (60s), which I'd rather not override
> > unilaterally.
> 
> > That said, isn't 60s way too small for shutting down larger clusters?
> > And likewise for starting?
> 
> Well, that's tied into the fact that pg_ctl doesn't disturb the server's
> state if it gives up waiting.  If it did, we would certainly use a larger
> timeout or none at all.

Hm. So partially that's also related to the fact that we didn't have a
good way to know whether the server reacted to the shutdown request or
not. With the infrastructure from

commit f13ea95f9e473a43ee4e1baeb94daaf83535d37c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2017-06-28 17:31:24 -0400
   Change pg_ctl to detect server-ready by watching status in postmaster.pid.

we could really do better than just wonder whether our signal to
shutdown was received or not.  There probably should be a quite short
timeout for the server to change status, and then a much longer one for
that shutdown to finish.


> I don't feel a big need to change that default,
> but if you have a surrounding script that is going to take adverse action
> after a timeout then you need to use a larger value ...

Didn't we have to fiddle with this a bunch in the regression tests, to
get things to work properly on slow animals? If we had to do that, other
people had to do so as well. Not the friendliest experience...

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> we could really do better than just wonder whether our signal to
> shutdown was received or not.  There probably should be a quite short
> timeout for the server to change status, and then a much longer one for
> that shutdown to finish.

While I don't want to just raise the timeout, I could get behind a more
thorough rethinking of the behavior there.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general