Thread: [bug fix] "pg_ctl stop" times out when it should respond quickly

[bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
Hello,

I've encountered a small bug and fixed it.  I guess this occurs on all major
releases.  I saw this happen on 9.2 and 9.4devel.  Please find attached the
patch and commit this.


[Problem]
If I mistakenly set an invalid value to listen_addresses, say '-1', and
start the database server, it fails to start as follows.  In my environment
(RHEL6 for Intel64), it takes about 15 seconds before postgres prints the
messages.  This is OK.

[maumau@myhost pgdata]$ pg_ctl -w start
waiting for server to start........................LOG:  could not translate
host name "-1", service "5450" to address: Temporary failure in name
resolution
WARNING:  could not create listen socket for "-1"
FATAL:  could not create any TCP/IP sockets
 stopped waiting
pg_ctl: could not start server
Examine the log output.
[maumau@myhost pgdata]$

When I start the server without -w and try to stop it, "pg_ctl stop" waits
for 60 seconds and timed out before it fails.  This is what I'm seeing as a
problem.  I expected "pg_ctl stop" to respond quickly with success or
failure depending on the timing.

[maumau@myhost pgdata]$ pg_ctl start
server starting
...(a few seconds later)
[maumau@myhost ~]$ pg_ctl stop
waiting for server to shut
down.................................................
.............. failed
pg_ctl: server does not shut down
HINT: The "-m fast" option immediately disconnects sessions rather than
waiting for session-initiated disconnection.
[maumau@myhost ~]$


[Cause]
The problem occurs in the sequence below:

1. postmaster creates $PGDATA/postmaster.pid.
2. postmaster tries to resolve the value of listen_addresses to IP
addresses.  This took about 15 seconds in my failure scenario.
3. During 2, pg_ctl sends SIGTERM to postmaster.
4. postmaster terminates immediately without deleting
$PGDATA/postmaster.pid.  This is because it hasn't set signal handlers yet.
5. "pg_ctl stop" waits in a loop until $PGDATA/postmaster.pid disappears.
But the file does not disappear and it times out.


[Fix]
Make pg_ctl check if postmaster is still alive, because postmaster might
have crashed unexpectedly.


Regards
MauMau

Attachment

Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes:
> The problem occurs in the sequence below:

> 1. postmaster creates $PGDATA/postmaster.pid.
> 2. postmaster tries to resolve the value of listen_addresses to IP 
> addresses.  This took about 15 seconds in my failure scenario.
> 3. During 2, pg_ctl sends SIGTERM to postmaster.
> 4. postmaster terminates immediately without deleting 
> $PGDATA/postmaster.pid.  This is because it hasn't set signal handlers yet.
> 5. "pg_ctl stop" waits in a loop until $PGDATA/postmaster.pid disappears. 
> But the file does not disappear and it times out.

Hm.  I wonder if we shouldn't block SIGTERM etc. earlier.  It hardly seems
improbable that such signals would arrive during a slow startup.

> *** 907,913 ****
>   
>           for (cnt = 0; cnt < wait_seconds; cnt++)
>           {
> !             if ((pid = get_pgpid()) != 0)
>               {
>                   print_msg(".");
>                   pg_usleep(1000000);        /* 1 sec */
> --- 907,914 ----
>   
>           for (cnt = 0; cnt < wait_seconds; cnt++)
>           {
> !             if ((pid = get_pgpid()) != 0 &&
> !                 postmaster_is_alive((pid_t) pid))
>               {
>                   print_msg(".");
>                   pg_usleep(1000000);        /* 1 sec */

If you're going to do a postmaster_is_alive check, why bother with
repeated get_pgpid()?

I think the reason why it was coded like that was that we hadn't written
postmaster_is_alive() yet, or maybe we had but didn't want to trust it.
However, with the coding you have here, we're fully exposed to any failure
modes postmaster_is_alive() may have; so there's not a lot of value in
accepting those and get_pgpid's failure modes too.
        regards, tom lane



Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> I think the reason why it was coded like that was that we hadn't written
> postmaster_is_alive() yet, or maybe we had but didn't want to trust it.
> However, with the coding you have here, we're fully exposed to any failure
> modes postmaster_is_alive() may have; so there's not a lot of value in
> accepting those and get_pgpid's failure modes too.

Thank you for reviewing the patch so quickly.  You are right, I don't think 
get_pgpid() here is no longer necessary.  If the pid changes in one second, 
i.e. the original postgres terminates and "pg_ctl start" starts another one, 
"pg_ctl stop" can terminate successfully because the original postgres it 
was waiting for actually terminated.

I'll submit the revised patch tomorrow.

Regards
MauMau




Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> If you're going to do a postmaster_is_alive check, why bother with
> repeated get_pgpid()?

As I said yesterday, I removed get_pgpid() calls.  I'll add this patch to
2014-1 commitfest this weekend if it is not committed until then.

Regards
MauMau

Attachment

Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Peter Eisentraut
Date:
On 12/5/13, 7:07 AM, MauMau wrote:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> If you're going to do a postmaster_is_alive check, why bother with
>> repeated get_pgpid()?
> 
> As I said yesterday, I removed get_pgpid() calls.  I'll add this patch
> to 2014-1 commitfest this weekend if it is not committed until then.

This patch breaks the regression tests:
    xml                      ... ok
test stats                    ... ok
============== shutting down postmaster               ==============
<waits a long time>
pg_ctl: server does not shut down

pg_regress: could not stop postmaster: exit code was 256




Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net>
> This patch breaks the regression tests:
>
>     xml                      ... ok
> test stats                    ... ok
> ============== shutting down postmaster               ==============
> <waits a long time>
> pg_ctl: server does not shut down
>
> pg_regress: could not stop postmaster: exit code was 256

pg_ctl timed out waiting for the zombie postgres.

maumau 19621 18849  0 15:21 pts/9    00:00:00 [postgres] <defunct>
maumau 20253 18849  0 15:22 pts/9    00:00:00
/maumau/postgresql-9.4/src/test/regress/./tmp_check/install//maumau/pgsql/bin/pg_ctl
stop -D /maumau/postgresql-9.4/src/test/regress/./tmp_check/data -s -m fast

pg_regress must wait for postgres to terminate by calling waitpid(), because
it invoked postgres directly.  The attached pg_regress_pg_stop.patch does
this.  If you like the combination of this and the original fix for pg_ctl
in one patch, please use pg_stop_fail_v3.patch.

Sorry for causing trouble.


Regards
MauMau

Attachment

Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Peter Eisentraut
Date:
On 12/25/13, 6:40 AM, MauMau wrote:
> pg_regress must wait for postgres to terminate by calling waitpid(),
> because it invoked postgres directly.  The attached
> pg_regress_pg_stop.patch does this.  If you like the combination of this
> and the original fix for pg_ctl in one patch, please use
> pg_stop_fail_v3.patch.

This patch doesn't apply.




Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net>
> On 12/25/13, 6:40 AM, MauMau wrote:
>> pg_regress must wait for postgres to terminate by calling waitpid(),
>> because it invoked postgres directly.  The attached
>> pg_regress_pg_stop.patch does this.  If you like the combination of this
>> and the original fix for pg_ctl in one patch, please use
>> pg_stop_fail_v3.patch.
>
> This patch doesn't apply.

Is that true?  Today, I downloaded the following file (whose timestamp was 
2014-Jan-04), and could apply pg_stop_fail_v3.patch cleanly.  The "make 
check" succeeded.

http://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz

Could you confirm again and tell me what problem is happening?

Regards
MauMau




Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Michael Paquier
Date:
On Sun, Jan 5, 2014 at 3:49 PM, MauMau <maumau307@gmail.com> wrote:
> Could you confirm again and tell me what problem is happening?
FWIW, I just quickly tested those two patches independently and got
them correctly applied with patch -p1 < $PATCH on master at edc4345.
They compiled and passed as well make check.
Regards,
-- 
Michael



Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Ronan Dunklau
Date:
Le mardi 7 janvier 2014 17:05:03 Michael Paquier a écrit :
> On Sun, Jan 5, 2014 at 3:49 PM, MauMau <maumau307@gmail.com> wrote:
> > Could you confirm again and tell me what problem is happening?
>
> FWIW, I just quickly tested those two patches independently and got
> them correctly applied with patch -p1 < $PATCH on master at edc4345.
> They compiled and passed as well make check.
> Regards,

Both patches apply cleanly, I'll focus on the pg_stop_fail_v3 patch.

Tests are running correctly.

The bug this patch is supposed to fix has been reproduced on current HEAD, and
cannot be reproduced using the patch.

Previous concerns about using both get_pgpid and postmaster_is_alive are
adressed.

There is no regression tests covering this bugfix, althought I don't know if
it would be practical to implement them.


--
Ronan Dunklau
http://dalibo.com - http://dalibo.org

Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
"MauMau"
Date:
From: "Ronan Dunklau" <ronan.dunklau@dalibo.com>
> There is no regression tests covering this bugfix, althought I don't know 
> if
> it would be practical to implement them.

Thanks for reviewing the patch.  I'm glad to know that it seems OK. 
Unfortunately, the current regression test system cannot handle the testing 
of pg_ctl.

Regards
MauMau





Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Alvaro Herrera
Date:
MauMau escribió:

> pg_ctl timed out waiting for the zombie postgres.
> 
> maumau 19621 18849  0 15:21 pts/9    00:00:00 [postgres] <defunct>
> maumau 20253 18849  0 15:22 pts/9    00:00:00
/maumau/postgresql-9.4/src/test/regress/./tmp_check/install//maumau/pgsql/bin/pg_ctl
> stop -D /maumau/postgresql-9.4/src/test/regress/./tmp_check/data -s
> -m fast
> 
> pg_regress must wait for postgres to terminate by calling waitpid(),
> because it invoked postgres directly.  The attached
> pg_regress_pg_stop.patch does this.  If you like the combination of
> this and the original fix for pg_ctl in one patch, please use
> pg_stop_fail_v3.patch.

The pg_regress part is ugly.  However, pg_regress is doing something
unusual when starting postmaster itself, so the ugly coding to stop it
seems to match.  If we wanted to avoid the ugliness here, the right fix
would be to use pg_ctl to start postmaster as well as to stop it.  But
that'd come at a price, because we would need more ugly code to figure
out postmaster's PID.  All in all, the compromise proposed by this patch
seems acceptable.  If we really wanted to make all this real pretty, we
could provide a "libpg_ctl" library to start and stop postmaster, as
well as query the PID.  Probably not worth the trouble.

I would apply this patch to all supported branches after this week's
release.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Michael Paquier
Date:
On Tue, Feb 18, 2014 at 1:29 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> MauMau escribió:
> The pg_regress part is ugly.  However, pg_regress is doing something
> unusual when starting postmaster itself, so the ugly coding to stop it
> seems to match.  If we wanted to avoid the ugliness here, the right fix
> would be to use pg_ctl to start postmaster as well as to stop it.  But
> that'd come at a price, because we would need more ugly code to figure
> out postmaster's PID.  All in all, the compromise proposed by this patch
> seems acceptable.  If we really wanted to make all this real pretty, we
> could provide a "libpg_ctl" library to start and stop postmaster, as
> well as query the PID.  Probably not worth the trouble.
This might not be worth the trouble for this bug, but actually it
could be useful to many third-part tools and extensions to have a
common and generic way to do things. I have seen many utilities using
a copy/paste of pg_ctl functions and still maintain some of them...
Regards,
--
Michael



Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Robert Haas
Date:
On Mon, Feb 17, 2014 at 11:29 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> The pg_regress part is ugly.  However, pg_regress is doing something
> unusual when starting postmaster itself, so the ugly coding to stop it
> seems to match.  If we wanted to avoid the ugliness here, the right fix
> would be to use pg_ctl to start postmaster as well as to stop it.

I wonder if this would change the behavior in cases where we hit ^C
during the regression tests.  Right now I think that kills the
postmaster as well as pg_regress, but if we used pg_ctl, it might not,
because pg_regress uses fork()+exec(), but pg_ctl uses system() to
launch a shell which is in turn instructed to background the
postmaster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Feb 17, 2014 at 11:29 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> The pg_regress part is ugly.  However, pg_regress is doing something
>> unusual when starting postmaster itself, so the ugly coding to stop it
>> seems to match.  If we wanted to avoid the ugliness here, the right fix
>> would be to use pg_ctl to start postmaster as well as to stop it.

> I wonder if this would change the behavior in cases where we hit ^C
> during the regression tests.  Right now I think that kills the
> postmaster as well as pg_regress, but if we used pg_ctl, it might not,
> because pg_regress uses fork()+exec(), but pg_ctl uses system() to
> launch a shell which is in turn instructed to background the
> postmaster.

After re-reading this thread, I am convinced that this patch is failing
to fix the real problem, which is that the postmaster isn't defending
against early arrival of a signal; see
http://www.postgresql.org/message-id/30805.1386110129@sss.pgh.pa.us

Hacking pg_ctl is a band-aid solution, and given these concerns,
not a particularly safe band-aid.

I experimented with simply moving postmaster.c's initialization of signals
further up, and that seems to fix the complained-of problem just fine.
It could be an issue if anything in postmaster initialization is expecting
to run with signals unblocked, but I don't know what that would be.
So I suggest we just do the attached.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a2c45a..bb771a6 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** PostmasterMain(int argc, char *argv[])
*** 562,567 ****
--- 562,597 ----
      getInstallationPaths(argv[0]);

      /*
+      * Set up signal handlers for the postmaster process.
+      *
+      * CAUTION: when changing this list, check for side-effects on the signal
+      * handling setup of child processes.  See tcop/postgres.c,
+      * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
+      * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
+      * postmaster/syslogger.c, postmaster/bgworker.c and
+      * postmaster/checkpointer.c.
+      */
+     pqinitmask();
+     PG_SETMASK(&BlockSig);
+
+     pqsignal(SIGHUP, SIGHUP_handler);    /* reread config file and have
+                                          * children do same */
+     pqsignal(SIGINT, pmdie);    /* send SIGTERM and shut down */
+     pqsignal(SIGQUIT, pmdie);    /* send SIGQUIT and die */
+     pqsignal(SIGTERM, pmdie);    /* wait for children and shut down */
+     pqsignal(SIGALRM, SIG_IGN); /* ignored */
+     pqsignal(SIGPIPE, SIG_IGN); /* ignored */
+     pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */
+     pqsignal(SIGUSR2, dummy_handler);    /* unused, reserve for children */
+     pqsignal(SIGCHLD, reaper);    /* handle child termination */
+     pqsignal(SIGTTIN, SIG_IGN); /* ignored */
+     pqsignal(SIGTTOU, SIG_IGN); /* ignored */
+     /* ignore SIGXFSZ, so that ulimit violations work like disk full */
+ #ifdef SIGXFSZ
+     pqsignal(SIGXFSZ, SIG_IGN); /* ignored */
+ #endif
+
+     /*
       * Options setup
       */
      InitializeGUCOptions();
*************** PostmasterMain(int argc, char *argv[])
*** 1109,1144 ****
      }

      /*
-      * Set up signal handlers for the postmaster process.
-      *
-      * CAUTION: when changing this list, check for side-effects on the signal
-      * handling setup of child processes.  See tcop/postgres.c,
-      * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
-      * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-      * postmaster/syslogger.c, postmaster/bgworker.c and
-      * postmaster/checkpointer.c.
-      */
-     pqinitmask();
-     PG_SETMASK(&BlockSig);
-
-     pqsignal(SIGHUP, SIGHUP_handler);    /* reread config file and have
-                                          * children do same */
-     pqsignal(SIGINT, pmdie);    /* send SIGTERM and shut down */
-     pqsignal(SIGQUIT, pmdie);    /* send SIGQUIT and die */
-     pqsignal(SIGTERM, pmdie);    /* wait for children and shut down */
-     pqsignal(SIGALRM, SIG_IGN); /* ignored */
-     pqsignal(SIGPIPE, SIG_IGN); /* ignored */
-     pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */
-     pqsignal(SIGUSR2, dummy_handler);    /* unused, reserve for children */
-     pqsignal(SIGCHLD, reaper);    /* handle child termination */
-     pqsignal(SIGTTIN, SIG_IGN); /* ignored */
-     pqsignal(SIGTTOU, SIG_IGN); /* ignored */
-     /* ignore SIGXFSZ, so that ulimit violations work like disk full */
- #ifdef SIGXFSZ
-     pqsignal(SIGXFSZ, SIG_IGN); /* ignored */
- #endif
-
-     /*
       * If enabled, start up syslogger collection subprocess
       */
      SysLoggerPID = SysLogger_Start();
--- 1139,1144 ----