Thread: [bug fix] "pg_ctl stop" times out when it should respond quickly
Hello, I've encountered a small bug and fixed it. I guess this occurs on all major releases. I saw this happen on 9.2 and 9.4devel. Please find attached the patch and commit this. [Problem] If I mistakenly set an invalid value to listen_addresses, say '-1', and start the database server, it fails to start as follows. In my environment (RHEL6 for Intel64), it takes about 15 seconds before postgres prints the messages. This is OK. [maumau@myhost pgdata]$ pg_ctl -w start waiting for server to start........................LOG: could not translate host name "-1", service "5450" to address: Temporary failure in name resolution WARNING: could not create listen socket for "-1" FATAL: could not create any TCP/IP sockets stopped waiting pg_ctl: could not start server Examine the log output. [maumau@myhost pgdata]$ When I start the server without -w and try to stop it, "pg_ctl stop" waits for 60 seconds and timed out before it fails. This is what I'm seeing as a problem. I expected "pg_ctl stop" to respond quickly with success or failure depending on the timing. [maumau@myhost pgdata]$ pg_ctl start server starting ...(a few seconds later) [maumau@myhost ~]$ pg_ctl stop waiting for server to shut down................................................. .............. failed pg_ctl: server does not shut down HINT: The "-m fast" option immediately disconnects sessions rather than waiting for session-initiated disconnection. [maumau@myhost ~]$ [Cause] The problem occurs in the sequence below: 1. postmaster creates $PGDATA/postmaster.pid. 2. postmaster tries to resolve the value of listen_addresses to IP addresses. This took about 15 seconds in my failure scenario. 3. During 2, pg_ctl sends SIGTERM to postmaster. 4. postmaster terminates immediately without deleting $PGDATA/postmaster.pid. This is because it hasn't set signal handlers yet. 5. "pg_ctl stop" waits in a loop until $PGDATA/postmaster.pid disappears. But the file does not disappear and it times out. [Fix] Make pg_ctl check if postmaster is still alive, because postmaster might have crashed unexpectedly. Regards MauMau
Attachment
"MauMau" <maumau307@gmail.com> writes: > The problem occurs in the sequence below: > 1. postmaster creates $PGDATA/postmaster.pid. > 2. postmaster tries to resolve the value of listen_addresses to IP > addresses. This took about 15 seconds in my failure scenario. > 3. During 2, pg_ctl sends SIGTERM to postmaster. > 4. postmaster terminates immediately without deleting > $PGDATA/postmaster.pid. This is because it hasn't set signal handlers yet. > 5. "pg_ctl stop" waits in a loop until $PGDATA/postmaster.pid disappears. > But the file does not disappear and it times out. Hm. I wonder if we shouldn't block SIGTERM etc. earlier. It hardly seems improbable that such signals would arrive during a slow startup. > *** 907,913 **** > > for (cnt = 0; cnt < wait_seconds; cnt++) > { > ! if ((pid = get_pgpid()) != 0) > { > print_msg("."); > pg_usleep(1000000); /* 1 sec */ > --- 907,914 ---- > > for (cnt = 0; cnt < wait_seconds; cnt++) > { > ! if ((pid = get_pgpid()) != 0 && > ! postmaster_is_alive((pid_t) pid)) > { > print_msg("."); > pg_usleep(1000000); /* 1 sec */ If you're going to do a postmaster_is_alive check, why bother with repeated get_pgpid()? I think the reason why it was coded like that was that we hadn't written postmaster_is_alive() yet, or maybe we had but didn't want to trust it. However, with the coding you have here, we're fully exposed to any failure modes postmaster_is_alive() may have; so there's not a lot of value in accepting those and get_pgpid's failure modes too. regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > I think the reason why it was coded like that was that we hadn't written > postmaster_is_alive() yet, or maybe we had but didn't want to trust it. > However, with the coding you have here, we're fully exposed to any failure > modes postmaster_is_alive() may have; so there's not a lot of value in > accepting those and get_pgpid's failure modes too. Thank you for reviewing the patch so quickly. You are right, I don't think get_pgpid() here is no longer necessary. If the pid changes in one second, i.e. the original postgres terminates and "pg_ctl start" starts another one, "pg_ctl stop" can terminate successfully because the original postgres it was waiting for actually terminated. I'll submit the revised patch tomorrow. Regards MauMau
From: "Tom Lane" <tgl@sss.pgh.pa.us> > If you're going to do a postmaster_is_alive check, why bother with > repeated get_pgpid()? As I said yesterday, I removed get_pgpid() calls. I'll add this patch to 2014-1 commitfest this weekend if it is not committed until then. Regards MauMau
Attachment
On 12/5/13, 7:07 AM, MauMau wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> If you're going to do a postmaster_is_alive check, why bother with >> repeated get_pgpid()? > > As I said yesterday, I removed get_pgpid() calls. I'll add this patch > to 2014-1 commitfest this weekend if it is not committed until then. This patch breaks the regression tests: xml ... ok test stats ... ok ============== shutting down postmaster ============== <waits a long time> pg_ctl: server does not shut down pg_regress: could not stop postmaster: exit code was 256
From: "Peter Eisentraut" <peter_e@gmx.net> > This patch breaks the regression tests: > > xml ... ok > test stats ... ok > ============== shutting down postmaster ============== > <waits a long time> > pg_ctl: server does not shut down > > pg_regress: could not stop postmaster: exit code was 256 pg_ctl timed out waiting for the zombie postgres. maumau 19621 18849 0 15:21 pts/9 00:00:00 [postgres] <defunct> maumau 20253 18849 0 15:22 pts/9 00:00:00 /maumau/postgresql-9.4/src/test/regress/./tmp_check/install//maumau/pgsql/bin/pg_ctl stop -D /maumau/postgresql-9.4/src/test/regress/./tmp_check/data -s -m fast pg_regress must wait for postgres to terminate by calling waitpid(), because it invoked postgres directly. The attached pg_regress_pg_stop.patch does this. If you like the combination of this and the original fix for pg_ctl in one patch, please use pg_stop_fail_v3.patch. Sorry for causing trouble. Regards MauMau
Attachment
On 12/25/13, 6:40 AM, MauMau wrote: > pg_regress must wait for postgres to terminate by calling waitpid(), > because it invoked postgres directly. The attached > pg_regress_pg_stop.patch does this. If you like the combination of this > and the original fix for pg_ctl in one patch, please use > pg_stop_fail_v3.patch. This patch doesn't apply.
From: "Peter Eisentraut" <peter_e@gmx.net> > On 12/25/13, 6:40 AM, MauMau wrote: >> pg_regress must wait for postgres to terminate by calling waitpid(), >> because it invoked postgres directly. The attached >> pg_regress_pg_stop.patch does this. If you like the combination of this >> and the original fix for pg_ctl in one patch, please use >> pg_stop_fail_v3.patch. > > This patch doesn't apply. Is that true? Today, I downloaded the following file (whose timestamp was 2014-Jan-04), and could apply pg_stop_fail_v3.patch cleanly. The "make check" succeeded. http://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz Could you confirm again and tell me what problem is happening? Regards MauMau
On Sun, Jan 5, 2014 at 3:49 PM, MauMau <maumau307@gmail.com> wrote: > Could you confirm again and tell me what problem is happening? FWIW, I just quickly tested those two patches independently and got them correctly applied with patch -p1 < $PATCH on master at edc4345. They compiled and passed as well make check. Regards, -- Michael
Le mardi 7 janvier 2014 17:05:03 Michael Paquier a écrit : > On Sun, Jan 5, 2014 at 3:49 PM, MauMau <maumau307@gmail.com> wrote: > > Could you confirm again and tell me what problem is happening? > > FWIW, I just quickly tested those two patches independently and got > them correctly applied with patch -p1 < $PATCH on master at edc4345. > They compiled and passed as well make check. > Regards, Both patches apply cleanly, I'll focus on the pg_stop_fail_v3 patch. Tests are running correctly. The bug this patch is supposed to fix has been reproduced on current HEAD, and cannot be reproduced using the patch. Previous concerns about using both get_pgpid and postmaster_is_alive are adressed. There is no regression tests covering this bugfix, althought I don't know if it would be practical to implement them. -- Ronan Dunklau http://dalibo.com - http://dalibo.org
From: "Ronan Dunklau" <ronan.dunklau@dalibo.com> > There is no regression tests covering this bugfix, althought I don't know > if > it would be practical to implement them. Thanks for reviewing the patch. I'm glad to know that it seems OK. Unfortunately, the current regression test system cannot handle the testing of pg_ctl. Regards MauMau
MauMau escribió: > pg_ctl timed out waiting for the zombie postgres. > > maumau 19621 18849 0 15:21 pts/9 00:00:00 [postgres] <defunct> > maumau 20253 18849 0 15:22 pts/9 00:00:00 /maumau/postgresql-9.4/src/test/regress/./tmp_check/install//maumau/pgsql/bin/pg_ctl > stop -D /maumau/postgresql-9.4/src/test/regress/./tmp_check/data -s > -m fast > > pg_regress must wait for postgres to terminate by calling waitpid(), > because it invoked postgres directly. The attached > pg_regress_pg_stop.patch does this. If you like the combination of > this and the original fix for pg_ctl in one patch, please use > pg_stop_fail_v3.patch. The pg_regress part is ugly. However, pg_regress is doing something unusual when starting postmaster itself, so the ugly coding to stop it seems to match. If we wanted to avoid the ugliness here, the right fix would be to use pg_ctl to start postmaster as well as to stop it. But that'd come at a price, because we would need more ugly code to figure out postmaster's PID. All in all, the compromise proposed by this patch seems acceptable. If we really wanted to make all this real pretty, we could provide a "libpg_ctl" library to start and stop postmaster, as well as query the PID. Probably not worth the trouble. I would apply this patch to all supported branches after this week's release. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 18, 2014 at 1:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > MauMau escribió: > The pg_regress part is ugly. However, pg_regress is doing something > unusual when starting postmaster itself, so the ugly coding to stop it > seems to match. If we wanted to avoid the ugliness here, the right fix > would be to use pg_ctl to start postmaster as well as to stop it. But > that'd come at a price, because we would need more ugly code to figure > out postmaster's PID. All in all, the compromise proposed by this patch > seems acceptable. If we really wanted to make all this real pretty, we > could provide a "libpg_ctl" library to start and stop postmaster, as > well as query the PID. Probably not worth the trouble. This might not be worth the trouble for this bug, but actually it could be useful to many third-part tools and extensions to have a common and generic way to do things. I have seen many utilities using a copy/paste of pg_ctl functions and still maintain some of them... Regards, -- Michael
On Mon, Feb 17, 2014 at 11:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > The pg_regress part is ugly. However, pg_regress is doing something > unusual when starting postmaster itself, so the ugly coding to stop it > seems to match. If we wanted to avoid the ugliness here, the right fix > would be to use pg_ctl to start postmaster as well as to stop it. I wonder if this would change the behavior in cases where we hit ^C during the regression tests. Right now I think that kills the postmaster as well as pg_regress, but if we used pg_ctl, it might not, because pg_regress uses fork()+exec(), but pg_ctl uses system() to launch a shell which is in turn instructed to background the postmaster. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Feb 17, 2014 at 11:29 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> The pg_regress part is ugly. However, pg_regress is doing something >> unusual when starting postmaster itself, so the ugly coding to stop it >> seems to match. If we wanted to avoid the ugliness here, the right fix >> would be to use pg_ctl to start postmaster as well as to stop it. > I wonder if this would change the behavior in cases where we hit ^C > during the regression tests. Right now I think that kills the > postmaster as well as pg_regress, but if we used pg_ctl, it might not, > because pg_regress uses fork()+exec(), but pg_ctl uses system() to > launch a shell which is in turn instructed to background the > postmaster. After re-reading this thread, I am convinced that this patch is failing to fix the real problem, which is that the postmaster isn't defending against early arrival of a signal; see http://www.postgresql.org/message-id/30805.1386110129@sss.pgh.pa.us Hacking pg_ctl is a band-aid solution, and given these concerns, not a particularly safe band-aid. I experimented with simply moving postmaster.c's initialization of signals further up, and that seems to fix the complained-of problem just fine. It could be an issue if anything in postmaster initialization is expecting to run with signals unblocked, but I don't know what that would be. So I suggest we just do the attached. regards, tom lane diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 7a2c45a..bb771a6 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 562,567 **** --- 562,597 ---- getInstallationPaths(argv[0]); /* + * Set up signal handlers for the postmaster process. + * + * CAUTION: when changing this list, check for side-effects on the signal + * handling setup of child processes. See tcop/postgres.c, + * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c, + * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, + * postmaster/syslogger.c, postmaster/bgworker.c and + * postmaster/checkpointer.c. + */ + pqinitmask(); + PG_SETMASK(&BlockSig); + + pqsignal(SIGHUP, SIGHUP_handler); /* reread config file and have + * children do same */ + pqsignal(SIGINT, pmdie); /* send SIGTERM and shut down */ + pqsignal(SIGQUIT, pmdie); /* send SIGQUIT and die */ + pqsignal(SIGTERM, pmdie); /* wait for children and shut down */ + pqsignal(SIGALRM, SIG_IGN); /* ignored */ + pqsignal(SIGPIPE, SIG_IGN); /* ignored */ + pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */ + pqsignal(SIGUSR2, dummy_handler); /* unused, reserve for children */ + pqsignal(SIGCHLD, reaper); /* handle child termination */ + pqsignal(SIGTTIN, SIG_IGN); /* ignored */ + pqsignal(SIGTTOU, SIG_IGN); /* ignored */ + /* ignore SIGXFSZ, so that ulimit violations work like disk full */ + #ifdef SIGXFSZ + pqsignal(SIGXFSZ, SIG_IGN); /* ignored */ + #endif + + /* * Options setup */ InitializeGUCOptions(); *************** PostmasterMain(int argc, char *argv[]) *** 1109,1144 **** } /* - * Set up signal handlers for the postmaster process. - * - * CAUTION: when changing this list, check for side-effects on the signal - * handling setup of child processes. See tcop/postgres.c, - * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c, - * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, - * postmaster/syslogger.c, postmaster/bgworker.c and - * postmaster/checkpointer.c. - */ - pqinitmask(); - PG_SETMASK(&BlockSig); - - pqsignal(SIGHUP, SIGHUP_handler); /* reread config file and have - * children do same */ - pqsignal(SIGINT, pmdie); /* send SIGTERM and shut down */ - pqsignal(SIGQUIT, pmdie); /* send SIGQUIT and die */ - pqsignal(SIGTERM, pmdie); /* wait for children and shut down */ - pqsignal(SIGALRM, SIG_IGN); /* ignored */ - pqsignal(SIGPIPE, SIG_IGN); /* ignored */ - pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */ - pqsignal(SIGUSR2, dummy_handler); /* unused, reserve for children */ - pqsignal(SIGCHLD, reaper); /* handle child termination */ - pqsignal(SIGTTIN, SIG_IGN); /* ignored */ - pqsignal(SIGTTOU, SIG_IGN); /* ignored */ - /* ignore SIGXFSZ, so that ulimit violations work like disk full */ - #ifdef SIGXFSZ - pqsignal(SIGXFSZ, SIG_IGN); /* ignored */ - #endif - - /* * If enabled, start up syslogger collection subprocess */ SysLoggerPID = SysLogger_Start(); --- 1139,1144 ----