Thread: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
My AIX buildfarm members have failed the BinInstallCheck step on and off since inception. It became more frequent when I added animals sungazer and tern alongside the older hornet and mandrill. The animals share a machine with each other and with dozens of other developers. I setpriority() the animals to the lowest available priority, so they probably lose the CPU for long periods. Separately, this machine has slow filesystem metadata operations. For example, git-new-workdir takes ~50s for a PostgreSQL tree. The pg_rewind suite has failed a few times when crash recovery took longer than the 60s pg_ctl default timeout. Disabling fsync (commit 7d7a103) reduced median crash recovery time by 75%, which may suffice. If not, I'll be inclined to add --timeout=900 to each pg_ctl invocation. The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds". You can reproduce that by adding "sleep 3;" between that test and the one before it. The timing dependency comes from the pg_ctl "slop" time: /* * Make sanity checks. If it's for a standalone backend * (negative PID),or the recorded start time is before * pg_ctl started, then either we are looking at the wrong * data directory, or this is a pre-existing pidfile that * hasn't (yet?) been overwritten by ourchild postmaster. * Allow 2 seconds slop for possible cross-process clock * skew. */ The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug that we'd ideally fix as described in the last paragraph of the commit 3c485ca log message: All of this could be improved if we rewrote start_postmaster() so that it could report the child postmaster's PID, sothat we'd know a-priori the correct PID to test with postmaster_is_alive(). That looks like a bit too much changefor so late in the 9.1 development cycle, unfortunately. I recommend we invert the test expectation and, pending the ideal pg_ctl fix, add the "sleep 3" to avoid falling within the time slop: --- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -35,6 +35,7 @@ close CONF;command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], - 'second pg_ctl start succeeds'); +sleep 3; # bridge test_postmaster_connection() slop threshold +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], + 'second pg_ctl start fails');command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctlstop -w'); Alternately, I could just remove the test. crake failed the same way, once: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06
On 2015-09-03 02:25:00 -0400, Noah Misch wrote: > --- a/src/bin/pg_ctl/t/001_start_stop.pl > +++ b/src/bin/pg_ctl/t/001_start_stop.pl > @@ -35,6 +35,7 @@ close CONF; > command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > 'pg_ctl start -w'); > -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > - 'second pg_ctl start succeeds'); > +sleep 3; # bridge test_postmaster_connection() slop threshold > +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > + 'second pg_ctl start fails'); > command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], > 'pg_ctl stop -w'); I'don't like adding a couple seconds of test runtime for the benefit of very slow platforms. The second pg_ctl start doesn't seem to test something very interesting. I'm inclined to just remove it. I'm not caffeinated sufficiently, but afaics that ought to sidestep the issue as stop doesn't depend on the slop time? > crake failed the same way, once: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06 Sounds like an actual production hazard too. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2015-09-03 02:25:00 -0400, Noah Misch wrote: >> --- a/src/bin/pg_ctl/t/001_start_stop.pl >> +++ b/src/bin/pg_ctl/t/001_start_stop.pl >> @@ -35,6 +35,7 @@ close CONF; >> command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> 'pg_ctl start -w'); >> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> - 'second pg_ctl start succeeds'); >> +sleep 3; # bridge test_postmaster_connection() slop threshold >> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> + 'second pg_ctl start fails'); >> command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], >> 'pg_ctl stop -w'); > I'don't like adding a couple seconds of test runtime for the benefit of > very slow platforms. Me either. This is the first time I've seen an indication that the start_postmaster change mentioned in the comment is actually important for production use, rather than just being cleanup. I think we ought to just fix it. I'm willing to take care of the Unix side if someone will explain how to change the Windows side. regards, tom lane
I wrote: > Andres Freund <andres@anarazel.de> writes: >> I'don't like adding a couple seconds of test runtime for the benefit of >> very slow platforms. > Me either. This is the first time I've seen an indication that the > start_postmaster change mentioned in the comment is actually important > for production use, rather than just being cleanup. I think we ought > to just fix it. I'm willing to take care of the Unix side if someone > will explain how to change the Windows side. Attached is a draft patch for this. I think it's fine for Unix (unless someone wants to object to relying on "/bin/sh -c"), but I have no idea whether it works for Windows. The main risk is that if CMD.EXE runs the postmaster as a subprocess rather than overlaying itself a la shell "exec", the PID we'd get back would apply only to CMD.EXE not to the eventual postmaster. If so, I do not know how to fix that, or whether it's fixable at all. Note that this makes the test case in question fail reliably, which is reasonable behavior IMO so I just changed the test. If this does (or can be made to) work on Windows, I'd propose applying it back to 9.2, where the current coding came in. regards, tom lane diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 6a36d29..f0025d7 100644 *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *************** *** 28,33 **** --- 28,34 ---- #include <time.h> #include <sys/types.h> #include <sys/stat.h> + #include <sys/wait.h> #include <unistd.h> #ifdef HAVE_SYS_RESOURCE_H *************** static int CreateRestrictedProcess(char *** 153,162 **** static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static int start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(bool); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) --- 154,163 ---- static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static pgpid_t start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) *************** free_readfile(char **optlines) *** 419,454 **** * start/test/stop routines */ ! static int start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 /* * Since there might be quotes to handle here, it is easier simply to pass * everything to a shell to process them. - * - * XXX it would be better to fork and exec so that we would know the child - * postmaster's PID directly; then test_postmaster_connection could use - * the PID without having to rely on reading it back from the pidfile. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL); ! return system(cmd); #else /* WIN32 */ /* ! * On win32 we don't use system(). So we don't need to use & (which would ! * be START /B on win32). However, we still call the shell (CMD.EXE) with ! * it to handle redirection etc. */ PROCESS_INFORMATION pi; --- 420,479 ---- * start/test/stop routines */ ! static pgpid_t start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 + pgpid_t pm_pid; + + /* Flush stdio channels just before fork, to avoid double-output problems */ + fflush(stdout); + fflush(stderr); + + pm_pid = fork(); + if (pm_pid < 0) + { + /* fork failed */ + write_stderr(_("%s: could not start server: %s\n"), + progname, strerror(errno)); + exit(1); + } + if (pm_pid > 0) + { + /* fork succeeded, in parent */ + return pm_pid; + } + + /* fork succeeded, in child */ /* * Since there might be quotes to handle here, it is easier simply to pass * everything to a shell to process them. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL); ! (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL); ! ! /* exec failed */ ! write_stderr(_("%s: could not start server: %s\n"), ! progname, strerror(errno)); ! exit(1); ! ! return 0; /* keep dumb compilers quiet */ ! #else /* WIN32 */ /* ! * As with the Unix case, it's easiest to use the shell (CMD.EXE) to ! * handle redirection etc. */ PROCESS_INFORMATION pi; *************** start_postmaster(void) *** 460,469 **** exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, &pi, false)) ! return GetLastError(); CloseHandle(pi.hProcess); CloseHandle(pi.hThread); ! return 0; #endif /* WIN32 */ } --- 485,498 ---- exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, &pi, false)) ! { ! write_stderr(_("%s: could not start server: error code %lu\n"), ! progname, (unsigned long) GetLastError()); ! exit(1); ! } CloseHandle(pi.hProcess); CloseHandle(pi.hThread); ! return pi.dwProcessId; #endif /* WIN32 */ } *************** start_postmaster(void) *** 476,486 **** * manager checkpoint, it's got nothing to do with database checkpoints!! */ static PGPing ! test_postmaster_connection(bool do_checkpoint) { PGPing ret = PQPING_NO_RESPONSE; - bool found_stale_pidfile = false; - pgpid_t pm_pid = 0; char connstr[MAXPGPATH * 2 + 256]; int i; --- 505,513 ---- * manager checkpoint, it's got nothing to do with database checkpoints!! */ static PGPing ! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint) { PGPing ret = PQPING_NO_RESPONSE; char connstr[MAXPGPATH * 2 + 256]; int i; *************** test_postmaster_connection(bool do_check *** 535,563 **** optlines[5] != NULL) { /* File is complete enough for us, parse it */ ! long pmpid; time_t pmstart; /* ! * Make sanity checks. If it's for a standalone backend ! * (negative PID), or the recorded start time is before ! * pg_ctl started, then either we are looking at the wrong ! * data directory, or this is a pre-existing pidfile that ! * hasn't (yet?) been overwritten by our child postmaster. ! * Allow 2 seconds slop for possible cross-process clock ! * skew. */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); ! if (pmpid <= 0 || pmstart < start_time - 2) ! { ! /* ! * Set flag to report stale pidfile if it doesn't get ! * overwritten before we give up waiting. ! */ ! found_stale_pidfile = true; ! } ! else { /* * OK, seems to be a valid pidfile from our child. --- 562,581 ---- optlines[5] != NULL) { /* File is complete enough for us, parse it */ ! pgpid_t pmpid; time_t pmstart; /* ! * Make sanity checks. If it's for the wrong PID, or the ! * recorded start time is before pg_ctl started, then ! * either we are looking at the wrong data directory, or ! * this is a pre-existing pidfile that hasn't (yet?) been ! * overwritten by our child postmaster. Allow 2 seconds ! * slop for possible cross-process clock skew. */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); ! if (pmpid == pm_pid && pmstart >= start_time - 2) { /* * OK, seems to be a valid pidfile from our child. *************** test_postmaster_connection(bool do_check *** 567,575 **** char *hostaddr; char host_str[MAXPGPATH]; - found_stale_pidfile = false; - pm_pid = (pgpid_t) pmpid; - /* * Extract port number and host string to use. Prefer * using Unix socket if available. --- 585,590 ---- *************** test_postmaster_connection(bool do_check *** 635,675 **** } /* ! * The postmaster should create postmaster.pid very soon after being ! * started. If it's not there after we've waited 5 or more seconds, ! * assume startup failed and give up waiting. (Note this covers both ! * cases where the pidfile was never created, and where it was created ! * and then removed during postmaster exit.) Also, if there *is* a ! * file there but it appears stale, issue a suitable warning and give ! * up waiting. */ ! if (i >= 5) { ! struct stat statbuf; ! ! if (stat(pid_file, &statbuf) != 0) ! { ! if (errno != ENOENT) ! write_stderr(_("\n%s: could not stat file \"%s\": %s\n"), ! progname, pid_file, strerror(errno)); ! return PQPING_NO_RESPONSE; ! } ! if (found_stale_pidfile) ! { ! write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"), ! progname); ! return PQPING_NO_RESPONSE; ! } } /* ! * If we've been able to identify the child postmaster's PID, check ! * the process is still alive. This covers cases where the postmaster ! * successfully created the pidfile but then crashed without removing ! * it. */ ! if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid)) return PQPING_NO_RESPONSE; /* No response, or startup still in process; wait */ --- 650,671 ---- } /* ! * Reap child process if it died; if it remains in zombie state then ! * our kill-based test will think it's still running. */ ! #ifndef WIN32 { ! int exitstatus; ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG); } + #endif /* ! * Check whether the child postmaster process is still alive. This ! * lets us exit early if the postmaster fails during startup. */ ! if (!postmaster_is_alive((pid_t) pm_pid)) return PQPING_NO_RESPONSE; /* No response, or startup still in process; wait */ *************** static void *** 836,842 **** do_start(void) { pgpid_t old_pid = 0; ! int exitcode; if (ctl_command != RESTART_COMMAND) { --- 832,838 ---- do_start(void) { pgpid_t old_pid = 0; ! pgpid_t pm_pid; if (ctl_command != RESTART_COMMAND) { *************** do_start(void) *** 876,894 **** } #endif ! exitcode = start_postmaster(); ! if (exitcode != 0) ! { ! write_stderr(_("%s: could not start server: exit code was %d\n"), ! progname, exitcode); ! exit(1); ! } if (do_wait) { print_msg(_("waiting for server to start...")); ! switch (test_postmaster_connection(false)) { case PQPING_OK: print_msg(_(" done\n")); --- 872,884 ---- } #endif ! pm_pid = start_postmaster(); if (do_wait) { print_msg(_("waiting for server to start...")); ! switch (test_postmaster_connection(pm_pid, false)) { case PQPING_OK: print_msg(_(" done\n")); *************** pgwin32_ServiceMain(DWORD argc, LPTSTR * *** 1585,1591 **** if (do_wait) { write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n")); ! if (test_postmaster_connection(true) != PQPING_OK) { write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); pgwin32_SetServiceStatus(SERVICE_STOPPED); --- 1575,1581 ---- if (do_wait) { write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n")); ! if (test_postmaster_connection(postmasterPID, true) != PQPING_OK) { write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); pgwin32_SetServiceStatus(SERVICE_STOPPED); *************** pgwin32_ServiceMain(DWORD argc, LPTSTR * *** 1606,1613 **** { /* * status.dwCheckPoint can be incremented by ! * test_postmaster_connection(true), so it might not start ! * from 0. */ int maxShutdownCheckPoint = status.dwCheckPoint + 12;; --- 1596,1602 ---- { /* * status.dwCheckPoint can be incremented by ! * test_postmaster_connection(), so it might not start from 0. */ int maxShutdownCheckPoint = status.dwCheckPoint + 12;; diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index dae47a8..2f93aa0 100644 *** a/src/bin/pg_ctl/t/001_start_stop.pl --- b/src/bin/pg_ctl/t/001_start_stop.pl *************** else *** 32,39 **** close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); ! command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], ! 'second pg_ctl start succeeds'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w'); command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], --- 32,39 ---- close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); ! command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], ! 'second pg_ctl start fails'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w'); command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
On 09/03/2015 03:31 PM, Tom Lane wrote: > I wrote: >> Andres Freund <andres@anarazel.de> writes: >>> I'don't like adding a couple seconds of test runtime for the benefit of >>> very slow platforms. >> Me either. This is the first time I've seen an indication that the >> start_postmaster change mentioned in the comment is actually important >> for production use, rather than just being cleanup. I think we ought >> to just fix it. I'm willing to take care of the Unix side if someone >> will explain how to change the Windows side. > Attached is a draft patch for this. I think it's fine for Unix (unless > someone wants to object to relying on "/bin/sh -c"), but I have no idea > whether it works for Windows. The main risk is that if CMD.EXE runs > the postmaster as a subprocess rather than overlaying itself a la shell > "exec", the PID we'd get back would apply only to CMD.EXE not to the > eventual postmaster. If so, I do not know how to fix that, or whether > it's fixable at all. > > Note that this makes the test case in question fail reliably, which is > reasonable behavior IMO so I just changed the test. > > If this does (or can be made to) work on Windows, I'd propose applying > it back to 9.2, where the current coding came in. There is no equivalent of execl, nor a cmd.exe exquivalent of the shell's exec. But surely the equivalent of the fork/execl you're doing here would be a simple CreateProcess(). I don't see why you need a shell in the middle on Windows at all. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > There is no equivalent of execl, nor a cmd.exe exquivalent of the > shell's exec. But surely the equivalent of the fork/execl you're doing > here would be a simple CreateProcess(). I don't see why you need a shell > in the middle on Windows at all. The problem is to get the output redirection to work. I imagine it could be rewritten without involving the shell, but it'd be less than trivial (or at least, I'm not going to do it). If we did get that to work, it could probably be applied to the service-start case as well, which currently just does a CreateProcess and pays no attention to what happens to postmaster stderr. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > There is no equivalent of execl, nor a cmd.exe exquivalent of the > > shell's exec. But surely the equivalent of the fork/execl you're doing > > here would be a simple CreateProcess(). I don't see why you need a shell > > in the middle on Windows at all. > > The problem is to get the output redirection to work. I imagine it could > be rewritten without involving the shell, but it'd be less than trivial > (or at least, I'm not going to do it). In the CreateProcess model, I think you can use startupInfo.hStdOutput, see https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > I wrote: > > This is the first time I've seen an indication that the > > start_postmaster change mentioned in the comment is actually important > > for production use, rather than just being cleanup. I scratched my head awhile without thinking of a credible production use case that would notice the difference. Which one did you have in mind? > Note that this makes the test case in question fail reliably, which is > reasonable behavior IMO so I just changed the test. That's the better behavior, agreed.
Noah Misch <noah@leadboat.com> writes: > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: >>> This is the first time I've seen an indication that the >>> start_postmaster change mentioned in the comment is actually important >>> for production use, rather than just being cleanup. > I scratched my head awhile without thinking of a credible production use case > that would notice the difference. Which one did you have in mind? Well, mainly that it's making our regression tests fail, which suggests behavioral instability that could be important for production. Aside from the specific case you diagnosed, it's clear from buildfarm results that the five-second timeout elsewhere in test_postmaster_connection has got funny behavior under load; there are way too many cases where pg_ctl gives up after exactly that long, with no useful information printed. The draft patch I posted gets rid of that behavior ... regards, tom lane
On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > >>> This is the first time I've seen an indication that the > >>> start_postmaster change mentioned in the comment is actually important > >>> for production use, rather than just being cleanup. > > > I scratched my head awhile without thinking of a credible production use case > > that would notice the difference. Which one did you have in mind? > > Well, mainly that it's making our regression tests fail, which suggests > behavioral instability that could be important for production. I doubt issuing two "pg_ctl start" within two seconds, with no intervening stop, has a production use. However, ... > Aside from the specific case you diagnosed, it's clear from buildfarm > results that the five-second timeout elsewhere in > test_postmaster_connection has got funny behavior under load; there are > way too many cases where pg_ctl gives up after exactly that long, with > no useful information printed. The draft patch I posted gets rid of that > behavior ... ... fixing this is a good deal more valuable. Besides coping with stalls in early postmaster startup, pg_ctl would no longer take 5s to exit when the postmaster reports a postgresql.conf defect or other early error. On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > If this does (or can be made to) work on Windows, I'd propose applying > it back to 9.2, where the current coding came in. That seems prudent. We'll keep meeting buildfarm members with performance variability sufficient to reach that 5s timeout. (Incidentally, the current coding first appeared in 9.1.) > --- 650,671 ---- > } > > /* > ! * Reap child process if it died; if it remains in zombie state then > ! * our kill-based test will think it's still running. > */ > ! #ifndef WIN32 > { > ! int exitstatus; > > ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG); Let's report any exit status. When a signal kills the young postmaster, it's a waste to omit that detail. > } > + #endif > > /* > ! * Check whether the child postmaster process is still alive. This > ! * lets us exit early if the postmaster fails during startup. > */ > ! if (!postmaster_is_alive((pid_t) pm_pid)) > return PQPING_NO_RESPONSE; kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. Thanks, nm
Noah Misch <noah@leadboat.com> writes: > kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. The kill() coding works on Windows (I believe); waitpid() not so much. regards, tom lane
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. > > The kill() coding works on Windows (I believe); waitpid() not so much. Windows has the concept under a different name. See postmaster.c.
Noah Misch <noah@leadboat.com> writes: > On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote: >> The kill() coding works on Windows (I believe); waitpid() not so much. > Windows has the concept under a different name. See postmaster.c. Well, if someone else wants to code and test that, feel free. I don't do Windows ... regards, tom lane
I wrote: > Attached is a draft patch for this. I think it's fine for Unix (unless > someone wants to object to relying on "/bin/sh -c"), but I have no idea > whether it works for Windows. The main risk is that if CMD.EXE runs > the postmaster as a subprocess rather than overlaying itself a la shell > "exec", the PID we'd get back would apply only to CMD.EXE not to the > eventual postmaster. If so, I do not know how to fix that, or whether > it's fixable at all. > Note that this makes the test case in question fail reliably, which is > reasonable behavior IMO so I just changed the test. > If this does (or can be made to) work on Windows, I'd propose applying > it back to 9.2, where the current coding came in. Nobody seems to have stepped up to fix the Windows side of this, but after some more thought it occurred to me that maybe it doesn't need (much) fixing. There are two things that pg_ctl wants the PID for: 1. To check if the postmaster.pid file contains the expected child process PID. Well, that's a nice-to-have, but we never had it before and got along well enough, because the start-timestamp comparison check covers most real-world cases. We can omit the PID-match check on Windows. 2. To see whether the child postmaster process is still running, via a "kill(pid, 0)" test. But if we've launched the postmaster through a shell, and not used "&" semantics, presumably that shell will wait for its child process to exit and then exit itself. So *testing whether the shell is still there is just about as good as testing the real postmaster PID*. Therefore, we don't really need to worry about rewriting the Windows subprocess-launch logic. If someone would like to do it, that would be nice, but we don't have to wait for that to happen before we can get rid of the 5-second-timeout issues. So the attached modified patch adjusts the PID-match logic and some comments, but is otherwise what I posted before. I believe that this might actually work on Windows, but I have no way to test it. Someone please try that? (Don't forget to test the service-start path, too.) regards, tom lane diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 6a36d29..62db328 100644 *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *************** *** 28,33 **** --- 28,34 ---- #include <time.h> #include <sys/types.h> #include <sys/stat.h> + #include <sys/wait.h> #include <unistd.h> #ifdef HAVE_SYS_RESOURCE_H *************** static int CreateRestrictedProcess(char *** 153,162 **** static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static int start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(bool); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) --- 154,163 ---- static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static pgpid_t start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) *************** free_readfile(char **optlines) *** 419,454 **** * start/test/stop routines */ ! static int start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 /* * Since there might be quotes to handle here, it is easier simply to pass ! * everything to a shell to process them. ! * ! * XXX it would be better to fork and exec so that we would know the child ! * postmaster's PID directly; then test_postmaster_connection could use ! * the PID without having to rely on reading it back from the pidfile. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL); ! return system(cmd); #else /* WIN32 */ /* ! * On win32 we don't use system(). So we don't need to use & (which would ! * be START /B on win32). However, we still call the shell (CMD.EXE) with ! * it to handle redirection etc. */ PROCESS_INFORMATION pi; --- 420,489 ---- * start/test/stop routines */ ! /* ! * Start the postmaster and return its PID. ! * ! * Currently, on Windows what we return is the PID of the shell process ! * that launched the postmaster (and, we trust, is waiting for it to exit). ! * So the PID is usable for "is the postmaster still running" checks, ! * but cannot be compared directly to postmaster.pid. ! */ ! static pgpid_t start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 + pgpid_t pm_pid; + + /* Flush stdio channels just before fork, to avoid double-output problems */ + fflush(stdout); + fflush(stderr); + + pm_pid = fork(); + if (pm_pid < 0) + { + /* fork failed */ + write_stderr(_("%s: could not start server: %s\n"), + progname, strerror(errno)); + exit(1); + } + if (pm_pid > 0) + { + /* fork succeeded, in parent */ + return pm_pid; + } + + /* fork succeeded, in child */ /* * Since there might be quotes to handle here, it is easier simply to pass ! * everything to a shell to process them. Use exec so that the postmaster ! * has the same PID as the current child process. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL); ! (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL); ! ! /* exec failed */ ! write_stderr(_("%s: could not start server: %s\n"), ! progname, strerror(errno)); ! exit(1); ! ! return 0; /* keep dumb compilers quiet */ ! #else /* WIN32 */ /* ! * As with the Unix case, it's easiest to use the shell (CMD.EXE) to ! * handle redirection etc. Unfortunately CMD.EXE lacks any equivalent of ! * "exec", so we don't get to find out the postmaster's PID immediately. */ PROCESS_INFORMATION pi; *************** start_postmaster(void) *** 460,469 **** exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, &pi, false)) ! return GetLastError(); CloseHandle(pi.hProcess); CloseHandle(pi.hThread); ! return 0; #endif /* WIN32 */ } --- 495,508 ---- exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, &pi, false)) ! { ! write_stderr(_("%s: could not start server: error code %lu\n"), ! progname, (unsigned long) GetLastError()); ! exit(1); ! } CloseHandle(pi.hProcess); CloseHandle(pi.hThread); ! return pi.dwProcessId; /* Shell's PID, not postmaster's! */ #endif /* WIN32 */ } *************** start_postmaster(void) *** 472,486 **** /* * Find the pgport and try a connection * * Note that the checkpoint parameter enables a Windows service control * manager checkpoint, it's got nothing to do with database checkpoints!! */ static PGPing ! test_postmaster_connection(bool do_checkpoint) { PGPing ret = PQPING_NO_RESPONSE; - bool found_stale_pidfile = false; - pgpid_t pm_pid = 0; char connstr[MAXPGPATH * 2 + 256]; int i; --- 511,527 ---- /* * Find the pgport and try a connection * + * On Unix, pm_pid is the PID of the just-launched postmaster. On Windows, + * it's the PID of an ancestor shell process, so we can't check the contents + * of postmaster.pid quite as carefully. + * * Note that the checkpoint parameter enables a Windows service control * manager checkpoint, it's got nothing to do with database checkpoints!! */ static PGPing ! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint) { PGPing ret = PQPING_NO_RESPONSE; char connstr[MAXPGPATH * 2 + 256]; int i; *************** test_postmaster_connection(bool do_check *** 535,563 **** optlines[5] != NULL) { /* File is complete enough for us, parse it */ ! long pmpid; time_t pmstart; /* ! * Make sanity checks. If it's for a standalone backend ! * (negative PID), or the recorded start time is before ! * pg_ctl started, then either we are looking at the wrong ! * data directory, or this is a pre-existing pidfile that ! * hasn't (yet?) been overwritten by our child postmaster. ! * Allow 2 seconds slop for possible cross-process clock ! * skew. */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); ! if (pmpid <= 0 || pmstart < start_time - 2) ! { ! /* ! * Set flag to report stale pidfile if it doesn't get ! * overwritten before we give up waiting. ! */ ! found_stale_pidfile = true; ! } ! else { /* * OK, seems to be a valid pidfile from our child. --- 576,602 ---- optlines[5] != NULL) { /* File is complete enough for us, parse it */ ! pgpid_t pmpid; time_t pmstart; /* ! * Make sanity checks. If it's for the wrong PID, or the ! * recorded start time is before pg_ctl started, then ! * either we are looking at the wrong data directory, or ! * this is a pre-existing pidfile that hasn't (yet?) been ! * overwritten by our child postmaster. Allow 2 seconds ! * slop for possible cross-process clock skew. */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); ! if (pmstart >= start_time - 2 && ! #ifndef WIN32 ! pmpid == pm_pid ! #else ! /* Windows can only reject standalone-backend PIDs */ ! pmpid > 0 ! #endif ! ) { /* * OK, seems to be a valid pidfile from our child. *************** test_postmaster_connection(bool do_check *** 567,575 **** char *hostaddr; char host_str[MAXPGPATH]; - found_stale_pidfile = false; - pm_pid = (pgpid_t) pmpid; - /* * Extract port number and host string to use. Prefer * using Unix socket if available. --- 606,611 ---- *************** test_postmaster_connection(bool do_check *** 635,675 **** } /* ! * The postmaster should create postmaster.pid very soon after being ! * started. If it's not there after we've waited 5 or more seconds, ! * assume startup failed and give up waiting. (Note this covers both ! * cases where the pidfile was never created, and where it was created ! * and then removed during postmaster exit.) Also, if there *is* a ! * file there but it appears stale, issue a suitable warning and give ! * up waiting. */ ! if (i >= 5) { ! struct stat statbuf; ! ! if (stat(pid_file, &statbuf) != 0) ! { ! if (errno != ENOENT) ! write_stderr(_("\n%s: could not stat file \"%s\": %s\n"), ! progname, pid_file, strerror(errno)); ! return PQPING_NO_RESPONSE; ! } ! if (found_stale_pidfile) ! { ! write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"), ! progname); ! return PQPING_NO_RESPONSE; ! } } /* ! * If we've been able to identify the child postmaster's PID, check ! * the process is still alive. This covers cases where the postmaster ! * successfully created the pidfile but then crashed without removing ! * it. */ ! if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid)) return PQPING_NO_RESPONSE; /* No response, or startup still in process; wait */ --- 671,695 ---- } /* ! * Reap child process if it died; if it remains in zombie state then ! * our kill-based test will think it's still running. */ ! #ifndef WIN32 { ! int exitstatus; ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG); } + #endif /* ! * Check whether the child postmaster process is still alive. This ! * lets us exit early if the postmaster fails during startup. ! * ! * On Windows, we are checking the postmaster's parent shell, but ! * that's fine for this purpose. */ ! if (!postmaster_is_alive((pid_t) pm_pid)) return PQPING_NO_RESPONSE; /* No response, or startup still in process; wait */ *************** static void *** 836,842 **** do_start(void) { pgpid_t old_pid = 0; ! int exitcode; if (ctl_command != RESTART_COMMAND) { --- 856,862 ---- do_start(void) { pgpid_t old_pid = 0; ! pgpid_t pm_pid; if (ctl_command != RESTART_COMMAND) { *************** do_start(void) *** 876,894 **** } #endif ! exitcode = start_postmaster(); ! if (exitcode != 0) ! { ! write_stderr(_("%s: could not start server: exit code was %d\n"), ! progname, exitcode); ! exit(1); ! } if (do_wait) { print_msg(_("waiting for server to start...")); ! switch (test_postmaster_connection(false)) { case PQPING_OK: print_msg(_(" done\n")); --- 896,908 ---- } #endif ! pm_pid = start_postmaster(); if (do_wait) { print_msg(_("waiting for server to start...")); ! switch (test_postmaster_connection(pm_pid, false)) { case PQPING_OK: print_msg(_(" done\n")); *************** pgwin32_ServiceMain(DWORD argc, LPTSTR * *** 1585,1591 **** if (do_wait) { write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n")); ! if (test_postmaster_connection(true) != PQPING_OK) { write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); pgwin32_SetServiceStatus(SERVICE_STOPPED); --- 1599,1605 ---- if (do_wait) { write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n")); ! if (test_postmaster_connection(postmasterPID, true) != PQPING_OK) { write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n")); pgwin32_SetServiceStatus(SERVICE_STOPPED); *************** pgwin32_ServiceMain(DWORD argc, LPTSTR * *** 1606,1613 **** { /* * status.dwCheckPoint can be incremented by ! * test_postmaster_connection(true), so it might not start ! * from 0. */ int maxShutdownCheckPoint = status.dwCheckPoint + 12;; --- 1620,1626 ---- { /* * status.dwCheckPoint can be incremented by ! * test_postmaster_connection(), so it might not start from 0. */ int maxShutdownCheckPoint = status.dwCheckPoint + 12;; diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index 177a16f..3cd5a44 100644 *** a/src/bin/pg_ctl/t/001_start_stop.pl --- b/src/bin/pg_ctl/t/001_start_stop.pl *************** else *** 34,41 **** close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); ! command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], ! 'second pg_ctl start succeeds'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w'); command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], --- 34,41 ---- close CONF; command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], 'pg_ctl start -w'); ! command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], ! 'second pg_ctl start fails'); command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], 'pg_ctl stop -w'); command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
I wrote: > So the attached modified patch adjusts the PID-match logic and some > comments, but is otherwise what I posted before. I believe that this > might actually work on Windows, but I have no way to test it. Someone > please try that? (Don't forget to test the service-start path, too.) Has anyone tried that patch on Windows? I'd like to push it in hopes of improving buildfarm stability, but I'm hesitant to do so without some confirmation that it acts as I think it will. regards, tom lane
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > > Has anyone tried that patch on Windows? I'd like to push it in hopes of > improving buildfarm stability, but I'm hesitant to do so without some > confirmation that it acts as I think it will. I marked this patch as something to look at, though I haven't take the time to do it yet... -- Michael
On Wed, Oct 7, 2015 at 7:05 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wrote: >>> So the attached modified patch adjusts the PID-match logic and some >>> comments, but is otherwise what I posted before. I believe that this >>> might actually work on Windows, but I have no way to test it. Someone >>> please try that? (Don't forget to test the service-start path, too.) >> >> Has anyone tried that patch on Windows? I'd like to push it in hopes of >> improving buildfarm stability, but I'm hesitant to do so without some >> confirmation that it acts as I think it will. > > I marked this patch as something to look at, though I haven't take the > time to do it yet... s/take/taken -- Michael
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: > So the attached modified patch adjusts the PID-match logic and some > comments, but is otherwise what I posted before. I believe that this > might actually work on Windows, but I have no way to test it. Someone > please try that? (Don't forget to test the service-start path, too.) If "pg_ctl start" is invoked directly from a command prompt, it works. Now, if I run "pg_ctl start" within a script (in an msysgit environment for example), pg_ctl fails, complaining that postmaster.pid already exists. The TAP tests get broken as well, attached are the log files of the run, and here is an interested bit: # Running: pg_ctl start -D C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w waiting for server to start... stopped waiting not ok 12 - pg_ctl start -w # Failed test 'pg_ctl start -w' # at C:/Users/ioltas/git/postgres/src/test/perl/TestLib.pm line 282. # Running: pg_ctl start -D C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w waiting for server to start...FATAL: lock file "postmaster.pid" already exists HINT: Is another postmaster (PID 4040) running in data directory "C:/Users/ioltas/git/postgres/src/bin/pg_ctl/tmp_testor2K/data"? LOG: database system was shut down at 2015-10-06 23:58:02 PDT LOG: MultiXact member wraparound protections are now enabled FATAL: the database system is starting up LOG: database system is ready to accept connections LOG: autovacuum launcher started stopped waiting Registering a service to the SCM works, but the service registered refuses to start. Regards, -- Michael
Attachment
Michael Paquier <michael.paquier@gmail.com> writes: > On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > If "pg_ctl start" is invoked directly from a command prompt, it works. > Now, if I run "pg_ctl start" within a script (in an msysgit > environment for example), pg_ctl fails, complaining that > postmaster.pid already exists. The TAP tests get broken as well, Hm, the TAP trace makes it look like the postmaster start does actually work, but pg_ctl isn't correctly waiting for that to happen. (The complaint about "already exists" seems to be from the second start attempt, and is exactly what to expect there.) I could believe that I'd fat-fingered the PID comparison logic in test_postmaster_connection, but I don't understand how it would work from a command prompt but not from a script. Any ideas? regards, tom lane
Michael Paquier <michael.paquier@gmail.com> writes: > On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > If "pg_ctl start" is invoked directly from a command prompt, it works. > Now, if I run "pg_ctl start" within a script (in an msysgit > environment for example), pg_ctl fails, complaining that > postmaster.pid already exists. The TAP tests get broken as well, Reading that again, I think you mean that "pg_ctl start" works but you didn't try "pg_ctl start -w" manually? Because it's "-w" that's at issue here, and the failing test cases are using that. After studying the test logs more carefully, I think the behavior they're showing is consistent with the idea that postmaster_is_alive() doesn't work on the CMD shell process. Which seems very likely now that I think about it, because we're depending on an implementation of kill() that is PG-specific. So in fact this idea doesn't work :-(. I think there is still room to salvage something without fully rewriting the postmaster invocation logic to avoid using CMD, because it's still true that checking whether the CMD process is still there should be as good as checking the postmaster proper. We just can't use kill() for it. I'd be inclined to get rid of the use of kill() on Unix as well; as Noah mentioned upthread, if we're using fork/exec we might as well pay attention to waitpid's results instead. On Windows, I imagine that the thing to do is have start_postmaster() save aside the handle for the CMD process, and then in test_postmaster_connection(), check the handle state to see if the process is still running (I assume there's a way to do that). I can take care of the Unix side of that, but as before, somebody else would need to code and test the Windows side. It's probably just a few lines of code to add ... any volunteers? regards, tom lane
On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >>> So the attached modified patch adjusts the PID-match logic and some >>> comments, but is otherwise what I posted before. I believe that this >>> might actually work on Windows, but I have no way to test it. Someone >>> please try that? (Don't forget to test the service-start path, too.) > >> If "pg_ctl start" is invoked directly from a command prompt, it works. >> Now, if I run "pg_ctl start" within a script (in an msysgit >> environment for example), pg_ctl fails, complaining that >> postmaster.pid already exists. The TAP tests get broken as well, > > Reading that again, I think you mean that "pg_ctl start" works but you > didn't try "pg_ctl start -w" manually? Because it's "-w" that's at > issue here, and the failing test cases are using that. Yes, sorry. I fat-fingered the command from the prompt and forgot the -w switch. test_postmaster_connection just behaves incorrectly, and exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting, could not start server, blah". > I think there is still room to salvage something without fully rewriting > the postmaster invocation logic to avoid using CMD, because it's still > true that checking whether the CMD process is still there should be as > good as checking the postmaster proper. We just can't use kill() for it. > I'd be inclined to get rid of the use of kill() on Unix as well; as Noah > mentioned upthread, if we're using fork/exec we might as well pay > attention to waitpid's results instead. On Windows, I imagine that the > thing to do is have start_postmaster() save aside the handle for the CMD > process, and then in test_postmaster_connection(), check the handle state > to see if the process is still running (I assume there's a way to do > that). That would be WaitForSingleObject(handle, ms_timeout) == WAIT_OBJECT_0, no? The handle should be picked up from CreateRestrictedProcess, and I think that CloseHandle should not be called on pi.hProcess if we are going to wait for it afterwards. Reference: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx > I can take care of the Unix side of that, but as before, somebody else > would need to code and test the Windows side. It's probably just a few > lines of code to add ... any volunteers? I could give it a shot, do you still want to rework the Unix portion of the patch? There are not many folks around able to test the patch either way. Regards, -- Michael
On Thu, Oct 8, 2015 at 3:09 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Michael Paquier <michael.paquier@gmail.com> writes: >>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >>>> So the attached modified patch adjusts the PID-match logic and some >>>> comments, but is otherwise what I posted before. I believe that this >>>> might actually work on Windows, but I have no way to test it. Someone >>>> please try that? (Don't forget to test the service-start path, too.) >> >>> If "pg_ctl start" is invoked directly from a command prompt, it works. >>> Now, if I run "pg_ctl start" within a script (in an msysgit >>> environment for example), pg_ctl fails, complaining that >>> postmaster.pid already exists. The TAP tests get broken as well, >> >> Reading that again, I think you mean that "pg_ctl start" works but you >> didn't try "pg_ctl start -w" manually? Because it's "-w" that's at >> issue here, and the failing test cases are using that. > > Yes, sorry. I fat-fingered the command from the prompt and forgot the > -w switch. test_postmaster_connection just behaves incorrectly, and > exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting, > could not start server, blah". > >> I think there is still room to salvage something without fully rewriting >> the postmaster invocation logic to avoid using CMD, because it's still >> true that checking whether the CMD process is still there should be as >> good as checking the postmaster proper. We just can't use kill() for it. >> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah >> mentioned upthread, if we're using fork/exec we might as well pay >> attention to waitpid's results instead. On Windows, I imagine that the >> thing to do is have start_postmaster() save aside the handle for the CMD >> process, and then in test_postmaster_connection(), check the handle state >> to see if the process is still running (I assume there's a way to do >> that). > > That would be WaitForSingleObject(handle, ms_timeout) == > WAIT_OBJECT_0, no? The handle should be picked up from > CreateRestrictedProcess, and I think that CloseHandle should not be > called on pi.hProcess if we are going to wait for it afterwards. > Reference: > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx I had a look at that, and attached is an updated patch showing the concept of using an HANDLE that pg_ctl waits for. I agree that saving an HANDLE the way this patch does using a static variable is a bit ugly especially knowing that service registration uses test_postmaster_connection as well with directly a postmaster. The good thing is that tapcheck runs smoothly, with one single failure though: the second call to pg_ctl start -w may succeed instead of failing if kicked within an interval of 3 seconds after the first one, based on the tests on my Windows VM. My guess is that this is caused by the fact that we monitor the shell process and not the postmaster itself. Thoughts? Regards, -- Michael
Attachment
Michael Paquier wrote: > > That would be WaitForSingleObject(handle, ms_timeout) == > > WAIT_OBJECT_0, no? The handle should be picked up from > > CreateRestrictedProcess, and I think that CloseHandle should not be > > called on pi.hProcess if we are going to wait for it afterwards. > > Reference: > > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx > > I had a look at that, and attached is an updated patch showing the > concept of using an HANDLE that pg_ctl waits for. I wonder if you are interested in rewriting this whole thing to not use cmd.exe at all, which as I understand is just about output redirection. Earlier in the thread I pointed out that on Microsoft docs they claim that you can redirect stdout/stderr when creating a new process without going through cmd.exe; see https://www.postgresql.org/message-id/20150903202114.GG2912%40alvherre.pgsql -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I wonder if you are interested in rewriting this whole thing to not use > cmd.exe at all, which as I understand is just about output redirection. FWIW, I have little interest in going there, because I'm afraid that getting it to be bug-compatible with the existing behavior on points such as quoting of -o arguments would be extremely fiddly. Perhaps it'd be worth doing it in master, but I'd be afraid to back-patch such a change --- and what I'm after right now is a back-patchable way of getting rid of the 5-second-timeout behavior. regards, tom lane
Michael Paquier <michael.paquier@gmail.com> writes: >> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think there is still room to salvage something without fully rewriting >>> the postmaster invocation logic to avoid using CMD, because it's still >>> true that checking whether the CMD process is still there should be as >>> good as checking the postmaster proper. We just can't use kill() for it. > I had a look at that, and attached is an updated patch showing the > concept of using an HANDLE that pg_ctl waits for. I agree that saving > an HANDLE the way this patch does using a static variable is a bit > ugly especially knowing that service registration uses > test_postmaster_connection as well with directly a postmaster. The > good thing is that tapcheck runs smoothly, with one single failure > though: the second call to pg_ctl start -w may succeed instead of > failing if kicked within an interval of 3 seconds after the first one, > based on the tests on my Windows VM. My guess is that this is caused > by the fact that we monitor the shell process and not the postmaster > itself. Thoughts? Yeah, it can still succeed because pg_ctl can't tell that the postmaster.pid created by the earlier invocation isn't the one it wants. It adopts the values out of that file, tests the connection, finds it works, and declares victory, not realizing that the postmaster *it* started will soon fail (or maybe already has). Waiting more than 2 seconds is enough to make sure that test_postmaster_connection sees the pre-existing pidfile as stale and doesn't believe that it represents a successful postmaster start. So there's still something to be desired on Windows; but it's still better than before in that we can reliably detect child process exit instead of having to use the five-second heuristic. And of course on Unix this is way better than before. So I've pushed this with some cosmetic adjustments, as well as the not so cosmetic adjustment of making the service-start path also use handle testing. If there are remaining problems, the buildfarm should tell us. I'm not sure if this will completely fix our problems with "pg_ctl start" related buildfarm failures on very slow critters. It does get rid of the hard wired 5-second timeout, but the 60-second timeout could still be an issue. I think Noah was considering a patch to allow that number to be raised. I'd be in favor of letting pg_ctl accept a default timeout length from an environment variable, and then the slower critters could be fixed by adjusting their buildfarm configurations. regards, tom lane
On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So there's still something to be desired on Windows; but it's still > better than before in that we can reliably detect child process exit > instead of having to use the five-second heuristic. And of course on > Unix this is way better than before. > > So I've pushed this with some cosmetic adjustments, as well as the not > so cosmetic adjustment of making the service-start path also use handle > testing. If there are remaining problems, the buildfarm should tell us. Thanks! I used at some point when hacking postmasterProcess as well to save the handle but for clarity as this was already used for shutdown when running things as a service of the Windows SCM I put that in another variable. So far so good based on some extra testing I just did now. > I'm not sure if this will completely fix our problems with "pg_ctl start" > related buildfarm failures on very slow critters. It does get rid of the > hard wired 5-second timeout, but the 60-second timeout could still be an > issue. I think Noah was considering a patch to allow that number to be > raised. I'd be in favor of letting pg_ctl accept a default timeout length > from an environment variable, and then the slower critters could be fixed > by adjusting their buildfarm configurations. Being able to pass that as a command-line parameter (master-only change) would be welcome as well IMO. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm not sure if this will completely fix our problems with "pg_ctl start" >> related buildfarm failures on very slow critters. It does get rid of the >> hard wired 5-second timeout, but the 60-second timeout could still be an >> issue. I think Noah was considering a patch to allow that number to be >> raised. I'd be in favor of letting pg_ctl accept a default timeout length >> from an environment variable, and then the slower critters could be fixed >> by adjusting their buildfarm configurations. > Being able to pass that as a command-line parameter (master-only > change) would be welcome as well IMO. Huh? The --timeout parameter has always been there. I'm just expressing an opinion that modifying all the makefiles and test scripts to pass through a timeout setting would be too messy. regards, tom lane