Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members - Mailing list pgsql-hackers

From Tom Lane
Subject Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Date
Msg-id 5525.1443226365@sss.pgh.pa.us
Whole thread Raw
In response to Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
I wrote:
> Attached is a draft patch for this.  I think it's fine for Unix (unless
> someone wants to object to relying on "/bin/sh -c"), but I have no idea
> whether it works for Windows.  The main risk is that if CMD.EXE runs
> the postmaster as a subprocess rather than overlaying itself a la shell
> "exec", the PID we'd get back would apply only to CMD.EXE not to the
> eventual postmaster.  If so, I do not know how to fix that, or whether
> it's fixable at all.

> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.

> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

Nobody seems to have stepped up to fix the Windows side of this, but
after some more thought it occurred to me that maybe it doesn't need
(much) fixing.  There are two things that pg_ctl wants the PID for:

1. To check if the postmaster.pid file contains the expected child process
PID.  Well, that's a nice-to-have, but we never had it before and got
along well enough, because the start-timestamp comparison check covers
most real-world cases.  We can omit the PID-match check on Windows.

2. To see whether the child postmaster process is still running, via a
"kill(pid, 0)" test.  But if we've launched the postmaster through a
shell, and not used "&" semantics, presumably that shell will wait for its
child process to exit and then exit itself.  So *testing whether the shell
is still there is just about as good as testing the real postmaster PID*.

Therefore, we don't really need to worry about rewriting the Windows
subprocess-launch logic.  If someone would like to do it, that would
be nice, but we don't have to wait for that to happen before we can
get rid of the 5-second-timeout issues.

So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before.  I believe that this
might actually work on Windows, but I have no way to test it.  Someone
please try that?  (Don't forget to test the service-start path, too.)

            regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..62db328 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 28,33 ****
--- 28,34 ----
  #include <time.h>
  #include <sys/types.h>
  #include <sys/stat.h>
+ #include <sys/wait.h>
  #include <unistd.h>

  #ifdef HAVE_SYS_RESOURCE_H
*************** static int    CreateRestrictedProcess(char
*** 153,162 ****
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int    start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 ----
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*************** free_readfile(char **optlines)
*** 419,454 ****
   * start/test/stop routines
   */

! static int
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
!      * everything to a shell to process them.
!      *
!      * XXX it would be better to fork and exec so that we would know the child
!      * postmaster's PID directly; then test_postmaster_connection could use
!      * the PID without having to rely on reading it back from the pidfile.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     return system(cmd);
  #else                            /* WIN32 */

      /*
!      * On win32 we don't use system(). So we don't need to use & (which would
!      * be START /B on win32). However, we still call the shell (CMD.EXE) with
!      * it to handle redirection etc.
       */
      PROCESS_INFORMATION pi;

--- 420,489 ----
   * start/test/stop routines
   */

! /*
!  * Start the postmaster and return its PID.
!  *
!  * Currently, on Windows what we return is the PID of the shell process
!  * that launched the postmaster (and, we trust, is waiting for it to exit).
!  * So the PID is usable for "is the postmaster still running" checks,
!  * but cannot be compared directly to postmaster.pid.
!  */
! static pgpid_t
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32
+     pgpid_t        pm_pid;
+
+     /* Flush stdio channels just before fork, to avoid double-output problems */
+     fflush(stdout);
+     fflush(stderr);
+
+     pm_pid = fork();
+     if (pm_pid < 0)
+     {
+         /* fork failed */
+         write_stderr(_("%s: could not start server: %s\n"),
+                      progname, strerror(errno));
+         exit(1);
+     }
+     if (pm_pid > 0)
+     {
+         /* fork succeeded, in parent */
+         return pm_pid;
+     }
+
+     /* fork succeeded, in child */

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
!      * everything to a shell to process them.  Use exec so that the postmaster
!      * has the same PID as the current child process.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
!
!     /* exec failed */
!     write_stderr(_("%s: could not start server: %s\n"),
!                  progname, strerror(errno));
!     exit(1);
!
!     return 0;                    /* keep dumb compilers quiet */
!
  #else                            /* WIN32 */

      /*
!      * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
!      * handle redirection etc.  Unfortunately CMD.EXE lacks any equivalent of
!      * "exec", so we don't get to find out the postmaster's PID immediately.
       */
      PROCESS_INFORMATION pi;

*************** start_postmaster(void)
*** 460,469 ****
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!         return GetLastError();
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return 0;
  #endif   /* WIN32 */
  }

--- 495,508 ----
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!     {
!         write_stderr(_("%s: could not start server: error code %lu\n"),
!                      progname, (unsigned long) GetLastError());
!         exit(1);
!     }
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return pi.dwProcessId;        /* Shell's PID, not postmaster's! */
  #endif   /* WIN32 */
  }

*************** start_postmaster(void)
*** 472,486 ****
  /*
   * Find the pgport and try a connection
   *
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
-     bool        found_stale_pidfile = false;
-     pgpid_t        pm_pid = 0;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

--- 511,527 ----
  /*
   * Find the pgport and try a connection
   *
+  * On Unix, pm_pid is the PID of the just-launched postmaster.  On Windows,
+  * it's the PID of an ancestor shell process, so we can't check the contents
+  * of postmaster.pid quite as carefully.
+  *
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

*************** test_postmaster_connection(bool do_check
*** 535,563 ****
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     long        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for a standalone backend
!                      * (negative PID), or the recorded start time is before
!                      * pg_ctl started, then either we are looking at the wrong
!                      * data directory, or this is a pre-existing pidfile that
!                      * hasn't (yet?) been overwritten by our child postmaster.
!                      * Allow 2 seconds slop for possible cross-process clock
!                      * skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmpid <= 0 || pmstart < start_time - 2)
!                     {
!                         /*
!                          * Set flag to report stale pidfile if it doesn't get
!                          * overwritten before we give up waiting.
!                          */
!                         found_stale_pidfile = true;
!                     }
!                     else
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
--- 576,602 ----
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     pgpid_t        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for the wrong PID, or the
!                      * recorded start time is before pg_ctl started, then
!                      * either we are looking at the wrong data directory, or
!                      * this is a pre-existing pidfile that hasn't (yet?) been
!                      * overwritten by our child postmaster.  Allow 2 seconds
!                      * slop for possible cross-process clock skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmstart >= start_time - 2 &&
! #ifndef WIN32
!                         pmpid == pm_pid
! #else
!                     /* Windows can only reject standalone-backend PIDs */
!                         pmpid > 0
! #endif
!                         )
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
*************** test_postmaster_connection(bool do_check
*** 567,575 ****
                          char       *hostaddr;
                          char        host_str[MAXPGPATH];

-                         found_stale_pidfile = false;
-                         pm_pid = (pgpid_t) pmpid;
-
                          /*
                           * Extract port number and host string to use. Prefer
                           * using Unix socket if available.
--- 606,611 ----
*************** test_postmaster_connection(bool do_check
*** 635,675 ****
          }

          /*
!          * The postmaster should create postmaster.pid very soon after being
!          * started.  If it's not there after we've waited 5 or more seconds,
!          * assume startup failed and give up waiting.  (Note this covers both
!          * cases where the pidfile was never created, and where it was created
!          * and then removed during postmaster exit.)  Also, if there *is* a
!          * file there but it appears stale, issue a suitable warning and give
!          * up waiting.
           */
!         if (i >= 5)
          {
!             struct stat statbuf;
!
!             if (stat(pid_file, &statbuf) != 0)
!             {
!                 if (errno != ENOENT)
!                     write_stderr(_("\n%s: could not stat file \"%s\": %s\n"),
!                                  progname, pid_file, strerror(errno));
!                 return PQPING_NO_RESPONSE;
!             }

!             if (found_stale_pidfile)
!             {
!                 write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"),
!                              progname);
!                 return PQPING_NO_RESPONSE;
!             }
          }

          /*
!          * If we've been able to identify the child postmaster's PID, check
!          * the process is still alive.  This covers cases where the postmaster
!          * successfully created the pidfile but then crashed without removing
!          * it.
           */
!         if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
--- 671,695 ----
          }

          /*
!          * Reap child process if it died; if it remains in zombie state then
!          * our kill-based test will think it's still running.
           */
! #ifndef WIN32
          {
!             int            exitstatus;

!             (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);
          }
+ #endif

          /*
!          * Check whether the child postmaster process is still alive.  This
!          * lets us exit early if the postmaster fails during startup.
!          *
!          * On Windows, we are checking the postmaster's parent shell, but
!          * that's fine for this purpose.
           */
!         if (!postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
*************** static void
*** 836,842 ****
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     int            exitcode;

      if (ctl_command != RESTART_COMMAND)
      {
--- 856,862 ----
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     pgpid_t        pm_pid;

      if (ctl_command != RESTART_COMMAND)
      {
*************** do_start(void)
*** 876,894 ****
      }
  #endif

!     exitcode = start_postmaster();
!     if (exitcode != 0)
!     {
!         write_stderr(_("%s: could not start server: exit code was %d\n"),
!                      progname, exitcode);
!         exit(1);
!     }

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
--- 896,908 ----
      }
  #endif

!     pm_pid = start_postmaster();

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(pm_pid, false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1585,1591 ****
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
--- 1599,1605 ----
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1606,1613 ****
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(true), so it might not start
!                  * from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

--- 1620,1626 ----
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(), so it might not start from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 177a16f..3cd5a44 100644
*** a/src/bin/pg_ctl/t/001_start_stop.pl
--- b/src/bin/pg_ctl/t/001_start_stop.pl
*************** else
*** 34,41 ****
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start succeeds');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
--- 34,41 ----
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start fails');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],

pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Tab completion for ALTER COLUMN SET STATISTICS
Next
From: Robert Haas
Date:
Subject: Re: Parallel Seq Scan