Thread: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Noah Misch
Date:
My AIX buildfarm members have failed the BinInstallCheck step on and off since
inception.  It became more frequent when I added animals sungazer and tern
alongside the older hornet and mandrill.  The animals share a machine with
each other and with dozens of other developers.  I setpriority() the animals
to the lowest available priority, so they probably lose the CPU for long
periods.  Separately, this machine has slow filesystem metadata operations.
For example, git-new-workdir takes ~50s for a PostgreSQL tree.

The pg_rewind suite has failed a few times when crash recovery took longer
than the 60s pg_ctl default timeout.  Disabling fsync (commit 7d7a103) reduced
median crash recovery time by 75%, which may suffice.  If not, I'll be
inclined to add --timeout=900 to each pg_ctl invocation.


The pg_ctl suite has failed with "not ok 12 - second pg_ctl start succeeds".
You can reproduce that by adding "sleep 3;" between that test and the one
before it.  The timing dependency comes from the pg_ctl "slop" time:
                /*                 * Make sanity checks.  If it's for a standalone backend                 * (negative
PID),or the recorded start time is before                 * pg_ctl started, then either we are looking at the wrong
           * data directory, or this is a pre-existing pidfile that                 * hasn't (yet?) been overwritten by
ourchild postmaster.                 * Allow 2 seconds slop for possible cross-process clock                 * skew.
            */
 

The "second pg_ctl start succeeds" tested-for behavior is actually a minor bug
that we'd ideally fix as described in the last paragraph of the commit 3c485ca
log message:
   All of this could be improved if we rewrote start_postmaster() so that it   could report the child postmaster's PID,
sothat we'd know a-priori the   correct PID to test with postmaster_is_alive().  That looks like a bit too   much
changefor so late in the 9.1 development cycle, unfortunately.
 

I recommend we invert the test expectation and, pending the ideal pg_ctl fix,
add the "sleep 3" to avoid falling within the time slop:

--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -35,6 +35,7 @@ close CONF;command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],    'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-    'second pg_ctl start succeeds');
+sleep 3;    # bridge test_postmaster_connection() slop threshold
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+    'second pg_ctl start fails');command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
'pg_ctlstop -w');
 


Alternately, I could just remove the test.

crake failed the same way, once:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Andres Freund
Date:
On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
> --- a/src/bin/pg_ctl/t/001_start_stop.pl
> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
> @@ -35,6 +35,7 @@ close CONF;
>  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>      'pg_ctl start -w');
> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> -    'second pg_ctl start succeeds');
> +sleep 3;    # bridge test_postmaster_connection() slop threshold
> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> +    'second pg_ctl start fails');
>  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>      'pg_ctl stop -w');

I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.

The second pg_ctl start doesn't seem to test something very
interesting. I'm inclined to just remove it. I'm not caffeinated
sufficiently, but afaics that ought to sidestep the issue as stop
doesn't depend on the slop time?

> crake failed the same way, once:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06

Sounds like an actual production hazard too.

- Andres



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
>> --- a/src/bin/pg_ctl/t/001_start_stop.pl
>> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
>> @@ -35,6 +35,7 @@ close CONF;
>> command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> 'pg_ctl start -w');
>> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> -    'second pg_ctl start succeeds');
>> +sleep 3;    # bridge test_postmaster_connection() slop threshold
>> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> +    'second pg_ctl start fails');
>> command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>> 'pg_ctl stop -w');

> I'don't like adding a couple seconds of test runtime for the benefit of
> very slow platforms.

Me either.  This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.  I think we ought
to just fix it.  I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I'don't like adding a couple seconds of test runtime for the benefit of
>> very slow platforms.

> Me either.  This is the first time I've seen an indication that the
> start_postmaster change mentioned in the comment is actually important
> for production use, rather than just being cleanup.  I think we ought
> to just fix it.  I'm willing to take care of the Unix side if someone
> will explain how to change the Windows side.

Attached is a draft patch for this.  I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows.  The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster.  If so, I do not know how to fix that, or whether
it's fixable at all.

Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.

If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.

            regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..f0025d7 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 28,33 ****
--- 28,34 ----
  #include <time.h>
  #include <sys/types.h>
  #include <sys/stat.h>
+ #include <sys/wait.h>
  #include <unistd.h>

  #ifdef HAVE_SYS_RESOURCE_H
*************** static int    CreateRestrictedProcess(char
*** 153,162 ****
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int    start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 ----
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*************** free_readfile(char **optlines)
*** 419,454 ****
   * start/test/stop routines
   */

! static int
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
       * everything to a shell to process them.
-      *
-      * XXX it would be better to fork and exec so that we would know the child
-      * postmaster's PID directly; then test_postmaster_connection could use
-      * the PID without having to rely on reading it back from the pidfile.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     return system(cmd);
  #else                            /* WIN32 */

      /*
!      * On win32 we don't use system(). So we don't need to use & (which would
!      * be START /B on win32). However, we still call the shell (CMD.EXE) with
!      * it to handle redirection etc.
       */
      PROCESS_INFORMATION pi;

--- 420,479 ----
   * start/test/stop routines
   */

! static pgpid_t
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32
+     pgpid_t        pm_pid;
+
+     /* Flush stdio channels just before fork, to avoid double-output problems */
+     fflush(stdout);
+     fflush(stderr);
+
+     pm_pid = fork();
+     if (pm_pid < 0)
+     {
+         /* fork failed */
+         write_stderr(_("%s: could not start server: %s\n"),
+                      progname, strerror(errno));
+         exit(1);
+     }
+     if (pm_pid > 0)
+     {
+         /* fork succeeded, in parent */
+         return pm_pid;
+     }
+
+     /* fork succeeded, in child */

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
       * everything to a shell to process them.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
!
!     /* exec failed */
!     write_stderr(_("%s: could not start server: %s\n"),
!                  progname, strerror(errno));
!     exit(1);
!
!     return 0;                    /* keep dumb compilers quiet */
!
  #else                            /* WIN32 */

      /*
!      * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
!      * handle redirection etc.
       */
      PROCESS_INFORMATION pi;

*************** start_postmaster(void)
*** 460,469 ****
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!         return GetLastError();
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return 0;
  #endif   /* WIN32 */
  }

--- 485,498 ----
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!     {
!         write_stderr(_("%s: could not start server: error code %lu\n"),
!                      progname, (unsigned long) GetLastError());
!         exit(1);
!     }
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return pi.dwProcessId;
  #endif   /* WIN32 */
  }

*************** start_postmaster(void)
*** 476,486 ****
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
-     bool        found_stale_pidfile = false;
-     pgpid_t        pm_pid = 0;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

--- 505,513 ----
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

*************** test_postmaster_connection(bool do_check
*** 535,563 ****
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     long        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for a standalone backend
!                      * (negative PID), or the recorded start time is before
!                      * pg_ctl started, then either we are looking at the wrong
!                      * data directory, or this is a pre-existing pidfile that
!                      * hasn't (yet?) been overwritten by our child postmaster.
!                      * Allow 2 seconds slop for possible cross-process clock
!                      * skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmpid <= 0 || pmstart < start_time - 2)
!                     {
!                         /*
!                          * Set flag to report stale pidfile if it doesn't get
!                          * overwritten before we give up waiting.
!                          */
!                         found_stale_pidfile = true;
!                     }
!                     else
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
--- 562,581 ----
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     pgpid_t        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for the wrong PID, or the
!                      * recorded start time is before pg_ctl started, then
!                      * either we are looking at the wrong data directory, or
!                      * this is a pre-existing pidfile that hasn't (yet?) been
!                      * overwritten by our child postmaster.  Allow 2 seconds
!                      * slop for possible cross-process clock skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmpid == pm_pid && pmstart >= start_time - 2)
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
*************** test_postmaster_connection(bool do_check
*** 567,575 ****
                          char       *hostaddr;
                          char        host_str[MAXPGPATH];

-                         found_stale_pidfile = false;
-                         pm_pid = (pgpid_t) pmpid;
-
                          /*
                           * Extract port number and host string to use. Prefer
                           * using Unix socket if available.
--- 585,590 ----
*************** test_postmaster_connection(bool do_check
*** 635,675 ****
          }

          /*
!          * The postmaster should create postmaster.pid very soon after being
!          * started.  If it's not there after we've waited 5 or more seconds,
!          * assume startup failed and give up waiting.  (Note this covers both
!          * cases where the pidfile was never created, and where it was created
!          * and then removed during postmaster exit.)  Also, if there *is* a
!          * file there but it appears stale, issue a suitable warning and give
!          * up waiting.
           */
!         if (i >= 5)
          {
!             struct stat statbuf;
!
!             if (stat(pid_file, &statbuf) != 0)
!             {
!                 if (errno != ENOENT)
!                     write_stderr(_("\n%s: could not stat file \"%s\": %s\n"),
!                                  progname, pid_file, strerror(errno));
!                 return PQPING_NO_RESPONSE;
!             }

!             if (found_stale_pidfile)
!             {
!                 write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"),
!                              progname);
!                 return PQPING_NO_RESPONSE;
!             }
          }

          /*
!          * If we've been able to identify the child postmaster's PID, check
!          * the process is still alive.  This covers cases where the postmaster
!          * successfully created the pidfile but then crashed without removing
!          * it.
           */
!         if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
--- 650,671 ----
          }

          /*
!          * Reap child process if it died; if it remains in zombie state then
!          * our kill-based test will think it's still running.
           */
! #ifndef WIN32
          {
!             int            exitstatus;

!             (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);
          }
+ #endif

          /*
!          * Check whether the child postmaster process is still alive.  This
!          * lets us exit early if the postmaster fails during startup.
           */
!         if (!postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
*************** static void
*** 836,842 ****
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     int            exitcode;

      if (ctl_command != RESTART_COMMAND)
      {
--- 832,838 ----
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     pgpid_t        pm_pid;

      if (ctl_command != RESTART_COMMAND)
      {
*************** do_start(void)
*** 876,894 ****
      }
  #endif

!     exitcode = start_postmaster();
!     if (exitcode != 0)
!     {
!         write_stderr(_("%s: could not start server: exit code was %d\n"),
!                      progname, exitcode);
!         exit(1);
!     }

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
--- 872,884 ----
      }
  #endif

!     pm_pid = start_postmaster();

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(pm_pid, false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1585,1591 ****
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
--- 1575,1581 ----
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1606,1613 ****
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(true), so it might not start
!                  * from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

--- 1596,1602 ----
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(), so it might not start from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index dae47a8..2f93aa0 100644
*** a/src/bin/pg_ctl/t/001_start_stop.pl
--- b/src/bin/pg_ctl/t/001_start_stop.pl
*************** else
*** 32,39 ****
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start succeeds');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
--- 32,39 ----
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start fails');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],

Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Andrew Dunstan
Date:

On 09/03/2015 03:31 PM, Tom Lane wrote:
> I wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> I'don't like adding a couple seconds of test runtime for the benefit of
>>> very slow platforms.
>> Me either.  This is the first time I've seen an indication that the
>> start_postmaster change mentioned in the comment is actually important
>> for production use, rather than just being cleanup.  I think we ought
>> to just fix it.  I'm willing to take care of the Unix side if someone
>> will explain how to change the Windows side.
> Attached is a draft patch for this.  I think it's fine for Unix (unless
> someone wants to object to relying on "/bin/sh -c"), but I have no idea
> whether it works for Windows.  The main risk is that if CMD.EXE runs
> the postmaster as a subprocess rather than overlaying itself a la shell
> "exec", the PID we'd get back would apply only to CMD.EXE not to the
> eventual postmaster.  If so, I do not know how to fix that, or whether
> it's fixable at all.
>
> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.
>
> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.



There is no equivalent of execl, nor a cmd.exe exquivalent of the 
shell's exec. But surely the equivalent of the fork/execl you're doing 
here would be a simple CreateProcess(). I don't see why you need a shell 
in the middle on Windows at all.

cheers

andrew




Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> There is no equivalent of execl, nor a cmd.exe exquivalent of the 
> shell's exec. But surely the equivalent of the fork/execl you're doing 
> here would be a simple CreateProcess(). I don't see why you need a shell 
> in the middle on Windows at all.

The problem is to get the output redirection to work.  I imagine it could
be rewritten without involving the shell, but it'd be less than trivial
(or at least, I'm not going to do it).

If we did get that to work, it could probably be applied to the
service-start case as well, which currently just does a CreateProcess and
pays no attention to what happens to postmaster stderr.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > There is no equivalent of execl, nor a cmd.exe exquivalent of the 
> > shell's exec. But surely the equivalent of the fork/execl you're doing 
> > here would be a simple CreateProcess(). I don't see why you need a shell 
> > in the middle on Windows at all.
> 
> The problem is to get the output redirection to work.  I imagine it could
> be rewritten without involving the shell, but it'd be less than trivial
> (or at least, I'm not going to do it).

In the CreateProcess model, I think you can use startupInfo.hStdOutput,
see

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Noah Misch
Date:
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> I wrote:
> > This is the first time I've seen an indication that the
> > start_postmaster change mentioned in the comment is actually important
> > for production use, rather than just being cleanup.

I scratched my head awhile without thinking of a credible production use case
that would notice the difference.  Which one did you have in mind?

> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.

That's the better behavior, agreed.



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
>>> This is the first time I've seen an indication that the
>>> start_postmaster change mentioned in the comment is actually important
>>> for production use, rather than just being cleanup.

> I scratched my head awhile without thinking of a credible production use case
> that would notice the difference.  Which one did you have in mind?

Well, mainly that it's making our regression tests fail, which suggests
behavioral instability that could be important for production.

Aside from the specific case you diagnosed, it's clear from buildfarm
results that the five-second timeout elsewhere in
test_postmaster_connection has got funny behavior under load; there are
way too many cases where pg_ctl gives up after exactly that long, with
no useful information printed.  The draft patch I posted gets rid of that
behavior ...
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Noah Misch
Date:
On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> >>> This is the first time I've seen an indication that the
> >>> start_postmaster change mentioned in the comment is actually important
> >>> for production use, rather than just being cleanup.
> 
> > I scratched my head awhile without thinking of a credible production use case
> > that would notice the difference.  Which one did you have in mind?
> 
> Well, mainly that it's making our regression tests fail, which suggests
> behavioral instability that could be important for production.

I doubt issuing two "pg_ctl start" within two seconds, with no intervening
stop, has a production use.  However, ...

> Aside from the specific case you diagnosed, it's clear from buildfarm
> results that the five-second timeout elsewhere in
> test_postmaster_connection has got funny behavior under load; there are
> way too many cases where pg_ctl gives up after exactly that long, with
> no useful information printed.  The draft patch I posted gets rid of that
> behavior ...

... fixing this is a good deal more valuable.  Besides coping with stalls in
early postmaster startup, pg_ctl would no longer take 5s to exit when the
postmaster reports a postgresql.conf defect or other early error.

On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

That seems prudent.  We'll keep meeting buildfarm members with performance
variability sufficient to reach that 5s timeout.  (Incidentally, the current
coding first appeared in 9.1.)

> --- 650,671 ----
>           }
>   
>           /*
> !          * Reap child process if it died; if it remains in zombie state then
> !          * our kill-based test will think it's still running.
>            */
> ! #ifndef WIN32
>           {
> !             int            exitstatus;
>   
> !             (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);

Let's report any exit status.  When a signal kills the young postmaster, it's
a waste to omit that detail.

>           }
> + #endif
>   
>           /*
> !          * Check whether the child postmaster process is still alive.  This
> !          * lets us exit early if the postmaster fails during startup.
>            */
> !         if (!postmaster_is_alive((pid_t) pm_pid))
>               return PQPING_NO_RESPONSE;

kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.

Thanks,
nm



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.

The kill() coding works on Windows (I believe); waitpid() not so much.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Noah Misch
Date:
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.
> 
> The kill() coding works on Windows (I believe); waitpid() not so much.

Windows has the concept under a different name.  See postmaster.c.



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
>> The kill() coding works on Windows (I believe); waitpid() not so much.

> Windows has the concept under a different name.  See postmaster.c.

Well, if someone else wants to code and test that, feel free.  I don't
do Windows ...
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
I wrote:
> Attached is a draft patch for this.  I think it's fine for Unix (unless
> someone wants to object to relying on "/bin/sh -c"), but I have no idea
> whether it works for Windows.  The main risk is that if CMD.EXE runs
> the postmaster as a subprocess rather than overlaying itself a la shell
> "exec", the PID we'd get back would apply only to CMD.EXE not to the
> eventual postmaster.  If so, I do not know how to fix that, or whether
> it's fixable at all.

> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.

> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

Nobody seems to have stepped up to fix the Windows side of this, but
after some more thought it occurred to me that maybe it doesn't need
(much) fixing.  There are two things that pg_ctl wants the PID for:

1. To check if the postmaster.pid file contains the expected child process
PID.  Well, that's a nice-to-have, but we never had it before and got
along well enough, because the start-timestamp comparison check covers
most real-world cases.  We can omit the PID-match check on Windows.

2. To see whether the child postmaster process is still running, via a
"kill(pid, 0)" test.  But if we've launched the postmaster through a
shell, and not used "&" semantics, presumably that shell will wait for its
child process to exit and then exit itself.  So *testing whether the shell
is still there is just about as good as testing the real postmaster PID*.

Therefore, we don't really need to worry about rewriting the Windows
subprocess-launch logic.  If someone would like to do it, that would
be nice, but we don't have to wait for that to happen before we can
get rid of the 5-second-timeout issues.

So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before.  I believe that this
might actually work on Windows, but I have no way to test it.  Someone
please try that?  (Don't forget to test the service-start path, too.)

            regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..62db328 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 28,33 ****
--- 28,34 ----
  #include <time.h>
  #include <sys/types.h>
  #include <sys/stat.h>
+ #include <sys/wait.h>
  #include <unistd.h>

  #ifdef HAVE_SYS_RESOURCE_H
*************** static int    CreateRestrictedProcess(char
*** 153,162 ****
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int    start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 ----
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*************** free_readfile(char **optlines)
*** 419,454 ****
   * start/test/stop routines
   */

! static int
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
!      * everything to a shell to process them.
!      *
!      * XXX it would be better to fork and exec so that we would know the child
!      * postmaster's PID directly; then test_postmaster_connection could use
!      * the PID without having to rely on reading it back from the pidfile.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     return system(cmd);
  #else                            /* WIN32 */

      /*
!      * On win32 we don't use system(). So we don't need to use & (which would
!      * be START /B on win32). However, we still call the shell (CMD.EXE) with
!      * it to handle redirection etc.
       */
      PROCESS_INFORMATION pi;

--- 420,489 ----
   * start/test/stop routines
   */

! /*
!  * Start the postmaster and return its PID.
!  *
!  * Currently, on Windows what we return is the PID of the shell process
!  * that launched the postmaster (and, we trust, is waiting for it to exit).
!  * So the PID is usable for "is the postmaster still running" checks,
!  * but cannot be compared directly to postmaster.pid.
!  */
! static pgpid_t
  start_postmaster(void)
  {
      char        cmd[MAXPGPATH];

  #ifndef WIN32
+     pgpid_t        pm_pid;
+
+     /* Flush stdio channels just before fork, to avoid double-output problems */
+     fflush(stdout);
+     fflush(stderr);
+
+     pm_pid = fork();
+     if (pm_pid < 0)
+     {
+         /* fork failed */
+         write_stderr(_("%s: could not start server: %s\n"),
+                      progname, strerror(errno));
+         exit(1);
+     }
+     if (pm_pid > 0)
+     {
+         /* fork succeeded, in parent */
+         return pm_pid;
+     }
+
+     /* fork succeeded, in child */

      /*
       * Since there might be quotes to handle here, it is easier simply to pass
!      * everything to a shell to process them.  Use exec so that the postmaster
!      * has the same PID as the current child process.
       */
      if (log_file != NULL)
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts,
                   DEVNULL, log_file);
      else
!         snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
                   exec_path, pgdata_opt, post_opts, DEVNULL);

!     (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
!
!     /* exec failed */
!     write_stderr(_("%s: could not start server: %s\n"),
!                  progname, strerror(errno));
!     exit(1);
!
!     return 0;                    /* keep dumb compilers quiet */
!
  #else                            /* WIN32 */

      /*
!      * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
!      * handle redirection etc.  Unfortunately CMD.EXE lacks any equivalent of
!      * "exec", so we don't get to find out the postmaster's PID immediately.
       */
      PROCESS_INFORMATION pi;

*************** start_postmaster(void)
*** 460,469 ****
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!         return GetLastError();
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return 0;
  #endif   /* WIN32 */
  }

--- 495,508 ----
                   exec_path, pgdata_opt, post_opts, DEVNULL);

      if (!CreateRestrictedProcess(cmd, &pi, false))
!     {
!         write_stderr(_("%s: could not start server: error code %lu\n"),
!                      progname, (unsigned long) GetLastError());
!         exit(1);
!     }
      CloseHandle(pi.hProcess);
      CloseHandle(pi.hThread);
!     return pi.dwProcessId;        /* Shell's PID, not postmaster's! */
  #endif   /* WIN32 */
  }

*************** start_postmaster(void)
*** 472,486 ****
  /*
   * Find the pgport and try a connection
   *
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
-     bool        found_stale_pidfile = false;
-     pgpid_t        pm_pid = 0;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

--- 511,527 ----
  /*
   * Find the pgport and try a connection
   *
+  * On Unix, pm_pid is the PID of the just-launched postmaster.  On Windows,
+  * it's the PID of an ancestor shell process, so we can't check the contents
+  * of postmaster.pid quite as carefully.
+  *
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
  static PGPing
! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
  {
      PGPing        ret = PQPING_NO_RESPONSE;
      char        connstr[MAXPGPATH * 2 + 256];
      int            i;

*************** test_postmaster_connection(bool do_check
*** 535,563 ****
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     long        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for a standalone backend
!                      * (negative PID), or the recorded start time is before
!                      * pg_ctl started, then either we are looking at the wrong
!                      * data directory, or this is a pre-existing pidfile that
!                      * hasn't (yet?) been overwritten by our child postmaster.
!                      * Allow 2 seconds slop for possible cross-process clock
!                      * skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmpid <= 0 || pmstart < start_time - 2)
!                     {
!                         /*
!                          * Set flag to report stale pidfile if it doesn't get
!                          * overwritten before we give up waiting.
!                          */
!                         found_stale_pidfile = true;
!                     }
!                     else
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
--- 576,602 ----
                           optlines[5] != NULL)
                  {
                      /* File is complete enough for us, parse it */
!                     pgpid_t        pmpid;
                      time_t        pmstart;

                      /*
!                      * Make sanity checks.  If it's for the wrong PID, or the
!                      * recorded start time is before pg_ctl started, then
!                      * either we are looking at the wrong data directory, or
!                      * this is a pre-existing pidfile that hasn't (yet?) been
!                      * overwritten by our child postmaster.  Allow 2 seconds
!                      * slop for possible cross-process clock skew.
                       */
                      pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
                      pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmstart >= start_time - 2 &&
! #ifndef WIN32
!                         pmpid == pm_pid
! #else
!                     /* Windows can only reject standalone-backend PIDs */
!                         pmpid > 0
! #endif
!                         )
                      {
                          /*
                           * OK, seems to be a valid pidfile from our child.
*************** test_postmaster_connection(bool do_check
*** 567,575 ****
                          char       *hostaddr;
                          char        host_str[MAXPGPATH];

-                         found_stale_pidfile = false;
-                         pm_pid = (pgpid_t) pmpid;
-
                          /*
                           * Extract port number and host string to use. Prefer
                           * using Unix socket if available.
--- 606,611 ----
*************** test_postmaster_connection(bool do_check
*** 635,675 ****
          }

          /*
!          * The postmaster should create postmaster.pid very soon after being
!          * started.  If it's not there after we've waited 5 or more seconds,
!          * assume startup failed and give up waiting.  (Note this covers both
!          * cases where the pidfile was never created, and where it was created
!          * and then removed during postmaster exit.)  Also, if there *is* a
!          * file there but it appears stale, issue a suitable warning and give
!          * up waiting.
           */
!         if (i >= 5)
          {
!             struct stat statbuf;
!
!             if (stat(pid_file, &statbuf) != 0)
!             {
!                 if (errno != ENOENT)
!                     write_stderr(_("\n%s: could not stat file \"%s\": %s\n"),
!                                  progname, pid_file, strerror(errno));
!                 return PQPING_NO_RESPONSE;
!             }

!             if (found_stale_pidfile)
!             {
!                 write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"),
!                              progname);
!                 return PQPING_NO_RESPONSE;
!             }
          }

          /*
!          * If we've been able to identify the child postmaster's PID, check
!          * the process is still alive.  This covers cases where the postmaster
!          * successfully created the pidfile but then crashed without removing
!          * it.
           */
!         if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
--- 671,695 ----
          }

          /*
!          * Reap child process if it died; if it remains in zombie state then
!          * our kill-based test will think it's still running.
           */
! #ifndef WIN32
          {
!             int            exitstatus;

!             (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);
          }
+ #endif

          /*
!          * Check whether the child postmaster process is still alive.  This
!          * lets us exit early if the postmaster fails during startup.
!          *
!          * On Windows, we are checking the postmaster's parent shell, but
!          * that's fine for this purpose.
           */
!         if (!postmaster_is_alive((pid_t) pm_pid))
              return PQPING_NO_RESPONSE;

          /* No response, or startup still in process; wait */
*************** static void
*** 836,842 ****
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     int            exitcode;

      if (ctl_command != RESTART_COMMAND)
      {
--- 856,862 ----
  do_start(void)
  {
      pgpid_t        old_pid = 0;
!     pgpid_t        pm_pid;

      if (ctl_command != RESTART_COMMAND)
      {
*************** do_start(void)
*** 876,894 ****
      }
  #endif

!     exitcode = start_postmaster();
!     if (exitcode != 0)
!     {
!         write_stderr(_("%s: could not start server: exit code was %d\n"),
!                      progname, exitcode);
!         exit(1);
!     }

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
--- 896,908 ----
      }
  #endif

!     pm_pid = start_postmaster();

      if (do_wait)
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(pm_pid, false))
          {
              case PQPING_OK:
                  print_msg(_(" done\n"));
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1585,1591 ****
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
--- 1599,1605 ----
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1606,1613 ****
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(true), so it might not start
!                  * from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

--- 1620,1626 ----
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(), so it might not start from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;;

diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 177a16f..3cd5a44 100644
*** a/src/bin/pg_ctl/t/001_start_stop.pl
--- b/src/bin/pg_ctl/t/001_start_stop.pl
*************** else
*** 34,41 ****
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start succeeds');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
--- 34,41 ----
  close CONF;
  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
      'pg_ctl start -w');
! command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
!     'second pg_ctl start fails');
  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
      'pg_ctl stop -w');
  command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],

Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
I wrote:
> So the attached modified patch adjusts the PID-match logic and some
> comments, but is otherwise what I posted before.  I believe that this
> might actually work on Windows, but I have no way to test it.  Someone
> please try that?  (Don't forget to test the service-start path, too.)

Has anyone tried that patch on Windows?  I'd like to push it in hopes of
improving buildfarm stability, but I'm hesitant to do so without some
confirmation that it acts as I think it will.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)
>
> Has anyone tried that patch on Windows?  I'd like to push it in hopes of
> improving buildfarm stability, but I'm hesitant to do so without some
> confirmation that it acts as I think it will.

I marked this patch as something to look at, though I haven't take the
time to do it yet...
-- 
Michael



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Wed, Oct 7, 2015 at 7:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> So the attached modified patch adjusts the PID-match logic and some
>>> comments, but is otherwise what I posted before.  I believe that this
>>> might actually work on Windows, but I have no way to test it.  Someone
>>> please try that?  (Don't forget to test the service-start path, too.)
>>
>> Has anyone tried that patch on Windows?  I'd like to push it in hopes of
>> improving buildfarm stability, but I'm hesitant to do so without some
>> confirmation that it acts as I think it will.
>
> I marked this patch as something to look at, though I haven't take the
> time to do it yet...

s/take/taken
-- 
Michael



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
> So the attached modified patch adjusts the PID-match logic and some
> comments, but is otherwise what I posted before.  I believe that this
> might actually work on Windows, but I have no way to test it.  Someone
> please try that?  (Don't forget to test the service-start path, too.)

If "pg_ctl start" is invoked directly from a command prompt, it works.
Now, if I run "pg_ctl start" within a script (in an msysgit
environment for example), pg_ctl fails, complaining that
postmaster.pid already exists. The TAP tests get broken as well,
attached are the log files of the run, and here is an interested bit:
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start... stopped waiting
not ok 12 - pg_ctl start -w
#   Failed test 'pg_ctl start -w'
#   at C:/Users/ioltas/git/postgres/src/test/perl/TestLib.pm line 282.
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start...FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 4040) running in data directory
"C:/Users/ioltas/git/postgres/src/bin/pg_ctl/tmp_testor2K/data"?
LOG:  database system was shut down at 2015-10-06 23:58:02 PDT
LOG:  MultiXact member wraparound protections are now enabled
FATAL:  the database system is starting up
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
 stopped waiting
Registering a service to the SCM works, but the service registered
refuses to start.
Regards,
--
Michael

Attachment

Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)

> If "pg_ctl start" is invoked directly from a command prompt, it works.
> Now, if I run "pg_ctl start" within a script (in an msysgit
> environment for example), pg_ctl fails, complaining that
> postmaster.pid already exists. The TAP tests get broken as well,

Hm, the TAP trace makes it look like the postmaster start does actually
work, but pg_ctl isn't correctly waiting for that to happen.  (The
complaint about "already exists" seems to be from the second start
attempt, and is exactly what to expect there.)  I could believe that
I'd fat-fingered the PID comparison logic in test_postmaster_connection,
but I don't understand how it would work from a command prompt but not
from a script.  Any ideas?
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)

> If "pg_ctl start" is invoked directly from a command prompt, it works.
> Now, if I run "pg_ctl start" within a script (in an msysgit
> environment for example), pg_ctl fails, complaining that
> postmaster.pid already exists. The TAP tests get broken as well,

Reading that again, I think you mean that "pg_ctl start" works but you
didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
issue here, and the failing test cases are using that.

After studying the test logs more carefully, I think the behavior they're
showing is consistent with the idea that postmaster_is_alive() doesn't
work on the CMD shell process.  Which seems very likely now that I think
about it, because we're depending on an implementation of kill() that
is PG-specific.

So in fact this idea doesn't work :-(.

I think there is still room to salvage something without fully rewriting
the postmaster invocation logic to avoid using CMD, because it's still
true that checking whether the CMD process is still there should be as
good as checking the postmaster proper.  We just can't use kill() for it.
I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
mentioned upthread, if we're using fork/exec we might as well pay
attention to waitpid's results instead.  On Windows, I imagine that the
thing to do is have start_postmaster() save aside the handle for the CMD
process, and then in test_postmaster_connection(), check the handle state
to see if the process is still running (I assume there's a way to do
that).

I can take care of the Unix side of that, but as before, somebody else
would need to code and test the Windows side.  It's probably just a few
lines of code to add ... any volunteers?
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>>> So the attached modified patch adjusts the PID-match logic and some
>>> comments, but is otherwise what I posted before.  I believe that this
>>> might actually work on Windows, but I have no way to test it.  Someone
>>> please try that?  (Don't forget to test the service-start path, too.)
>
>> If "pg_ctl start" is invoked directly from a command prompt, it works.
>> Now, if I run "pg_ctl start" within a script (in an msysgit
>> environment for example), pg_ctl fails, complaining that
>> postmaster.pid already exists. The TAP tests get broken as well,
>
> Reading that again, I think you mean that "pg_ctl start" works but you
> didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
> issue here, and the failing test cases are using that.

Yes, sorry. I fat-fingered the command from the prompt and forgot the
-w switch. test_postmaster_connection just behaves incorrectly, and
exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting,
could not start server, blah".

> I think there is still room to salvage something without fully rewriting
> the postmaster invocation logic to avoid using CMD, because it's still
> true that checking whether the CMD process is still there should be as
> good as checking the postmaster proper.  We just can't use kill() for it.
> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
> mentioned upthread, if we're using fork/exec we might as well pay
> attention to waitpid's results instead.  On Windows, I imagine that the
> thing to do is have start_postmaster() save aside the handle for the CMD
> process, and then in test_postmaster_connection(), check the handle state
> to see if the process is still running (I assume there's a way to do
> that).

That would be WaitForSingleObject(handle, ms_timeout) ==
WAIT_OBJECT_0, no? The handle should be picked up from
CreateRestrictedProcess, and I think that CloseHandle should not be
called on pi.hProcess if we are going to wait for it afterwards.
Reference:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx

> I can take care of the Unix side of that, but as before, somebody else
> would need to code and test the Windows side.  It's probably just a few
> lines of code to add ... any volunteers?

I could give it a shot, do you still want to rework the Unix portion
of the patch? There are not many folks around able to test the patch
either way.
Regards,
-- 
Michael



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Thu, Oct 8, 2015 at 3:09 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>>>> So the attached modified patch adjusts the PID-match logic and some
>>>> comments, but is otherwise what I posted before.  I believe that this
>>>> might actually work on Windows, but I have no way to test it.  Someone
>>>> please try that?  (Don't forget to test the service-start path, too.)
>>
>>> If "pg_ctl start" is invoked directly from a command prompt, it works.
>>> Now, if I run "pg_ctl start" within a script (in an msysgit
>>> environment for example), pg_ctl fails, complaining that
>>> postmaster.pid already exists. The TAP tests get broken as well,
>>
>> Reading that again, I think you mean that "pg_ctl start" works but you
>> didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
>> issue here, and the failing test cases are using that.
>
> Yes, sorry. I fat-fingered the command from the prompt and forgot the
> -w switch. test_postmaster_connection just behaves incorrectly, and
> exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting,
> could not start server, blah".
>
>> I think there is still room to salvage something without fully rewriting
>> the postmaster invocation logic to avoid using CMD, because it's still
>> true that checking whether the CMD process is still there should be as
>> good as checking the postmaster proper.  We just can't use kill() for it.
>> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
>> mentioned upthread, if we're using fork/exec we might as well pay
>> attention to waitpid's results instead.  On Windows, I imagine that the
>> thing to do is have start_postmaster() save aside the handle for the CMD
>> process, and then in test_postmaster_connection(), check the handle state
>> to see if the process is still running (I assume there's a way to do
>> that).
>
> That would be WaitForSingleObject(handle, ms_timeout) ==
> WAIT_OBJECT_0, no? The handle should be picked up from
> CreateRestrictedProcess, and I think that CloseHandle should not be
> called on pi.hProcess if we are going to wait for it afterwards.
> Reference:
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx

I had a look at that, and attached is an updated patch showing the
concept of using an HANDLE that pg_ctl waits for. I agree that saving
an HANDLE the way this patch does using a static variable is a bit
ugly especially knowing that service registration uses
test_postmaster_connection as well with directly a postmaster. The
good thing is that tapcheck runs smoothly, with one single failure
though: the second call to pg_ctl start -w may succeed instead of
failing if kicked within an interval of 3 seconds after the first one,
based on the tests on my Windows VM. My guess is that this is caused
by the fact that we monitor the shell process and not the postmaster
itself. Thoughts?
Regards,
--
Michael

Attachment

Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Alvaro Herrera
Date:
Michael Paquier wrote:

> > That would be WaitForSingleObject(handle, ms_timeout) ==
> > WAIT_OBJECT_0, no? The handle should be picked up from
> > CreateRestrictedProcess, and I think that CloseHandle should not be
> > called on pi.hProcess if we are going to wait for it afterwards.
> > Reference:
> > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx
> 
> I had a look at that, and attached is an updated patch showing the
> concept of using an HANDLE that pg_ctl waits for.

I wonder if you are interested in rewriting this whole thing to not use
cmd.exe at all, which as I understand is just about output redirection.
Earlier in the thread I pointed out that on Microsoft docs they claim
that you can redirect stdout/stderr when creating a new process without
going through cmd.exe; see
https://www.postgresql.org/message-id/20150903202114.GG2912%40alvherre.pgsql

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I wonder if you are interested in rewriting this whole thing to not use
> cmd.exe at all, which as I understand is just about output redirection.

FWIW, I have little interest in going there, because I'm afraid that
getting it to be bug-compatible with the existing behavior on points such
as quoting of -o arguments would be extremely fiddly.  Perhaps it'd be
worth doing it in master, but I'd be afraid to back-patch such a change
--- and what I'm after right now is a back-patchable way of getting rid
of the 5-second-timeout behavior.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
>> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think there is still room to salvage something without fully rewriting
>>> the postmaster invocation logic to avoid using CMD, because it's still
>>> true that checking whether the CMD process is still there should be as
>>> good as checking the postmaster proper.  We just can't use kill() for it.

> I had a look at that, and attached is an updated patch showing the
> concept of using an HANDLE that pg_ctl waits for. I agree that saving
> an HANDLE the way this patch does using a static variable is a bit
> ugly especially knowing that service registration uses
> test_postmaster_connection as well with directly a postmaster. The
> good thing is that tapcheck runs smoothly, with one single failure
> though: the second call to pg_ctl start -w may succeed instead of
> failing if kicked within an interval of 3 seconds after the first one,
> based on the tests on my Windows VM. My guess is that this is caused
> by the fact that we monitor the shell process and not the postmaster
> itself. Thoughts?

Yeah, it can still succeed because pg_ctl can't tell that the
postmaster.pid created by the earlier invocation isn't the one it wants.
It adopts the values out of that file, tests the connection, finds it
works, and declares victory, not realizing that the postmaster *it*
started will soon fail (or maybe already has).

Waiting more than 2 seconds is enough to make sure that
test_postmaster_connection sees the pre-existing pidfile as stale and
doesn't believe that it represents a successful postmaster start.

So there's still something to be desired on Windows; but it's still
better than before in that we can reliably detect child process exit
instead of having to use the five-second heuristic.  And of course on
Unix this is way better than before.

So I've pushed this with some cosmetic adjustments, as well as the not
so cosmetic adjustment of making the service-start path also use handle
testing.  If there are remaining problems, the buildfarm should tell us.

I'm not sure if this will completely fix our problems with "pg_ctl start"
related buildfarm failures on very slow critters.  It does get rid of the
hard wired 5-second timeout, but the 60-second timeout could still be an
issue.  I think Noah was considering a patch to allow that number to be
raised.  I'd be in favor of letting pg_ctl accept a default timeout length
from an environment variable, and then the slower critters could be fixed
by adjusting their buildfarm configurations.
        regards, tom lane



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Michael Paquier
Date:
On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So there's still something to be desired on Windows; but it's still
> better than before in that we can reliably detect child process exit
> instead of having to use the five-second heuristic.  And of course on
> Unix this is way better than before.
>
> So I've pushed this with some cosmetic adjustments, as well as the not
> so cosmetic adjustment of making the service-start path also use handle
> testing.  If there are remaining problems, the buildfarm should tell us.

Thanks! I used at some point when hacking postmasterProcess as well to
save the handle but for clarity as this was already used for shutdown
when running things as a service of the Windows SCM I put that in
another variable. So far so good based on some extra testing I just
did now.

> I'm not sure if this will completely fix our problems with "pg_ctl start"
> related buildfarm failures on very slow critters.  It does get rid of the
> hard wired 5-second timeout, but the 60-second timeout could still be an
> issue.  I think Noah was considering a patch to allow that number to be
> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
> from an environment variable, and then the slower critters could be fixed
> by adjusting their buildfarm configurations.

Being able to pass that as a command-line parameter (master-only
change) would be welcome as well IMO.
-- 
Michael



Re: pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not sure if this will completely fix our problems with "pg_ctl start"
>> related buildfarm failures on very slow critters.  It does get rid of the
>> hard wired 5-second timeout, but the 60-second timeout could still be an
>> issue.  I think Noah was considering a patch to allow that number to be
>> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
>> from an environment variable, and then the slower critters could be fixed
>> by adjusting their buildfarm configurations.

> Being able to pass that as a command-line parameter (master-only
> change) would be welcome as well IMO.

Huh?  The --timeout parameter has always been there.  I'm just expressing
an opinion that modifying all the makefiles and test scripts to pass
through a timeout setting would be too messy.
        regards, tom lane