Thread: [HACKERS] Reducing pg_ctl's reaction time

[HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

26 June 2017, 04:13:18

I still have a bee in my bonnet about how slow the recovery TAP tests
are, and especially about how low the CPU usage is while they run,
suggesting that a lot of the wall clock time is being expended on
useless sleeps.  Some analysis I did today found some low-hanging fruit
there: a significant part of the time is being spent in pg_ctl waiting
for postmaster start/stop operations.  It waits in quanta of 1 second,
but the postmaster usually starts or stops in much less than that.
(In these tests, most of the shutdown checkpoints have little to do,
so that in many cases the postmaster stops in just a couple of msec.
Startup isn't very many msec either.)

The attached proposed patch adjusts pg_ctl to check every 100msec,
instead of every second, for the postmaster to be done starting or
stopping.  This cuts the runtime of the recovery TAP tests from around
4m30s to around 3m10s on my machine, a good 30% savings.  I experimented
with different check frequencies but there doesn't seem to be anything
to be gained by cutting the delay further (and presumably, it becomes
counterproductive at some point due to pg_ctl chewing too many cycles).

This change probably doesn't offer much real-world benefit, since few
people start/stop their postmasters often, and shutdown checkpoints are
seldom so cheap on production installations.  Still, it doesn't seem
like it could hurt.  The original choice to use once-per-second checks
was made for hardware a lot slower than what most of us use now; I do
not think it's been revisited since the first implementation of pg_ctl
in 1999 (cf 5b912b089).

Barring objections I'd like to push this soon.

            regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 82ac62d..1e6cb69 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
*************** typedef enum
*** 68,73 ****
--- 68,75 ----

  #define DEFAULT_WAIT    60

+ #define WAITS_PER_SEC    10        /* should divide 1000000 evenly */
+
  static bool do_wait = true;
  static int    wait_seconds = DEFAULT_WAIT;
  static bool wait_seconds_arg = false;
*************** test_postmaster_connection(pgpid_t pm_pi
*** 531,537 ****

      connstr[0] = '\0';

!     for (i = 0; i < wait_seconds; i++)
      {
          /* Do we need a connection string? */
          if (connstr[0] == '\0')
--- 533,539 ----

      connstr[0] = '\0';

!     for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
      {
          /* Do we need a connection string? */
          if (connstr[0] == '\0')
*************** test_postmaster_connection(pgpid_t pm_pi
*** 701,724 ****
  #endif

          /* No response, or startup still in process; wait */
! #ifdef WIN32
!         if (do_checkpoint)
          {
!             /*
!              * Increment the wait hint by 6 secs (connection timeout + sleep)
!              * We must do this to indicate to the SCM that our startup time is
!              * changing, otherwise it'll usually send a stop signal after 20
!              * seconds, despite incrementing the checkpoint counter.
!              */
!             status.dwWaitHint += 6000;
!             status.dwCheckPoint++;
!             SetServiceStatus(hStatus, (LPSERVICE_STATUS) &status);
!         }
!         else
  #endif
!             print_msg(".");

!         pg_usleep(1000000);        /* 1 sec */
      }

      /* return result of last call to PQping */
--- 703,730 ----
  #endif

          /* No response, or startup still in process; wait */
!         if (i % WAITS_PER_SEC == 0)
          {
! #ifdef WIN32
!             if (do_checkpoint)
!             {
!                 /*
!                  * Increment the wait hint by 6 secs (connection timeout +
!                  * sleep). We must do this to indicate to the SCM that our
!                  * startup time is changing, otherwise it'll usually send a
!                  * stop signal after 20 seconds, despite incrementing the
!                  * checkpoint counter.
!                  */
!                 status.dwWaitHint += 6000;
!                 status.dwCheckPoint++;
!                 SetServiceStatus(hStatus, (LPSERVICE_STATUS) &status);
!             }
!             else
  #endif
!                 print_msg(".");
!         }

!         pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
      }

      /* return result of last call to PQping */
*************** do_stop(void)
*** 998,1009 ****

          print_msg(_("waiting for server to shut down..."));

!         for (cnt = 0; cnt < wait_seconds; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 print_msg(".");
!                 pg_usleep(1000000); /* 1 sec */
              }
              else
                  break;
--- 1004,1016 ----

          print_msg(_("waiting for server to shut down..."));

!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 if (cnt % WAITS_PER_SEC == 0)
!                     print_msg(".");
!                 pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
              }
              else
                  break;
*************** do_restart(void)
*** 1088,1099 ****

          /* always wait for restart */

!         for (cnt = 0; cnt < wait_seconds; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 print_msg(".");
!                 pg_usleep(1000000); /* 1 sec */
              }
              else
                  break;
--- 1095,1107 ----

          /* always wait for restart */

!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 if (cnt % WAITS_PER_SEC == 0)
!                     print_msg(".");
!                 pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
              }
              else
                  break;
*************** do_promote(void)
*** 1225,1241 ****
      if (do_wait)
      {
          DBState        state = DB_STARTUP;

          print_msg(_("waiting for server to promote..."));
!         while (wait_seconds > 0)
          {
              state = get_control_dbstate();
              if (state == DB_IN_PRODUCTION)
                  break;

!             print_msg(".");
!             pg_usleep(1000000); /* 1 sec */
!             wait_seconds--;
          }
          if (state == DB_IN_PRODUCTION)
          {
--- 1233,1250 ----
      if (do_wait)
      {
          DBState        state = DB_STARTUP;
+         int            cnt;

          print_msg(_("waiting for server to promote..."));
!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              state = get_control_dbstate();
              if (state == DB_IN_PRODUCTION)
                  break;

!             if (cnt % WAITS_PER_SEC == 0)
!                 print_msg(".");
!             pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
          }
          if (state == DB_IN_PRODUCTION)
          {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Michael Paquier

Date:

26 June 2017, 12:26:57

On Mon, Jun 26, 2017 at 7:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The attached proposed patch adjusts pg_ctl to check every 100msec,
> instead of every second, for the postmaster to be done starting or
> stopping.  This cuts the runtime of the recovery TAP tests from around
> 4m30s to around 3m10s on my machine, a good 30% savings.  I experimented
> with different check frequencies but there doesn't seem to be anything
> to be gained by cutting the delay further (and presumably, it becomes
> counterproductive at some point due to pg_ctl chewing too many cycles).

I see with a 2~3% noise similar improvements on my laptop with
non-parallel tests. That's nice.

+#define WAITS_PER_SEC  10      /* should divide 1000000 evenly */
As a matter of style, you could define 1000000 as well in a variable
and refer to the variable for the division.

> Barring objections I'd like to push this soon.

This also pops up more easily failures with 001_stream_rep.pl without
a patch applied from the other recent thread, so this patch had better
not get in before anything from
https://www.postgresql.org/message-id/8962.1498425057@sss.pgh.pa.us.
-- 
Michael

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 01:15:37

Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jun 26, 2017 at 7:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached proposed patch adjusts pg_ctl to check every 100msec,
>> instead of every second, for the postmaster to be done starting or
>> stopping.

>> +#define WAITS_PER_SEC  10      /* should divide 1000000 evenly */

> As a matter of style, you could define 1000000 as well in a variable
> and refer to the variable for the division.

Good idea, done that way.  (My initial thought was to use USECS_PER_SEC
from timestamp.h, but that's declared as int64 which would have
complicated matters, so I just made a new symbol.)

> This also pops up more easily failures with 001_stream_rep.pl without
> a patch applied from the other recent thread, so this patch had better
> not get in before anything from
> https://www.postgresql.org/message-id/8962.1498425057@sss.pgh.pa.us.

Check.  I pushed your fix for that first.

Thanks for the review!
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Jeff Janes

Date:

27 June 2017, 02:06:44

On Mon, Jun 26, 2017 at 12:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jun 26, 2017 at 7:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached proposed patch adjusts pg_ctl to check every 100msec,
>> instead of every second, for the postmaster to be done starting or
>> stopping.

>> +#define WAITS_PER_SEC 10 /* should divide 1000000 evenly */

> As a matter of style, you could define 1000000 as well in a variable
> and refer to the variable for the division.

Good idea, done that way. (My initial thought was to use USECS_PER_SEC
from timestamp.h, but that's declared as int64 which would have
complicated matters, so I just made a new symbol.)

> This also pops up more easily failures with 001_stream_rep.pl without
> a patch applied from the other recent thread, so this patch had better
> not get in before anything from
> https://www.postgresql.org/message-id/8962.1498425057@sss.pgh.pa.us.

Check. I pushed your fix for that first.

Thanks for the review!

regards, tom lane

The 10 fold increase in log spam during long PITR recoveries is a bit unfortunate.

9153 2017-06-26 12:55:40.243 PDT FATAL: the database system is starting up

9154 2017-06-26 12:55:40.345 PDT FATAL: the database system is starting up

9156 2017-06-26 12:55:40.447 PDT FATAL: the database system is starting up

9157 2017-06-26 12:55:40.550 PDT FATAL: the database system is starting up

...

I can live with it, but could we use an escalating wait time so it slows back down to once a second after a while?

Cheers,

Jeff

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 02:19:16

Jeff Janes <jeff.janes@gmail.com> writes:
> The 10 fold increase in log spam during long PITR recoveries is a bit
> unfortunate.

> 9153  2017-06-26 12:55:40.243 PDT FATAL:  the database system is starting up
> 9154  2017-06-26 12:55:40.345 PDT FATAL:  the database system is starting up
> 9156  2017-06-26 12:55:40.447 PDT FATAL:  the database system is starting up
> 9157  2017-06-26 12:55:40.550 PDT FATAL:  the database system is starting up
> ...

> I can live with it, but could we use an escalating wait time so it slows
> back down to once a second after a while?

Sure, what do you think an appropriate behavior would be?
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

27 June 2017, 02:23:56

On 2017-06-26 16:19:16 -0400, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The 10 fold increase in log spam during long PITR recoveries is a bit
> > unfortunate.
> 
> > 9153  2017-06-26 12:55:40.243 PDT FATAL:  the database system is starting up
> > 9154  2017-06-26 12:55:40.345 PDT FATAL:  the database system is starting up
> > 9156  2017-06-26 12:55:40.447 PDT FATAL:  the database system is starting up
> > 9157  2017-06-26 12:55:40.550 PDT FATAL:  the database system is starting up
> > ...
> 
> > I can live with it, but could we use an escalating wait time so it slows
> > back down to once a second after a while?
> 
> Sure, what do you think an appropriate behavior would be?

It'd not be unreasonble to check pg_control first, and only after that
indicates readyness check via the protocol. Doesn't quite seem like
something backpatchable tho.

- Andres

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 02:26:00

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-26 16:19:16 -0400, Tom Lane wrote:
>> Sure, what do you think an appropriate behavior would be?

> It'd not be unreasonble to check pg_control first, and only after that
> indicates readyness check via the protocol.

Hm, that's a thought.  The problem here isn't the frequency of checks,
but the log spam.

> Doesn't quite seem like something backpatchable tho.

I didn't back-patch the pg_ctl change anyway, so that's no issue.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

27 June 2017, 02:33:47

On 2017-06-26 16:26:00 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-06-26 16:19:16 -0400, Tom Lane wrote:
> >> Sure, what do you think an appropriate behavior would be?
> 
> > It'd not be unreasonble to check pg_control first, and only after that
> > indicates readyness check via the protocol.
> 
> Hm, that's a thought.  The problem here isn't the frequency of checks,
> but the log spam.

Right.  I think to deal with hot-standby we'd probably have to add new
state to the control file however. We don't just want to treat the
server as ready once DB_IN_PRODUCTION is reached.

Arguably we could and should improve the logic when the server has
started, right now it's pretty messy because we never treat a standby as
up if hot_standby is disabled...

- Andres

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 02:39:50

I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> It'd not be unreasonble to check pg_control first, and only after that
>> indicates readyness check via the protocol.

> Hm, that's a thought.  The problem here isn't the frequency of checks,
> but the log spam.

Actually, that wouldn't help much as things stand, because you can't
tell from pg_control whether hot standby is active.  Assuming that
we want "pg_ctl start" to come back as soon as connections are allowed,
it'd have to start probing when it sees DB_IN_ARCHIVE_RECOVERY, which
means Jeff still has a problem with long recovery sessions.

We could maybe address that by changing the set of states in pg_control
(or perhaps simpler, adding a "hot standby active" flag there).  That
might have wider consequences than we really want to deal with post-beta1
though.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 02:49:07

Andres Freund <andres@anarazel.de> writes:
> Arguably we could and should improve the logic when the server has
> started, right now it's pretty messy because we never treat a standby as
> up if hot_standby is disabled...

True.  If you could tell the difference between "HS disabled" and "HS not
enabled yet" from pg_control, that would make pg_ctl's behavior with
cold-standby servers much cleaner.  Maybe it *is* worth messing with the
contents of pg_control at this late hour.

My inclination for the least invasive fix is to leave the DBState
enum alone and add a separate hot-standby state field with three
values (disabled/not-yet-enabled/enabled).  Then pg_ctl would start
probing the postmaster when it saw either DB_IN_PRODUCTION DBstate
or hot-standby-enabled.  (It'd almost not have to probe the postmaster
at all, except there's a race condition that the startup process
will probably change the field a little before the postmaster gets
the word to open the gates.)  On the other hand, if it saw
DB_IN_ARCHIVE_RECOVERY with hot standby disabled, it'd stop waiting.

Any objections to that design sketch?  Do we need to distinguish
between master and slave servers in the when-to-stop-waiting logic?
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

27 June 2017, 02:58:53

Hi,

On 2017-06-26 16:49:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Arguably we could and should improve the logic when the server has
> > started, right now it's pretty messy because we never treat a standby as
> > up if hot_standby is disabled...
> 
> True.  If you could tell the difference between "HS disabled" and "HS not
> enabled yet" from pg_control, that would make pg_ctl's behavior with
> cold-standby servers much cleaner.  Maybe it *is* worth messing with the
> contents of pg_control at this late hour.

I'm +0.5.


> My inclination for the least invasive fix is to leave the DBState
> enum alone and add a separate hot-standby state field with three
> values (disabled/not-yet-enabled/enabled).

Yea, that seems sane.


> Then pg_ctl would start
> probing the postmaster when it saw either DB_IN_PRODUCTION DBstate
> or hot-standby-enabled.  (It'd almost not have to probe the postmaster
> at all, except there's a race condition that the startup process
> will probably change the field a little before the postmaster gets
> the word to open the gates.)  On the other hand, if it saw
> DB_IN_ARCHIVE_RECOVERY with hot standby disabled, it'd stop waiting.

It'd be quite possible to address the race-condition by moving the
updating of the control file to postmaster, to the
CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) block. That'd require
updating the control file from postmaster, which'd be somewhat ugly.
Perhaps that indicates that field shouldn't be in pg_control, but in the
pid file?

Greetings,

Andres Freund

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 03:30:30

Andres Freund <andres@anarazel.de> writes:
> It'd be quite possible to address the race-condition by moving the
> updating of the control file to postmaster, to the
> CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) block. That'd require
> updating the control file from postmaster, which'd be somewhat ugly.

No, I don't like that at all.  Has race conditions against updates
coming from the startup process.

> Perhaps that indicates that field shouldn't be in pg_control, but in the
> pid file?

Yeah, that would be a different way to go at it.  The postmaster would
probably just write the state of the hot_standby GUC to the file, and
pg_ctl would have to infer things from there.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

27 June 2017, 03:34:14

On 2017-06-26 17:30:30 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > It'd be quite possible to address the race-condition by moving the
> > updating of the control file to postmaster, to the
> > CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) block. That'd require
> > updating the control file from postmaster, which'd be somewhat ugly.
> 
> No, I don't like that at all.  Has race conditions against updates
> coming from the startup process.

You'd obviously have to take the appropriate locks.  I think the issue
here is less race conditions, and more that architecturally we'd
interact with shmem too much.

> > Perhaps that indicates that field shouldn't be in pg_control, but in the
> > pid file?
> 
> Yeah, that would be a different way to go at it.  The postmaster would
> probably just write the state of the hot_standby GUC to the file, and
> pg_ctl would have to infer things from there.

I'd actually say we should just mirror the existing
#ifdef USE_SYSTEMD    if (!EnableHotStandby)        sd_notify(0, "READY=1");
#endif
with corresponding pidfile updates - doesn't really seem necessary for
pg_ctl to do more?

Greetings,

Andres Freund

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 03:38:03

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-26 17:30:30 -0400, Tom Lane wrote:
>> No, I don't like that at all.  Has race conditions against updates
>> coming from the startup process.

> You'd obviously have to take the appropriate locks.  I think the issue
> here is less race conditions, and more that architecturally we'd
> interact with shmem too much.

Uh, we are *not* taking any locks in the postmaster.

>> Yeah, that would be a different way to go at it.  The postmaster would
>> probably just write the state of the hot_standby GUC to the file, and
>> pg_ctl would have to infer things from there.

> I'd actually say we should just mirror the existing
> #ifdef USE_SYSTEMD
>         if (!EnableHotStandby)
>             sd_notify(0, "READY=1");
> #endif
> with corresponding pidfile updates - doesn't really seem necessary for
> pg_ctl to do more?

Hm.  Take that a bit further, and we could drop the connection probes
altogether --- just put the whole responsibility on the postmaster to
show in the pidfile whether it's ready for connections or not.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

27 June 2017, 03:50:18

On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-06-26 17:30:30 -0400, Tom Lane wrote:
> >> No, I don't like that at all.  Has race conditions against updates
> >> coming from the startup process.
> 
> > You'd obviously have to take the appropriate locks.  I think the issue
> > here is less race conditions, and more that architecturally we'd
> > interact with shmem too much.
> 
> Uh, we are *not* taking any locks in the postmaster.

I'm not sure why you're 'Uh'ing, when my my point pretty much is that we
do not want to do so?

> Hm.  Take that a bit further, and we could drop the connection probes
> altogether --- just put the whole responsibility on the postmaster to
> show in the pidfile whether it's ready for connections or not.

Yea, that seems quite appealing, both from an architectural, simplicity,
and log noise perspective. I wonder if there's some added reliability by
the connection probe though? Essentially wondering if it'd be worthwhile
to keep a single connection test at the end. I'm somewhat disinclined
though.

- Andres

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

27 June 2017, 04:13:56

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>> Hm.  Take that a bit further, and we could drop the connection probes
>> altogether --- just put the whole responsibility on the postmaster to
>> show in the pidfile whether it's ready for connections or not.

> Yea, that seems quite appealing, both from an architectural, simplicity,
> and log noise perspective. I wonder if there's some added reliability by
> the connection probe though? Essentially wondering if it'd be worthwhile
> to keep a single connection test at the end. I'm somewhat disinclined
> though.

I agree --- part of the appeal of this idea is that there could be a net
subtraction of code from pg_ctl.  (I think it wouldn't have to link libpq
anymore at all, though maybe I forgot something.)  And you get rid of a
bunch of can't-connect failure modes, eg kernel packet filter in the way,
which probably outweighs any hypothetical reliability gain from confirming
the postmaster state the old way.

This would mean that v10 pg_ctl could not be used to start/stop older
postmasters, which is flexibility we tried to preserve in the past.
But I see that we already gave that up in quite a few code paths because
of their attempts to read pg_control, so I think it's a concern we can
ignore.

I'll draft something up later.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

28 June 2017, 00:59:18

I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>>> Hm.  Take that a bit further, and we could drop the connection probes
>>> altogether --- just put the whole responsibility on the postmaster to
>>> show in the pidfile whether it's ready for connections or not.

>> Yea, that seems quite appealing, both from an architectural, simplicity,
>> and log noise perspective. I wonder if there's some added reliability by
>> the connection probe though? Essentially wondering if it'd be worthwhile
>> to keep a single connection test at the end. I'm somewhat disinclined
>> though.

> I agree --- part of the appeal of this idea is that there could be a net
> subtraction of code from pg_ctl.  (I think it wouldn't have to link libpq
> anymore at all, though maybe I forgot something.)  And you get rid of a
> bunch of can't-connect failure modes, eg kernel packet filter in the way,
> which probably outweighs any hypothetical reliability gain from confirming
> the postmaster state the old way.

Here's a draft patch for that.  I quite like the results --- this seems
way simpler and more reliable than what pg_ctl has done up to now.
However, it's certainly arguable that this is too much change for an
optional post-beta patch.  If we decide that it has to wait for v11,
I'd address Jeff's complaint by hacking the loop behavior in
test_postmaster_connection, which'd be ugly but not many lines of code.

Note that I followed the USE_SYSTEMD patch's lead as to where to report
postmaster state changes.  Arguably, in the standby-server case, we
should not report the postmaster is ready until we've reached consistency.
But that would require additional signaling from the startup process
to the postmaster, so it seems like a separate change if we want it.

Thoughts?

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2874f63..38f534f 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** PostmasterMain(int argc, char *argv[])
*** 1341,1346 ****
--- 1341,1354 ----
  #endif

      /*
+      * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
+      * see what's happening.  Note that all strings written to the status line
+      * must be the same length, per comments for AddToDataDirLockFile().  We
+      * currently make them all 8 bytes, padding with spaces.
+      */
+     AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "starting");
+
+     /*
       * We're ready to rock and roll...
       */
      StartupPID = StartupDataBase();
*************** pmdie(SIGNAL_ARGS)
*** 2608,2613 ****
--- 2616,2624 ----
              Shutdown = SmartShutdown;
              ereport(LOG,
                      (errmsg("received smart shutdown request")));
+
+             /* Report status */
+             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
              sd_notify(0, "STOPPING=1");
  #endif
*************** pmdie(SIGNAL_ARGS)
*** 2663,2668 ****
--- 2674,2682 ----
              Shutdown = FastShutdown;
              ereport(LOG,
                      (errmsg("received fast shutdown request")));
+
+             /* Report status */
+             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
              sd_notify(0, "STOPPING=1");
  #endif
*************** pmdie(SIGNAL_ARGS)
*** 2727,2732 ****
--- 2741,2749 ----
              Shutdown = ImmediateShutdown;
              ereport(LOG,
                      (errmsg("received immediate shutdown request")));
+
+             /* Report status */
+             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
              sd_notify(0, "STOPPING=1");
  #endif
*************** reaper(SIGNAL_ARGS)
*** 2872,2877 ****
--- 2889,2896 ----
              ereport(LOG,
                      (errmsg("database system is ready to accept connections")));

+             /* Report status */
+             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "ready   ");
  #ifdef USE_SYSTEMD
              sd_notify(0, "READY=1");
  #endif
*************** sigusr1_handler(SIGNAL_ARGS)
*** 5005,5014 ****
          if (XLogArchivingAlways())
              PgArchPID = pgarch_start();

! #ifdef USE_SYSTEMD
          if (!EnableHotStandby)
              sd_notify(0, "READY=1");
  #endif

          pmState = PM_RECOVERY;
      }
--- 5024,5041 ----
          if (XLogArchivingAlways())
              PgArchPID = pgarch_start();

!         /*
!          * If we aren't planning to enter hot standby mode later, treat
!          * RECOVERY_STARTED as meaning we're out of startup, and report status
!          * accordingly.
!          */
          if (!EnableHotStandby)
+         {
+             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "standby ");
+ #ifdef USE_SYSTEMD
              sd_notify(0, "READY=1");
  #endif
+         }

          pmState = PM_RECOVERY;
      }
*************** sigusr1_handler(SIGNAL_ARGS)
*** 5024,5029 ****
--- 5051,5058 ----
          ereport(LOG,
                  (errmsg("database system is ready to accept read only connections")));

+         /* Report status */
+         AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "ready   ");
  #ifdef USE_SYSTEMD
          sd_notify(0, "READY=1");
  #endif
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 49a6afa..216bcc7 100644
*** a/src/backend/utils/init/miscinit.c
--- b/src/backend/utils/init/miscinit.c
*************** TouchSocketLockFiles(void)
*** 1149,1156 ****
   *
   * Note: because we don't truncate the file, if we were to rewrite a line
   * with less data than it had before, there would be garbage after the last
!  * line.  We don't ever actually do that, so not worth adding another kernel
!  * call to cover the possibility.
   */
  void
  AddToDataDirLockFile(int target_line, const char *str)
--- 1149,1157 ----
   *
   * Note: because we don't truncate the file, if we were to rewrite a line
   * with less data than it had before, there would be garbage after the last
!  * line.  While we could fix that by adding a truncate call, that would make
!  * the file update non-atomic, which we'd rather avoid.  Therefore, callers
!  * should endeavor never to shorten a line once it's been written.
   */
  void
  AddToDataDirLockFile(int target_line, const char *str)
*************** AddToDataDirLockFile(int target_line, co
*** 1193,1211 ****
      srcptr = srcbuffer;
      for (lineno = 1; lineno < target_line; lineno++)
      {
!         if ((srcptr = strchr(srcptr, '\n')) == NULL)
!         {
!             elog(LOG, "incomplete data in \"%s\": found only %d newlines while trying to add line %d",
!                  DIRECTORY_LOCK_FILE, lineno - 1, target_line);
!             close(fd);
!             return;
!         }
!         srcptr++;
      }
      memcpy(destbuffer, srcbuffer, srcptr - srcbuffer);
      destptr = destbuffer + (srcptr - srcbuffer);

      /*
       * Write or rewrite the target line.
       */
      snprintf(destptr, destbuffer + sizeof(destbuffer) - destptr, "%s\n", str);
--- 1194,1219 ----
      srcptr = srcbuffer;
      for (lineno = 1; lineno < target_line; lineno++)
      {
!         char       *eol = strchr(srcptr, '\n');
!
!         if (eol == NULL)
!             break;                /* not enough lines in file yet */
!         srcptr = eol + 1;
      }
      memcpy(destbuffer, srcbuffer, srcptr - srcbuffer);
      destptr = destbuffer + (srcptr - srcbuffer);

      /*
+      * Fill in any missing lines before the target line, in case lines are
+      * added to the file out of order.
+      */
+     for (; lineno < target_line; lineno++)
+     {
+         if (destptr < destbuffer + sizeof(destbuffer))
+             *destptr++ = '\n';
+     }
+
+     /*
       * Write or rewrite the target line.
       */
      snprintf(destptr, destbuffer + sizeof(destbuffer) - destptr, "%s\n", str);
diff --git a/src/bin/pg_ctl/Makefile b/src/bin/pg_ctl/Makefile
index f5ec088..46f30bd 100644
*** a/src/bin/pg_ctl/Makefile
--- b/src/bin/pg_ctl/Makefile
*************** subdir = src/bin/pg_ctl
*** 16,29 ****
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global

- override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
-
  OBJS=    pg_ctl.o $(WIN32RES)

  all: pg_ctl

! pg_ctl: $(OBJS) | submake-libpq submake-libpgport
!     $(CC) $(CFLAGS) $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

  install: all installdirs
      $(INSTALL_PROGRAM) pg_ctl$(X) '$(DESTDIR)$(bindir)/pg_ctl$(X)'
--- 16,27 ----
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global

  OBJS=    pg_ctl.o $(WIN32RES)

  all: pg_ctl

! pg_ctl: $(OBJS) | submake-libpgport
!     $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

  install: all installdirs
      $(INSTALL_PROGRAM) pg_ctl$(X) '$(DESTDIR)$(bindir)/pg_ctl$(X)'
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index ad2a16f..81b276c 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 34,42 ****
  #include "catalog/pg_control.h"
  #include "common/controldata_utils.h"
  #include "getopt_long.h"
- #include "libpq-fe.h"
  #include "miscadmin.h"
- #include "pqexpbuffer.h"

  /* PID can be negative for standalone backend */
  typedef long pgpid_t;
--- 34,40 ----
*************** typedef enum
*** 49,54 ****
--- 47,58 ----
      IMMEDIATE_MODE
  } ShutdownMode;

+ typedef enum
+ {
+     POSTMASTER_READY,
+     POSTMASTER_STILL_STARTING,
+     POSTMASTER_FAILED
+ } WaitPMResult;

  typedef enum
  {
*************** static int    CreateRestrictedProcess(char
*** 147,158 ****
  #endif

  static pgpid_t get_pgpid(bool is_status_request);
! static char **readfile(const char *path);
  static void free_readfile(char **optlines);
  static pgpid_t start_postmaster(void);
  static void read_post_opts(void);

! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 151,162 ----
  #endif

  static pgpid_t get_pgpid(bool is_status_request);
! static char **readfile(const char *path, int *numlines);
  static void free_readfile(char **optlines);
  static pgpid_t start_postmaster(void);
  static void read_post_opts(void);

! static WaitPMResult wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);

  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*************** get_pgpid(bool is_status_request)
*** 304,312 ****

  /*
   * get the lines from a text file - return NULL if file can't be opened
   */
  static char **
! readfile(const char *path)
  {
      int            fd;
      int            nlines;
--- 308,319 ----

  /*
   * get the lines from a text file - return NULL if file can't be opened
+  *
+  * *numlines is set to the number of line pointers returned; there is
+  * also an additional NULL pointer after the last real line.
   */
  static char **
! readfile(const char *path, int *numlines)
  {
      int            fd;
      int            nlines;
*************** readfile(const char *path)
*** 318,323 ****
--- 325,332 ----
      int            len;
      struct stat statbuf;

+     *numlines = 0;                /* in case of failure or empty file */
+
      /*
       * Slurp the file into memory.
       *
*************** readfile(const char *path)
*** 367,372 ****
--- 376,382 ----

      /* set up the result buffer */
      result = (char **) pg_malloc((nlines + 1) * sizeof(char *));
+     *numlines = nlines;

      /* now split the buffer into lines */
      linebegin = buffer;
*************** start_postmaster(void)
*** 509,515 ****


  /*
!  * Find the pgport and try a connection
   *
   * On Unix, pm_pid is the PID of the just-launched postmaster.  On Windows,
   * it may be the PID of an ancestor shell process, so we can't check the
--- 519,525 ----


  /*
!  * Wait for the postmaster to become ready.
   *
   * On Unix, pm_pid is the PID of the just-launched postmaster.  On Windows,
   * it may be the PID of an ancestor shell process, so we can't check the
*************** start_postmaster(void)
*** 522,689 ****
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
! static PGPing
! test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
  {
-     PGPing        ret = PQPING_NO_RESPONSE;
-     char        connstr[MAXPGPATH * 2 + 256];
      int            i;

-     /* if requested wait time is zero, return "still starting up" code */
-     if (wait_seconds <= 0)
-         return PQPING_REJECT;
-
-     connstr[0] = '\0';
-
      for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
      {
!         /* Do we need a connection string? */
!         if (connstr[0] == '\0')
!         {
!             /*----------
!              * The number of lines in postmaster.pid tells us several things:
!              *
!              * # of lines
!              *        0    lock file created but status not written
!              *        2    pre-9.1 server, shared memory not created
!              *        3    pre-9.1 server, shared memory created
!              *        5    9.1+ server, ports not opened
!              *        6    9.1+ server, shared memory not created
!              *        7    9.1+ server, shared memory created
!              *
!              * This code does not support pre-9.1 servers.  On Unix machines
!              * we could consider extracting the port number from the shmem
!              * key, but that (a) is not robust, and (b) doesn't help with
!              * finding out the socket directory.  And it wouldn't work anyway
!              * on Windows.
!              *
!              * If we see less than 6 lines in postmaster.pid, just keep
!              * waiting.
!              *----------
!              */
!             char      **optlines;

!             /* Try to read the postmaster.pid file */
!             if ((optlines = readfile(pid_file)) != NULL &&
!                 optlines[0] != NULL &&
!                 optlines[1] != NULL &&
!                 optlines[2] != NULL)
!             {
!                 if (optlines[3] == NULL)
!                 {
!                     /* File is exactly three lines, must be pre-9.1 */
!                     write_stderr(_("\n%s: -w option is not supported when starting a pre-9.1 server\n"),
!                                  progname);
!                     return PQPING_NO_ATTEMPT;
!                 }
!                 else if (optlines[4] != NULL &&
!                          optlines[5] != NULL)
!                 {
!                     /* File is complete enough for us, parse it */
!                     pgpid_t        pmpid;
!                     time_t        pmstart;

!                     /*
!                      * Make sanity checks.  If it's for the wrong PID, or the
!                      * recorded start time is before pg_ctl started, then
!                      * either we are looking at the wrong data directory, or
!                      * this is a pre-existing pidfile that hasn't (yet?) been
!                      * overwritten by our child postmaster.  Allow 2 seconds
!                      * slop for possible cross-process clock skew.
!                      */
!                     pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
!                     pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!                     if (pmstart >= start_time - 2 &&
  #ifndef WIN32
!                         pmpid == pm_pid
  #else
!                     /* Windows can only reject standalone-backend PIDs */
!                         pmpid > 0
  #endif
!                         )
!                     {
!                         /*
!                          * OK, seems to be a valid pidfile from our child.
!                          */
!                         int            portnum;
!                         char       *sockdir;
!                         char       *hostaddr;
!                         char        host_str[MAXPGPATH];
!
!                         /*
!                          * Extract port number and host string to use. Prefer
!                          * using Unix socket if available.
!                          */
!                         portnum = atoi(optlines[LOCK_FILE_LINE_PORT - 1]);
!                         sockdir = optlines[LOCK_FILE_LINE_SOCKET_DIR - 1];
!                         hostaddr = optlines[LOCK_FILE_LINE_LISTEN_ADDR - 1];
!
!                         /*
!                          * While unix_socket_directories can accept relative
!                          * directories, libpq's host parameter must have a
!                          * leading slash to indicate a socket directory.  So,
!                          * ignore sockdir if it's relative, and try to use TCP
!                          * instead.
!                          */
!                         if (sockdir[0] == '/')
!                             strlcpy(host_str, sockdir, sizeof(host_str));
!                         else
!                             strlcpy(host_str, hostaddr, sizeof(host_str));
!
!                         /* remove trailing newline */
!                         if (strchr(host_str, '\n') != NULL)
!                             *strchr(host_str, '\n') = '\0';
!
!                         /* Fail if couldn't get either sockdir or host addr */
!                         if (host_str[0] == '\0')
!                         {
!                             write_stderr(_("\n%s: -w option cannot use a relative socket directory specification\n"),
!                                          progname);
!                             return PQPING_NO_ATTEMPT;
!                         }
!
!                         /*
!                          * Map listen-only addresses to counterparts usable
!                          * for establishing a connection.  connect() to "::"
!                          * or "0.0.0.0" is not portable to OpenBSD 5.0 or to
!                          * Windows Server 2008, and connect() to "::" is
!                          * additionally not portable to NetBSD 6.0.  (Cygwin
!                          * does handle both addresses, though.)
!                          */
!                         if (strcmp(host_str, "*") == 0)
!                             strcpy(host_str, "localhost");
!                         else if (strcmp(host_str, "0.0.0.0") == 0)
!                             strcpy(host_str, "127.0.0.1");
!                         else if (strcmp(host_str, "::") == 0)
!                             strcpy(host_str, "::1");

!                         /*
!                          * We need to set connect_timeout otherwise on Windows
!                          * the Service Control Manager (SCM) will probably
!                          * timeout first.
!                          */
!                         snprintf(connstr, sizeof(connstr),
!                                  "dbname=postgres port=%d host='%s' connect_timeout=5",
!                                  portnum, host_str);
!                     }
                  }
              }
-
-             /*
-              * Free the results of readfile.
-              *
-              * This is safe to call even if optlines is NULL.
-              */
-             free_readfile(optlines);
          }

!         /* If we have a connection string, ping the server */
!         if (connstr[0] != '\0')
!         {
!             ret = PQping(connstr);
!             if (ret == PQPING_OK || ret == PQPING_NO_ATTEMPT)
!                 break;
!         }

          /*
           * Check whether the child postmaster process is still alive.  This
--- 532,599 ----
   * Note that the checkpoint parameter enables a Windows service control
   * manager checkpoint, it's got nothing to do with database checkpoints!!
   */
! static WaitPMResult
! wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
  {
      int            i;

      for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
      {
!         char      **optlines;
!         int            numlines;

!         /*
!          * Try to read the postmaster.pid file.  If it's not valid, or if the
!          * status line isn't there yet, just keep waiting.
!          */
!         if ((optlines = readfile(pid_file, &numlines)) != NULL &&
!             numlines >= LOCK_FILE_LINE_PM_STATUS)
!         {
!             /* File is complete enough for us, parse it */
!             pgpid_t        pmpid;
!             time_t        pmstart;

!             /*
!              * Make sanity checks.  If it's for the wrong PID, or the recorded
!              * start time is before pg_ctl started, then either we are looking
!              * at the wrong data directory, or this is a pre-existing pidfile
!              * that hasn't (yet?) been overwritten by our child postmaster.
!              * Allow 2 seconds slop for possible cross-process clock skew.
!              */
!             pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
!             pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
!             if (pmstart >= start_time - 2 &&
  #ifndef WIN32
!                 pmpid == pm_pid
  #else
!             /* Windows can only reject standalone-backend PIDs */
!                 pmpid > 0
  #endif
!                 )
!             {
!                 /*
!                  * OK, seems to be a valid pidfile from our child.  Check the
!                  * status line (this assumes a v10 or later server).
!                  */
!                 char       *pmstatus = optlines[LOCK_FILE_LINE_PM_STATUS - 1];

!                 /* status line may be blank-padded */
!                 if (strncmp(pmstatus, "ready", 5) == 0 ||
!                     strncmp(pmstatus, "standby", 7) == 0)
!                 {
!                     /* postmaster is done starting up */
!                     free_readfile(optlines);
!                     return POSTMASTER_READY;
                  }
              }
          }

!         /*
!          * Free the results of readfile.
!          *
!          * This is safe to call even if optlines is NULL.
!          */
!         free_readfile(optlines);

          /*
           * Check whether the child postmaster process is still alive.  This
*************** test_postmaster_connection(pgpid_t pm_pi
*** 697,710 ****
              int            exitstatus;

              if (waitpid((pid_t) pm_pid, &exitstatus, WNOHANG) == (pid_t) pm_pid)
!                 return PQPING_NO_RESPONSE;
          }
  #else
          if (WaitForSingleObject(postmasterProcess, 0) == WAIT_OBJECT_0)
!             return PQPING_NO_RESPONSE;
  #endif

!         /* No response, or startup still in process; wait */
          if (i % WAITS_PER_SEC == 0)
          {
  #ifdef WIN32
--- 607,620 ----
              int            exitstatus;

              if (waitpid((pid_t) pm_pid, &exitstatus, WNOHANG) == (pid_t) pm_pid)
!                 return POSTMASTER_FAILED;
          }
  #else
          if (WaitForSingleObject(postmasterProcess, 0) == WAIT_OBJECT_0)
!             return POSTMASTER_FAILED;
  #endif

!         /* Startup still in process; wait, printing a dot once per second */
          if (i % WAITS_PER_SEC == 0)
          {
  #ifdef WIN32
*************** test_postmaster_connection(pgpid_t pm_pi
*** 729,736 ****
          pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
      }

!     /* return result of last call to PQping */
!     return ret;
  }


--- 639,646 ----
          pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
      }

!     /* out of patience; report that postmaster is still starting up */
!     return POSTMASTER_STILL_STARTING;
  }


*************** read_post_opts(void)
*** 764,777 ****
          if (ctl_command == RESTART_COMMAND)
          {
              char      **optlines;

!             optlines = readfile(postopts_file);
              if (optlines == NULL)
              {
                  write_stderr(_("%s: could not read file \"%s\"\n"), progname, postopts_file);
                  exit(1);
              }
!             else if (optlines[0] == NULL || optlines[1] != NULL)
              {
                  write_stderr(_("%s: option file \"%s\" must have exactly one line\n"),
                               progname, postopts_file);
--- 674,688 ----
          if (ctl_command == RESTART_COMMAND)
          {
              char      **optlines;
+             int            numlines;

!             optlines = readfile(postopts_file, &numlines);
              if (optlines == NULL)
              {
                  write_stderr(_("%s: could not read file \"%s\"\n"), progname, postopts_file);
                  exit(1);
              }
!             else if (numlines != 1)
              {
                  write_stderr(_("%s: option file \"%s\" must have exactly one line\n"),
                               progname, postopts_file);
*************** do_start(void)
*** 917,944 ****
      {
          print_msg(_("waiting for server to start..."));

!         switch (test_postmaster_connection(pm_pid, false))
          {
!             case PQPING_OK:
                  print_msg(_(" done\n"));
                  print_msg(_("server started\n"));
                  break;
!             case PQPING_REJECT:
                  print_msg(_(" stopped waiting\n"));
                  print_msg(_("server is still starting up\n"));
                  break;
!             case PQPING_NO_RESPONSE:
                  print_msg(_(" stopped waiting\n"));
                  write_stderr(_("%s: could not start server\n"
                                 "Examine the log output.\n"),
                               progname);
                  exit(1);
                  break;
-             case PQPING_NO_ATTEMPT:
-                 print_msg(_(" failed\n"));
-                 write_stderr(_("%s: could not wait for server because of misconfiguration\n"),
-                              progname);
-                 exit(1);
          }
      }
      else
--- 828,850 ----
      {
          print_msg(_("waiting for server to start..."));

!         switch (wait_for_postmaster(pm_pid, false))
          {
!             case POSTMASTER_READY:
                  print_msg(_(" done\n"));
                  print_msg(_("server started\n"));
                  break;
!             case POSTMASTER_STILL_STARTING:
                  print_msg(_(" stopped waiting\n"));
                  print_msg(_("server is still starting up\n"));
                  break;
!             case POSTMASTER_FAILED:
                  print_msg(_(" stopped waiting\n"));
                  write_stderr(_("%s: could not start server\n"
                                 "Examine the log output.\n"),
                               progname);
                  exit(1);
                  break;
          }
      }
      else
*************** do_status(void)
*** 1319,1329 ****
              {
                  char      **optlines;
                  char      **curr_line;

                  printf(_("%s: server is running (PID: %ld)\n"),
                         progname, pid);

!                 optlines = readfile(postopts_file);
                  if (optlines != NULL)
                  {
                      for (curr_line = optlines; *curr_line != NULL; curr_line++)
--- 1225,1236 ----
              {
                  char      **optlines;
                  char      **curr_line;
+                 int            numlines;

                  printf(_("%s: server is running (PID: %ld)\n"),
                         progname, pid);

!                 optlines = readfile(postopts_file, &numlines);
                  if (optlines != NULL)
                  {
                      for (curr_line = optlines; *curr_line != NULL; curr_line++)
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1634,1640 ****
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
--- 1541,1547 ----
      if (do_wait)
      {
          write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
!         if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
          {
              write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
              pgwin32_SetServiceStatus(SERVICE_STOPPED);
*************** pgwin32_ServiceMain(DWORD argc, LPTSTR *
*** 1655,1661 ****
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * test_postmaster_connection(), so it might not start from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;

--- 1562,1568 ----
              {
                  /*
                   * status.dwCheckPoint can be incremented by
!                  * wait_for_postmaster(), so it might not start from 0.
                   */
                  int            maxShutdownCheckPoint = status.dwCheckPoint + 12;

diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 21a7728..0a07a02 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern char *shared_preload_libraries_st
*** 432,438 ****
  extern char *local_preload_libraries_string;

  /*
!  * As of 9.1, the contents of the data-directory lock file are:
   *
   * line #
   *        1    postmaster PID (or negative of a standalone backend's PID)
--- 432,438 ----
  extern char *local_preload_libraries_string;

  /*
!  * As of Postgres 10, the contents of the data-directory lock file are:
   *
   * line #
   *        1    postmaster PID (or negative of a standalone backend's PID)
*************** extern char *local_preload_libraries_str
*** 441,452 ****
   *        4    port number
   *        5    first Unix socket directory path (empty if none)
   *        6    first listen_address (IP address or "*"; empty if no TCP port)
!  *        7    shared memory key (not present on Windows)
   *
   * Lines 6 and up are added via AddToDataDirLockFile() after initial file
!  * creation.
   *
!  * The socket lock file, if used, has the same contents as lines 1-5.
   */
  #define LOCK_FILE_LINE_PID            1
  #define LOCK_FILE_LINE_DATA_DIR        2
--- 441,455 ----
   *        4    port number
   *        5    first Unix socket directory path (empty if none)
   *        6    first listen_address (IP address or "*"; empty if no TCP port)
!  *        7    shared memory key (empty on Windows)
!  *        8    postmaster status ("starting", "stopping", "ready   ", "standby ")
   *
   * Lines 6 and up are added via AddToDataDirLockFile() after initial file
!  * creation; also, line 5 is initially empty and is changed after the first
!  * Unix socket is opened.
   *
!  * Socket lock file(s), if used, have the same contents as lines 1-5, with
!  * line 5 being their own directory.
   */
  #define LOCK_FILE_LINE_PID            1
  #define LOCK_FILE_LINE_DATA_DIR        2
*************** extern char *local_preload_libraries_str
*** 455,460 ****
--- 458,464 ----
  #define LOCK_FILE_LINE_SOCKET_DIR    5
  #define LOCK_FILE_LINE_LISTEN_ADDR    6
  #define LOCK_FILE_LINE_SHMEM_KEY    7
+ #define LOCK_FILE_LINE_PM_STATUS    8

  extern void CreateDataDirLockFile(bool amPostmaster);
  extern void CreateSocketLockFile(const char *socketfile, bool amPostmaster,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

28 June 2017, 02:05:44

Hi,

On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
> Here's a draft patch for that.  I quite like the results --- this seems
> way simpler and more reliable than what pg_ctl has done up to now.

Yea, I like that too.


> However, it's certainly arguable that this is too much change for an
> optional post-beta patch.

Yea, I think there's a valid case to be made for that. I'm still
inclined to go along with this, it seems we're otherwise just adding
complexity we're going to remove in a bit again.


> If we decide that it has to wait for v11,
> I'd address Jeff's complaint by hacking the loop behavior in
> test_postmaster_connection, which'd be ugly but not many lines of code.

Basically increasing the wait time over time?


> Note that I followed the USE_SYSTEMD patch's lead as to where to report
> postmaster state changes.  Arguably, in the standby-server case, we
> should not report the postmaster is ready until we've reached consistency.
> But that would require additional signaling from the startup process
> to the postmaster, so it seems like a separate change if we want it.

I suspect we're going to want to add more states to this over time, but
as you say, there's no need to hurry.


>      /*
> +     * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
> +     * see what's happening.  Note that all strings written to the status line
> +     * must be the same length, per comments for AddToDataDirLockFile().  We
> +     * currently make them all 8 bytes, padding with spaces.
> +     */
> +    AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "starting");

The 8-space thing in multiple places is a bit ugly.  How about having a
a bunch of constants declared in one place? Alternatively make it
something like: status: $c, where $c is a one character code for the
various states?


> @@ -1149,8 +1149,9 @@ TouchSocketLockFiles(void)
>   *
>   * Note: because we don't truncate the file, if we were to rewrite a line
>   * with less data than it had before, there would be garbage after the last
> - * line.  We don't ever actually do that, so not worth adding another kernel
> - * call to cover the possibility.
> + * line.  While we could fix that by adding a truncate call, that would make
> + * the file update non-atomic, which we'd rather avoid.  Therefore, callers
> + * should endeavor never to shorten a line once it's been written.
>   */
>  void
>  AddToDataDirLockFile(int target_line, const char *str)
> @@ -1193,19 +1194,26 @@ AddToDataDirLockFile(int target_line, co
>      srcptr = srcbuffer;
>      for (lineno = 1; lineno < target_line; lineno++)
>      {
> -        if ((srcptr = strchr(srcptr, '\n')) == NULL)
> -        {
> -            elog(LOG, "incomplete data in \"%s\": found only %d newlines while trying to add line %d",
> -                 DIRECTORY_LOCK_FILE, lineno - 1, target_line);
> -            close(fd);
> -            return;
> -        }
> -        srcptr++;
> +        char       *eol = strchr(srcptr, '\n');
> +
> +        if (eol == NULL)
> +            break;                /* not enough lines in file yet */
> +        srcptr = eol + 1;
>      }
>      memcpy(destbuffer, srcbuffer, srcptr - srcbuffer);
>      destptr = destbuffer + (srcptr - srcbuffer);
>  
>      /*
> +     * Fill in any missing lines before the target line, in case lines are
> +     * added to the file out of order.
> +     */
> +    for (; lineno < target_line; lineno++)
> +    {
> +        if (destptr < destbuffer + sizeof(destbuffer))
> +            *destptr++ = '\n';
> +    }
> +
> +    /*
>       * Write or rewrite the target line.
>       */
>      snprintf(destptr, destbuffer + sizeof(destbuffer) - destptr,
> "%s\n", str);

Not this patches fault, and shouldn't be changed now, but this is a
fairly weird way to manage and update this file.

Greetings,

Andres Freund

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

28 June 2017, 02:26:15

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>> If we decide that it has to wait for v11,
>> I'd address Jeff's complaint by hacking the loop behavior in
>> test_postmaster_connection, which'd be ugly but not many lines of code.

> Basically increasing the wait time over time?

I was thinking of just dropping back to once-per-second after a couple
of seconds, with something along the lines of this in place of the
existing sleep at the bottom of the loop:
if (i >= 2 * WAITS_PER_SEC){    pg_usleep(USEC_PER_SEC);    i += WAITS_PER_SEC - 1;}else    pg_usleep(USEC_PER_SEC /
WAITS_PER_SEC);


> The 8-space thing in multiple places is a bit ugly.  How about having a
> a bunch of constants declared in one place? Alternatively make it
> something like: status: $c, where $c is a one character code for the
> various states?

Yeah, we could add the string values as macros in miscadmin.h, perhaps.
I don't like the single-character idea --- if we do expand the number of
things reported this way, that could get restrictive.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

28 June 2017, 23:31:27

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>> However, it's certainly arguable that this is too much change for an
>> optional post-beta patch.

> Yea, I think there's a valid case to be made for that. I'm still
> inclined to go along with this, it seems we're otherwise just adding
> complexity we're going to remove in a bit again.

I'm not hearing anyone speaking against doing this now, so I'm going
to go ahead with it.

> The 8-space thing in multiple places is a bit ugly.  How about having a
> a bunch of constants declared in one place?

While looking this over again, I got worried about the fact that pg_ctl
is #including "miscadmin.h".  That's a pretty low-level backend header
and it wouldn't be surprising at all if somebody tried to put stuff in
it that wouldn't compile frontend-side.  I think we should take the
opportunity, as long as we're touching this stuff, to split the #defines
that describe the contents of postmaster.pid into a separate header file.
Maybe "utils/pidfile.h" ?
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

28 June 2017, 23:34:57

On 2017-06-28 13:31:27 -0400, Tom Lane wrote:
> I'm not hearing anyone speaking against doing this now, so I'm going
> to go ahead with it.

Cool.


> While looking this over again, I got worried about the fact that pg_ctl
> is #including "miscadmin.h".  That's a pretty low-level backend header
> and it wouldn't be surprising at all if somebody tried to put stuff in
> it that wouldn't compile frontend-side.  I think we should take the
> opportunity, as long as we're touching this stuff, to split the #defines
> that describe the contents of postmaster.pid into a separate header file.
> Maybe "utils/pidfile.h" ?

Yes, that sounds like a valid concern, and solution.

- Andres

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

29 June 2017, 00:29:07

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-28 13:31:27 -0400, Tom Lane wrote:
>> While looking this over again, I got worried about the fact that pg_ctl
>> is #including "miscadmin.h".  That's a pretty low-level backend header
>> and it wouldn't be surprising at all if somebody tried to put stuff in
>> it that wouldn't compile frontend-side.  I think we should take the
>> opportunity, as long as we're touching this stuff, to split the #defines
>> that describe the contents of postmaster.pid into a separate header file.
>> Maybe "utils/pidfile.h" ?

> Yes, that sounds like a valid concern, and solution.

So when I removed the miscadmin.h include, I found out that pg_ctl is
also relying on PG_BACKEND_VERSIONSTR from that file.

There are at least three things we could do here:

1. Give this up as not worth this much trouble.

2. Move PG_BACKEND_VERSIONSTR into pg_config.h to go along with the
other version-related macros.

3. Give PG_BACKEND_VERSIONSTR its very own new header file.

Any preferences?
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Alvaro Herrera

Date:

29 June 2017, 01:44:25

Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:

> So when I removed the miscadmin.h include, I found out that pg_ctl is
> also relying on PG_BACKEND_VERSIONSTR from that file.
> 
> There are at least three things we could do here:
> 
> 1. Give this up as not worth this much trouble.
> 
> 2. Move PG_BACKEND_VERSIONSTR into pg_config.h to go along with the
> other version-related macros.

pg_config.h sounds like a decent enough solution.  It's a bit strange
this hasn't come up before, given that that symbol is used more in
frontend environ than backend.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

29 June 2017, 02:02:03

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> So when I removed the miscadmin.h include, I found out that pg_ctl is
>> also relying on PG_BACKEND_VERSIONSTR from that file.
>> 
>> There are at least three things we could do here:
>> 
>> 1. Give this up as not worth this much trouble.
>> 
>> 2. Move PG_BACKEND_VERSIONSTR into pg_config.h to go along with the
>> other version-related macros.

> pg_config.h sounds like a decent enough solution.  It's a bit strange
> this hasn't come up before, given that that symbol is used more in
> frontend environ than backend.

Right at the moment, what I've done is to stick it into port.h beside
the declaration of find_other_exec, since the existing uses are all
as parameters of find_other_exec[_or_die].  But maybe that's a bit too
expedient.
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Ants Aasma

Date:

29 June 2017, 07:24:00

On Wed, Jun 28, 2017 at 8:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>>> However, it's certainly arguable that this is too much change for an
>>> optional post-beta patch.
>
>> Yea, I think there's a valid case to be made for that. I'm still
>> inclined to go along with this, it seems we're otherwise just adding
>> complexity we're going to remove in a bit again.
>
> I'm not hearing anyone speaking against doing this now, so I'm going
> to go ahead with it.

+1 for going for it. Besides pg_ctl it would also help cluster
management software. In Patroni we basically reimplement pg_ctl to get
at the started postmaster pid to detect a crash during postmaster
startup earlier and to be able to reliably cancel start. Getting the
current state from the pid file sounds better than having a loop poke
the server with pg_isready.

To introduce feature creep, I would like to see more detailed states
than proposed here. Specifically I'm interested in knowing when
PM_WAIT_BACKENDS ends.

When we lose quorum we restart PostgreSQL as a standby. We use a
regular fast shutdown, but that can take a long time due to the
shutdown checkpoint. The leader lease may run out during this time so
we would have to escalate to immediate shutdown or have a watchdog
fence the node. If we knew that no user backends are left we can let
the shutdown checkpoint complete at leisure without risk for split
brain.

Regards,
Ants Aasma

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Jeff Janes

Date:

29 June 2017, 23:19:46

On Tue, Jun 27, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>>> Hm. Take that a bit further, and we could drop the connection probes
>>> altogether --- just put the whole responsibility on the postmaster to
>>> show in the pidfile whether it's ready for connections or not.

>> Yea, that seems quite appealing, both from an architectural, simplicity,
>> and log noise perspective. I wonder if there's some added reliability by
>> the connection probe though? Essentially wondering if it'd be worthwhile
>> to keep a single connection test at the end. I'm somewhat disinclined
>> though.

> I agree --- part of the appeal of this idea is that there could be a net
> subtraction of code from pg_ctl. (I think it wouldn't have to link libpq
> anymore at all, though maybe I forgot something.) And you get rid of a
> bunch of can't-connect failure modes, eg kernel packet filter in the way,
> which probably outweighs any hypothetical reliability gain from confirming
> the postmaster state the old way.

Here's a draft patch for that. I quite like the results --- this seems
way simpler and more reliable than what pg_ctl has done up to now.
However, it's certainly arguable that this is too much change for an
optional post-beta patch.

In the now-committed version of this, the 'pg_ctl start' returns successfully as soon as the server reaches a consistent state. Which is OK, except that it does the same thing when hot_standby=off. When hot_standby=off, I would expect it to wait for the end of recovery before exiting with a success code.

Cheers,

Jeff

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Andres Freund

Date:

29 June 2017, 23:53:13


On June 29, 2017 10:19:46 AM PDT, Jeff Janes <jeff.janes@gmail.com> wrote:
>On Tue, Jun 27, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> I wrote:
>> > Andres Freund <andres@anarazel.de> writes:
>> >> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>> >>> Hm.  Take that a bit further, and we could drop the connection
>probes
>> >>> altogether --- just put the whole responsibility on the
>postmaster to
>> >>> show in the pidfile whether it's ready for connections or not.
>>
>> >> Yea, that seems quite appealing, both from an architectural,
>simplicity,
>> >> and log noise perspective. I wonder if there's some added
>reliability by
>> >> the connection probe though? Essentially wondering if it'd be
>worthwhile
>> >> to keep a single connection test at the end. I'm somewhat
>disinclined
>> >> though.
>>
>> > I agree --- part of the appeal of this idea is that there could be
>a net
>> > subtraction of code from pg_ctl.  (I think it wouldn't have to link
>libpq
>> > anymore at all, though maybe I forgot something.)  And you get rid
>of a
>> > bunch of can't-connect failure modes, eg kernel packet filter in
>the way,
>> > which probably outweighs any hypothetical reliability gain from
>> confirming
>> > the postmaster state the old way.
>>
>> Here's a draft patch for that.  I quite like the results --- this
>seems
>> way simpler and more reliable than what pg_ctl has done up to now.
>> However, it's certainly arguable that this is too much change for an
>> optional post-beta patch.
>
>
>In the now-committed version of this, the 'pg_ctl start' returns
>successfully as soon as the server reaches a consistent state. Which is
>OK,
>except that it does the same thing when hot_standby=off.  When
>hot_standby=off, I would expect it to wait for the end of recovery
>before
>exiting with a success code.

I've not looked at the committed version, but earlier versions had code dealing with that difference; essentially doing
whatyou suggest. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Tom Lane

Date:

30 June 2017, 00:39:09

Jeff Janes <jeff.janes@gmail.com> writes:
> In the now-committed version of this, the 'pg_ctl start' returns
> successfully as soon as the server reaches a consistent state. Which is OK,
> except that it does the same thing when hot_standby=off.  When
> hot_standby=off, I would expect it to wait for the end of recovery before
> exiting with a success code.

Um, won't it be waiting forever with that definition?
        regards, tom lane

Re: [HACKERS] Reducing pg_ctl's reaction time

From

Jeff Janes

Date:

30 June 2017, 00:49:43

On Thu, Jun 29, 2017 at 11:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:
> In the now-committed version of this, the 'pg_ctl start' returns
> successfully as soon as the server reaches a consistent state. Which is OK,
> except that it does the same thing when hot_standby=off. When
> hot_standby=off, I would expect it to wait for the end of recovery before
> exiting with a success code.

Um, won't it be waiting forever with that definition?

regards, tom lane

No, this isn't streaming. It hits the PITR limit (recovery_target_*), or runs out of archived wal, and then it opens for business.

Cheers,

Jeff