Thread: master in standby mode croaks

master in standby mode croaks

From
Robert Haas
Date:
I discovered tonight that if you shut down a server, create
recovery.conf with standby_mode = 'on', and start it back up again,
you get this:

LOG:  database system was shut down at 2010-03-30 22:34:09 EDT
LOG:  entering standby mode
FATAL:  recovery connections cannot start because the
recovery_connections parameter is disabled on the WAL source server
LOG:  startup process (PID 22980) exited with exit code 1
LOG:  aborting startup due to startup process failure

Now, you might certainly argue that this is a stupid thing to do (my
motivation was to test some stuff) but certainly it's fair to say that
error message is darn misleading, since in fact recovery_connections
was NOT disabled.  I believe this is the same "start up from a shut
down checkpoint" problem that's been discussed previously so I won't
belabor the point other than to say that I still think we need to fix
this.

...Robert


Re: master in standby mode croaks

From
Simon Riggs
Date:
On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
> I discovered tonight that if you shut down a server, create
> recovery.conf with standby_mode = 'on', and start it back up again,
> you get this:
> 
> LOG:  database system was shut down at 2010-03-30 22:34:09 EDT
> LOG:  entering standby mode
> FATAL:  recovery connections cannot start because the
> recovery_connections parameter is disabled on the WAL source server
> LOG:  startup process (PID 22980) exited with exit code 1
> LOG:  aborting startup due to startup process failure
> 
> Now, you might certainly argue that this is a stupid thing to do (my
> motivation was to test some stuff) but certainly it's fair to say that
> error message is darn misleading, since in fact recovery_connections
> was NOT disabled.  I believe this is the same "start up from a shut
> down checkpoint" problem that's been discussed previously so I won't
> belabor the point other than to say that 

I don't think it is the same thing at all. This is a separate error and
should be rejected as such. 

> I still think we need to fix this.

Agreed, as a separate issue.

-- Simon Riggs           www.2ndQuadrant.com



Re: master in standby mode croaks

From
Robert Haas
Date:
On Apr 1, 2010, at 7:06 PM, Simon Riggs <simon@2ndQuadrant.com> wrote:
> On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
>> I discovered tonight that if you shut down a server, create
>> recovery.conf with standby_mode = 'on', and start it back up again,
>> you get this:
>>
>> LOG:  database system was shut down at 2010-03-30 22:34:09 EDT
>> LOG:  entering standby mode
>> FATAL:  recovery connections cannot start because the
>> recovery_connections parameter is disabled on the WAL source server
>> LOG:  startup process (PID 22980) exited with exit code 1
>> LOG:  aborting startup due to startup process failure
>>
>> Now, you might certainly argue that this is a stupid thing to do (my
>> motivation was to test some stuff) but certainly it's fair to say that
>> error message is darn misleading, since in fact recovery_connections
>> was NOT disabled.  I believe this is the same "start up from a shut
>> down checkpoint" problem that's been discussed previously so I won't
>> belabor the point other than to say that
>
> I don't think it is the same thing at all. This is a separate error and
> should be rejected as such.
>
>> I still think we need to fix this.
>
> Agreed, as a separate issue.

OK, fair enough. I admit I didn't investigate what was causing this.

...Robert

Re: master in standby mode croaks

From
Simon Riggs
Date:
On Fri, 2010-04-02 at 04:51 -0400, Robert Haas wrote:
> On Apr 1, 2010, at 7:06 PM, Simon Riggs <simon@2ndQuadrant.com> wrote:
> > On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
> >> I discovered tonight that if you shut down a server, create
> >> recovery.conf with standby_mode = 'on', and start it back up again,
> >> you get this:
> >>
> >> LOG:  database system was shut down at 2010-03-30 22:34:09 EDT
> >> LOG:  entering standby mode
> >> FATAL:  recovery connections cannot start because the
> >> recovery_connections parameter is disabled on the WAL source server
> >> LOG:  startup process (PID 22980) exited with exit code 1
> >> LOG:  aborting startup due to startup process failure
> >>
> >> Now, you might certainly argue that this is a stupid thing to do (my
> >> motivation was to test some stuff) but certainly it's fair to say that
> >> error message is darn misleading, since in fact recovery_connections
> >> was NOT disabled.  I believe this is the same "start up from a shut
> >> down checkpoint" problem that's been discussed previously so I won't
> >> belabor the point other than to say that
> >
> > I don't think it is the same thing at all. This is a separate error and
> > should be rejected as such.

I can't duplicate this error based upon what you have said.

With just standby_mode = 'on' the standby just waits forever, with a ps
message set to 
postgres: startup process   waiting for 000000010000000000000000

That's not very good, but it isn't the error you describe.

-- Simon Riggs           www.2ndQuadrant.com



Re: master in standby mode croaks

From
Robert Haas
Date:
On Fri, Apr 2, 2010 at 5:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I can't duplicate this error based upon what you have said.

I fooled around with this some more and I think I know what's going
on.  The error message I received was:

recovery connections cannot start because the recovery_connections
parameter is disabled on the WAL source server

This is generated when !checkPoint.XLogStandbyInfoMode.  That, in
turn, is set on the master to the results of XLogStandbyInfoActive(),
which is defined as XLogRequestRecoveryConnections && XLogIsNeeded().
XLogIsNeeded() is defined as XLogArchivingActive() || (max_wal_senders
> 0), and XLogArchivingActive() is defined as XLogArchiveMode.  So
when you expand it all out, this error message gets triggered when the
following condition does not hold on the master:

XLogRequestRecoveryConnections && (XLogArchiveMode || (max_wal_senders > 0))

So this can fail in either of two ways: (1)
XLogRequestRecoveryConnections (aka recovery_connections) might be
false, which is the situation described in the error message, or (2)
XLogArchiveMode (archive_mode) might be false and at the same time
max_wal_senders might be zero.  As it happens, the default
configuration of the system is recovery_connections = true,
archive_mode = false, max_wal_senders = 0, so with an out-of-the-box
config it fails for the reason that isn't the one described in the
error message.
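
To make the expansion above concrete, here is a minimal sketch in C. The
struct and function names are hypothetical stand-ins for illustration, not
the actual PostgreSQL macros; only the boolean logic is taken from the
description above.

```c
#include <stdbool.h>

/* Hypothetical container for the three master-side settings discussed. */
typedef struct MasterSettings
{
    bool recovery_connections;  /* XLogRequestRecoveryConnections */
    bool archive_mode;          /* XLogArchiveMode */
    int  max_wal_senders;
} MasterSettings;

/* Mirrors XLogIsNeeded() as described: archiving on, or any WAL senders. */
static bool
xlog_is_needed(const MasterSettings *s)
{
    return s->archive_mode || s->max_wal_senders > 0;
}

/* Mirrors XLogStandbyInfoActive() as described. */
static bool
xlog_standby_info_active(const MasterSettings *s)
{
    return s->recovery_connections && xlog_is_needed(s);
}
```

With the out-of-the-box values (recovery_connections = true, archive_mode =
false, max_wal_senders = 0) the second function returns false even though
recovery_connections is enabled, which is the case the error message
misdescribes.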

One possible approach here is to improve the error message, but it
seems to me that having the ability of Hot Standby to run on the slave
partially controlled by three different GUCs is awfully complicated.
I think the root of the problem here is that recovery_connections
controls one behavior on the primary (whether or not we WAL-log
certain information needed for HS) and a completely unrelated behavior
on the standby (whether or not we try to allow read-only backends into
the system).  In 8.4 and prior, it was always the job of archive_mode
to decide whether WAL-logging was needed.  Maybe we should go back to
that and make it an enum:

wal_mode = {standby | archive | off}
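
A rough sketch of how such an enum might look in C follows. Everything here
is hypothetical: the type, the function names, and the assumption that "off"
means only enough WAL for crash recovery and that each level includes what
the previous one logs. It is meant only to show that one setting could
answer both questions that currently take three GUCs.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum for the proposed wal_mode setting, ordered so that
 * each level logs at least as much as the one before it (an assumption). */
typedef enum WalMode
{
    WAL_MODE_OFF = 0,
    WAL_MODE_ARCHIVE,
    WAL_MODE_STANDBY
} WalMode;

/* Parse a setting string; returns false if the value is not recognized. */
static bool
parse_wal_mode(const char *value, WalMode *result)
{
    if (strcmp(value, "off") == 0)
        *result = WAL_MODE_OFF;
    else if (strcmp(value, "archive") == 0)
        *result = WAL_MODE_ARCHIVE;
    else if (strcmp(value, "standby") == 0)
        *result = WAL_MODE_STANDBY;
    else
        return false;
    return true;
}

/* Would play the role of XLogIsNeeded(). */
static bool
wal_mode_is_needed(WalMode mode)
{
    return mode >= WAL_MODE_ARCHIVE;
}

/* Would play the role of XLogStandbyInfoActive(). */
static bool
wal_mode_standby_info(WalMode mode)
{
    return mode == WAL_MODE_STANDBY;
}
```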

...Robert


Re: master in standby mode croaks

From
Simon Riggs
Date:
On Sat, 2010-04-10 at 09:02 -0400, Robert Haas wrote:

> So this can fail in either of two ways

If I understand this correctly, it is unconvincing as a failure mode
since it doesn't follow any of the documented procedures for creating a
standby. There are many ways to screw up that ignore the manual, which
is why the manual exists.

If you can show a full test case, with failure, then I'll follow it
through.

-- Simon Riggs           www.2ndQuadrant.com



Re: master in standby mode croaks

From
Robert Haas
Date:
On Wed, Apr 14, 2010 at 4:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2010-04-10 at 09:02 -0400, Robert Haas wrote:
>
>> So this can fail in either of two ways
>
> If I understand this correctly, it is unconvincing as a failure mode
> since it doesn't follow any of the documented procedures for creating a
> standby. There are many ways to screw up that ignore the manual, which
> is why the manual exists.
>
> If you can show a full test case, with failure, then I'll follow it
> through.

Huh?  If I had done everything correctly, of course I wouldn't have
gotten an error message at all.  Surely the point is that if I do
something wrong, I should get an error message that describes what I
actually did wrong rather than an error message telling me that I did
something wrong which I clearly did not do.

The recent patch to allow starting from a shutdown checkpoint means
that a standby can be created by shutting down the master and taking a
filesystem-level snapshot of the cluster directly, creating
recovery.conf, and firing it up again.  Anyone who does that with the
default postgresql.conf, though, is going to get a message telling
them that they need to change a setting which is already set
correctly.

...Robert


Re: master in standby mode croaks

From
Simon Riggs
Date:
On Wed, 2010-04-14 at 07:07 -0400, Robert Haas wrote:
> On Wed, Apr 14, 2010 at 4:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Sat, 2010-04-10 at 09:02 -0400, Robert Haas wrote:
> >
> >> So this can fail in either of two ways
> >
> > If I understand this correctly, it is unconvincing as a failure mode
> > since it doesn't follow any of the documented procedures for creating a
> > standby. There are many ways to screw up that ignore the manual, which
> > is why the manual exists.
> >
> > If you can show a full test case, with failure, then I'll follow it
> > through.
> 
> Huh?  If I had done everything correctly, of course I wouldn't have
> gotten an error message at all.  Surely the point is that if I do
> something wrong, I should get an error message that describes what I
> actually did wrong rather than an error message telling me that I did
> something wrong which I clearly did not do.

I will change the error message.

> The recent patch to allow starting from a shutdown checkpoint means
> that a standby can be created by shutting down the master and taking a
> filesystem-level snapshot of the cluster directly, creating
> recovery.conf, and firing it up again.  Anyone who does that with the
> default postgresql.conf, though, is going to get a message telling
> them that they need to change a setting which is already set
> correctly.

Why would they do that? I would never claim this supports all use cases,
just the sensible ones.

-- Simon Riggs           www.2ndQuadrant.com



Re: master in standby mode croaks

From
Robert Haas
Date:
On Wed, Apr 14, 2010 at 7:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-04-14 at 07:07 -0400, Robert Haas wrote:
>> On Wed, Apr 14, 2010 at 4:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > On Sat, 2010-04-10 at 09:02 -0400, Robert Haas wrote:
>> >
>> >> So this can fail in either of two ways
>> >
>> > If I understand this correctly, it is unconvincing as a failure mode
>> > since it doesn't follow any of the documented procedures for creating a
>> > standby. There are many ways to screw up that ignore the manual, which
>> > is why the manual exists.
>> >
>> > If you can show a full test case, with failure, then I'll follow it
>> > through.
>>
>> Huh?  If I had done everything correctly, of course I wouldn't have
>> gotten an error message at all.  Surely the point is that if I do
>> something wrong, I should get an error message that describes what I
>> actually did wrong rather than an error message telling me that I did
>> something wrong which I clearly did not do.
>
> I will change the error message.

I gave a good deal of thought to trying to figure out a cleaner
solution to this problem than just changing the error message and
failed.  So let's change the error message.  Of course I'm not quite
sure what we should change it TO, given that the situation is the
result of an interaction between three different GUCs and we have no
way to distinguish which one(s) are the problem.

...Robert


Re: master in standby mode croaks

From
Simon Riggs
Date:
On Sat, 2010-04-17 at 17:44 -0400, Robert Haas wrote:

> > I will change the error message.
> 
> I gave a good deal of thought to trying to figure out a cleaner
> solution to this problem than just changing the error message and
> failed.  So let's change the error message.  Of course I'm not quite
> sure what we should change it TO, given that the situation is the
> result of an interaction between three different GUCs and we have no
> way to distinguish which one(s) are the problem.

"You need all three" covers it. 

-- Simon Riggs           www.2ndQuadrant.com



Re: master in standby mode croaks

From
Robert Haas
Date:
On Sat, Apr 17, 2010 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2010-04-17 at 17:44 -0400, Robert Haas wrote:
>
>> > I will change the error message.
>>
>> I gave a good deal of thought to trying to figure out a cleaner
>> solution to this problem than just changing the error message and
>> failed.  So let's change the error message.  Of course I'm not quite
>> sure what we should change it TO, given that the situation is the
>> result of an interaction between three different GUCs and we have no
>> way to distinguish which one(s) are the problem.
>
> "You need all three" covers it.

Actually you need recovery_connections and either archive_mode=on or
max_wal_senders>0, I think.

...Robert


Re: master in standby mode croaks

From
Fujii Masao
Date:
On Sun, Apr 18, 2010 at 7:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Apr 17, 2010 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Sat, 2010-04-17 at 17:44 -0400, Robert Haas wrote:
>>
>>> > I will change the error message.
>>>
>>> I gave a good deal of thought to trying to figure out a cleaner
>>> solution to this problem than just changing the error message and
>>> failed.  So let's change the error message.  Of course I'm not quite
>>> sure what we should change it TO, given that the situation is the
>>> result of an interaction between three different GUCs and we have no
>>> way to distinguish which one(s) are the problem.
>>
>> "You need all three" covers it.
>
> Actually you need recovery_connections and either archive_mode=on or
> max_wal_senders>0, I think.

Right.

First of all, I wonder why the latter two need to affect the decision of
whether additional information is written to WAL for HS. How about just
removing XLogIsNeeded() condition from XLogStandbyInfoActive()?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: master in standby mode croaks

From
Robert Haas
Date:
On Sun, Apr 18, 2010 at 9:58 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sun, Apr 18, 2010 at 7:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Apr 17, 2010 at 6:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On Sat, 2010-04-17 at 17:44 -0400, Robert Haas wrote:
>>>
>>>> > I will change the error message.
>>>>
>>>> I gave a good deal of thought to trying to figure out a cleaner
>>>> solution to this problem than just changing the error message and
>>>> failed.  So let's change the error message.  Of course I'm not quite
>>>> sure what we should change it TO, given that the situation is the
>>>> result of an interaction between three different GUCs and we have no
>>>> way to distinguish which one(s) are the problem.
>>>
>>> "You need all three" covers it.
>>
>> Actually you need recovery_connections and either archive_mode=on or
>> max_wal_senders>0, I think.
>
> Right.
>
> First of all, I wonder why the latter two need to affect the decision of
> whether additional information is written to WAL for HS. How about just
> removing XLogIsNeeded() condition from XLogStandbyInfoActive()?

Bad idea, I think.  If XLogIsNeeded() is returning false and
XLogStandbyInfoActive() is returning true, the resulting WAL will
still be unusable for HS, at least AIUI.

...Robert


Re: master in standby mode croaks

From
Fujii Masao
Date:
On Mon, Apr 19, 2010 at 11:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> First of all, I wonder why the latter two need to affect the decision of
>> whether additional information is written to WAL for HS. How about just
>> removing XLogIsNeeded() condition from XLogStandbyInfoActive()?
>
> Bad idea, I think.  If XLogIsNeeded() is returning false and
> XLogStandbyInfoActive() is returning true, the resulting WAL will
> still be unusable for HS, at least AIUI.

Probably not. Such WAL will be usable for HS unless an unlogged
operation (e.g., CLUSTER, CREATE TABLE AS SELECT, etc.) happens.
I think that the occurrence of an unlogged operation, rather than
XLogIsNeeded() itself, is what must be checked in the standby, and it
is already being checked. So just removing the XLogIsNeeded() condition
makes sense to me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: master in standby mode croaks

From
Robert Haas
Date:
On Mon, Apr 19, 2010 at 5:31 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Apr 19, 2010 at 11:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> First of all, I wonder why the latter two need to affect the decision of
>>> whether additional information is written to WAL for HS. How about just
>>> removing XLogIsNeeded() condition from XLogStandbyInfoActive()?
>>
>> Bad idea, I think.  If XLogIsNeeded() is returning false and
>> XLogStandbyInfoActive() is returning true, the resulting WAL will
>> still be unusable for HS, at least AIUI.
>
> Probably not. Such WAL will be usable for HS unless an unlogged
> operation (e.g., CLUSTER, CREATE TABLE AS SELECT, etc.) happens.
> I think that the occurrence of an unlogged operation, rather than
> XLogIsNeeded() itself, is what must be checked in the standby, and it
> is already being checked. So just removing the XLogIsNeeded() condition
> makes sense to me.

I think that's a bad idea.  Currently we have three possible types of
WAL-logging:

- just enough for crash recovery (archive_mode=off and max_wal_senders=0)
- enough for log-shipping replication (archive_mode=on or
max_wal_senders>0, but recovery_connections=off)
- enough for log-shipping replication + hot standby (archive_mode=on
or max_wal_senders>0, plus recovery_connections=on)

I'm not eager to add a fourth category where hot standby works unless
you do any of the things that break log-streaming in general.  That
seems hopelessly fragile and also fairly pointless.
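
For illustration, the three categories above can be written as a small
classification function. This is only a sketch: the enum and function names
are hypothetical, and the mapping is taken directly from the list above
rather than from the PostgreSQL source.

```c
#include <stdbool.h>

/* Hypothetical names for the three kinds of WAL-logging listed above. */
typedef enum WalLoggingLevel
{
    WAL_LEVEL_CRASH_RECOVERY,   /* just enough for crash recovery */
    WAL_LEVEL_LOG_SHIPPING,     /* enough for log-shipping replication */
    WAL_LEVEL_LOG_SHIPPING_HS   /* log shipping plus hot standby */
} WalLoggingLevel;

/* Map the three settings onto exactly one of the three categories. */
static WalLoggingLevel
classify_wal_logging(bool archive_mode, int max_wal_senders,
                     bool recovery_connections)
{
    /* Neither archiving nor WAL senders: crash recovery only, whatever
     * recovery_connections says. */
    if (!archive_mode && max_wal_senders <= 0)
        return WAL_LEVEL_CRASH_RECOVERY;

    return recovery_connections ? WAL_LEVEL_LOG_SHIPPING_HS
                                : WAL_LEVEL_LOG_SHIPPING;
}
```

Note that there is deliberately no fourth return value for "hot standby
information without log shipping"; that is the category I would rather not
add.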

...Robert