Thread: Streaming replication and triggering failover

Streaming replication and triggering failover

From
Heikki Linnakangas
Date:
The trigger file logic feels a bit backwards. As the patch stands, when
the standby starts up, it retries connecting to the master server
indefinitely, until a connection is successfully established. Then it
streams until the connection breaks. If the connection is dropped
abruptly, because of a network problem or crash in the master, standby
retries indefinitely.

If master is shut down cleanly, standby gets out of recovery mode, and
starts up. Unless the trigger file is present; if it is, standby waits
for it to go away before finishing recovery.

So the trigger file is really a "holdoff file", like a safety catch on a
gun. At the very least it should be renamed, but I don't think that's a
very useful behavior anyway.

It doesn't seem wise to consider a clean shutdown of the master as a
signal to trigger failover. If you're setting up a HA system, that by
itself is not robust enough; you also need to trigger failover if the
master goes down unexpectedly, or if the standby was disconnected for
some reason when the master was shut down. Secondly, what if you want to
restart the master server, without initiating failover? You'll have to
restart the standby too, to have it reconnect.

Let's have a default of no failover, and retry connecting to the master
indefinitely. When you *do* want to fail over, create the trigger file.
When the standby sees the trigger file, it should stop streaming, finish
up replaying what it had streamed up to that point, and start up as new
master.
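
To make the proposed behavior concrete, here is a rough simulation of it. This is only a sketch, not code from the patch: run_standby, trigger_exists and replay are hypothetical names, and the connection is modeled as a finite list of events rather than a real stream.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional


def run_standby(
    events: Iterable[Optional[List[str]]],
    trigger_exists: Callable[[], bool],
    replay: Callable[[str], None],
) -> str:
    """Simulate the proposed standby loop over a finite list of events.

    Each event is a list of WAL records received from the master, or None
    for a failed or dropped connection (a clean master shutdown included).
    """
    backlog: Deque[str] = deque()  # records received but not yet replayed
    for event in events:
        if trigger_exists():
            # Stop streaming, replay what was already received, then promote.
            while backlog:
                replay(backlog.popleft())
            return "promoted"
        if event is None:
            # No failover by default: simply try to reconnect.
            continue
        backlog.extend(event)
        # Replay lags behind receipt: at most one record per iteration here.
        if backlog:
            replay(backlog.popleft())
    # Only reachable because the simulated event list is finite.
    return "standby"
```

In a real setup trigger_exists would presumably be something like a check for the configured file on disk, and the loop would never run out of events.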

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication and triggering failover

From
Magnus Hagander
Date:
On Fri, Jan 8, 2010 at 10:58, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The trigger file logic feels a bit backwards. As the patch stands, when
> the standby starts up, it retries connecting to the master server
> indefinitely, until a connection is successfully established. Then it
> streams until the connection breaks. If the connection is dropped
> abruptly, because of a network problem or crash in the master, standby
> retries indefinitely.
>
> If master is shut down cleanly, standby gets out of recovery mode, and
> starts up. Unless the trigger file is present; if it is, standby waits
> for it to go away before finishing recovery.
>
> So the trigger file is really a "holdoff file", like a safety catch on a
> gun. At the very least it should be renamed, but I don't think that's a
> very useful behavior anyway.
>
> It doesn't seem wise to consider a clean shutdown of the master as a
> signal to trigger failover. If you're setting up a HA system, that by
> itself is not robust enough; you also need to trigger failover if the
> master goes down unexpectedly, or if the standby was disconnected for
> some reason when the master was shut down. Secondly, what if you want to
> restart the master server, without initiating failover? You'll have to
> restart the standby too, to have it reconnect.
>
> Let's have a default of no failover, and retry connecting to the master
> indefinitely. When you *do* want to fail over, create the trigger file.
> When the standby sees the trigger file, it should stop streaming, finish
> up replaying what it had streamed up to that point, and start up as new
> master.

+1.

The default should be to "maintain the replication cluster", if
nothing else then by principle of least surprise.

It would also agree with a well-established procedure, which is what
pg_standby does. Keeping the same basic behavior around something like
this can only be a good thing.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


Re: Streaming replication and triggering failover

From
Heikki Linnakangas
Date:
Magnus Hagander wrote:
> On Fri, Jan 8, 2010 at 10:58, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> So the trigger file is really a "holdoff file", like a safety catch on a
>> gun. At the very least it should be renamed, but I don't think that's a
>> very useful behavior anyway.
>>
>> It doesn't seem wise to consider a clean shutdown of the master as a
>> signal to trigger failover. If you're setting up a HA system, that by
>> itself is not robust enough; you also need to trigger failover if the
>> master goes down unexpectedly, or if the standby was disconnected for
>> some reason when the master was shut down. Secondly, what if you want to
>> restart the master server, without initiating failover? You'll have to
>> restart the standby too, to have it reconnect.
>>
>> Let's have a default of no failover, and retry connecting to the master
>> indefinitely. When you *do* want to fail over, create the trigger file.
>> When the standby sees the trigger file, it should stop streaming, finish
>> up replaying what it had streamed up to that point, and start up as new
>> master.
> 
> +1.
> 
> The default should be to "maintain the replication cluster", if
> nothing else then by principle of least surprise.
> 
> It would also agree with a well-established procedure, which is what
> pg_standby does. Keeping the same basic behavior around something like
> this can only be a good thing.

Thinking more clearly, my comment above about the trigger file logic
being backwards was bollocks; if the master is shut down, standby waits
for the trigger file to appear, not to go away. And creating the trigger
file during replication causes it to finish, and failover to happen.

Nevertheless, let's make the default "no failover" if no trigger file
location is configured, and remove the notion that normal shutdown of
master stops recovery.
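
As a sketch of that rule (an illustration only, not the patch itself): the function name below borrows from CheckForStandbyTrigger(), but both functions and their parameters are hypothetical Python stand-ins.

```python
import os
from typing import Optional


def check_for_standby_trigger(trigger_file: Optional[str]) -> bool:
    """Return True only if a trigger file is configured and exists."""
    if not trigger_file:
        # No trigger file location configured: failover is never triggered.
        return False
    return os.path.exists(trigger_file)


def next_action(master_shut_down_cleanly: bool, trigger_file: Optional[str]) -> str:
    """Decide what the standby does once the connection to the master ends."""
    # master_shut_down_cleanly is deliberately ignored: a clean shutdown is
    # treated exactly like a dropped connection.
    if check_for_standby_trigger(trigger_file):
        return "promote"
    return "retry"
```

With no trigger file configured, next_action() returns "retry" regardless of how the master went away.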

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication and triggering failover

From
Fujii Masao
Date:
On Fri, Jan 8, 2010 at 7:41 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Thinking more clearly, my comment above about the trigger file logic
> being backwards was bollocks; if the master is shut down, standby waits
> for the trigger file to appear, not to go away. And creating the trigger
> file during replication causes it to finish, and failover to happen.
>
> Nevertheless, let's make the default "no failover" if no trigger file
> location is configured, and remove the notion that normal shutdown of
> master stops recovery.

You dropped the CheckForStandbyTrigger() call at the end of recovery.
I think this would be a problem when an invalid record is found before
we reach a streaming recovery state. The standby would be outside the
control of the clusterware and would be brought up, which might cause a
split-brain syndrome. Shouldn't we have something to prevent such
unexpected activation?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication and triggering failover

From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> You dropped the CheckForStandbyTrigger() call at the end of recovery.
> I think this would be a problem when an invalid record is found before
> we reach a streaming recovery state. The standby would be outside the
> control of the clusterware and would be brought up, which might cause a
> split-brain syndrome. Shouldn't we have something to prevent such
> unexpected activation?

I modified ReadRecord to PANIC if an invalid record is found during
streaming recovery.
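
Roughly, the idea looks like the sketch below. It is a simplified illustration, not the actual ReadRecord code: the record layout (payload plus CRC), the Panic exception and the read_record signature are all made up for the example. Returning None outside streaming recovery is also an assumption, based on ordinary recovery treating an invalid record as the end of WAL.

```python
import zlib
from typing import Optional, Tuple


class Panic(RuntimeError):
    """Stand-in for a PANIC-level error that stops the server."""


def read_record(record: Tuple[bytes, int], streaming: bool) -> Optional[bytes]:
    """Validate one (payload, crc) record and return its payload."""
    payload, crc = record
    if zlib.crc32(payload) == crc:
        return payload
    if streaming:
        # Never treat a bad record as end of recovery while streaming,
        # so the standby cannot come up as a master by accident.
        raise Panic("invalid record found during streaming recovery")
    # Assumption: outside streaming recovery, an invalid record means end of WAL.
    return None
```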

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming replication and triggering failover

From
Fujii Masao
Date:
On Fri, Jan 8, 2010 at 10:31 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>> You dropped the CheckForStandbyTrigger() call at the end of recovery.
>> I think this would be a problem when an invalid record is found before
>> we reach a streaming recovery state. The standby would be outside the
>> control of the clusterware and would be brought up, which might cause a
>> split-brain syndrome. Shouldn't we have something to prevent such
>> unexpected activation?
>
> I modified ReadRecord to PANIC if an invalid record is found during
> streaming recovery.

Oh, sorry. It was my misunderstanding :(

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming replication and triggering failover

From
Simon Riggs
Date:
On Fri, 2010-01-08 at 12:41 +0200, Heikki Linnakangas wrote:

> let's make the default "no failover" if no trigger file
> location is configured, and remove the notion that normal shutdown of
> master stops recovery.

+1

-- 
 Simon Riggs           www.2ndQuadrant.com