Thread: Mult-standby streaming replication master failover

Mult-standby streaming replication master failover

From

Vladislav Tchernev

Date:

09 February 2015, 19:26:57

Hello All,

I am trying to figure out a clear way to preform a post-disaster (where the master hasn't had the chance to shut down properly) master failover in a multi-standby streaming replication setup. I found a couple of old posts suggesting that the choice of the new master is very important and if not done correctly might lead to database corruption.The posts suggest that the first thing to do is to find the most caught-up replica, by compare WAL replay locations among all standby servers. This makes perfect scene but also makes the failover process a little bit more complicated and difficult to automate. I guess my question is, does my understanding is right and if there are any changes in this regard in the newer Postgresql 9.3/9.4 releases.

Best,

Vlad

Re: Mult-standby streaming replication master failover

From

Michael Paquier

Date:

10 February 2015, 03:09:32

On Tue, Feb 10, 2015 at 1:26 AM, Vladislav Tchernev <vtchernev@gmail.com> wrote:
> I am trying to figure out a clear way to preform a post-disaster (where the
> master hasn't had the chance to shut down properly) master failover in a
> multi-standby streaming replication setup. I found a couple of old posts
> suggesting that the choice of the new master is very important and if not
> done correctly might lead to database corruption.The posts suggest that the
> first thing to do is to find the most caught-up replica, by compare WAL
> replay locations among all standby servers. This makes perfect scene but
> also makes the failover process a little bit more complicated and difficult
> to automate. I guess my question is, does my understanding is right and if
> there are any changes in this regard in the newer Postgresql 9.3/9.4
> releases.

You are right to take this approach. One of the modifications that 9.3
brought is that a standby is able to follow a timeline switch, so by
promoting the most-updated standby all the other existing standbys
will be able to follow it just by redirecting their primary_conninfo
flow to the newly-promoted node. For the old master, you may as well
rewind it with pg_rewind if WAL forked (assuming that page checksum is
enabled for 9.3~ or that wal_log_hints is enabled in 9.4~).
--
Michael