Thread: Streaming replication failover/failback

Streaming replication failover/failback

From: Israel Brewster
Date: 17 November 2016, 00:51:38

I've been playing around with streaming replication, and discovered that the following series of steps *appears* to work without complaint:

- Start with master on server A, slave on server B, replicating via streaming replication with replication slots.

- Shut down master on A

- Promote slave on B to master

- Create recovery.conf on A pointing to B

- Start (as slave) on A, streaming from B

After those steps, A comes up as a streaming replica of B, and works as expected. In my testing I can go back and forth between the two servers all day using the above steps.
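For concreteness, the recovery.conf from the fourth step might look roughly like what the following sketch renders. This is only an illustration for 9.4-era servers: the host name, slot name, and replication user are hypothetical, and the recovery_target_timeline line is an assumption (the steps above do not mention it) that lets A follow the new timeline B starts when it is promoted.

```python
def render_recovery_conf(primary_host, slot_name, user="replicator", port=5432):
    """Return recovery.conf text for a standby that streams from primary_host.

    All argument values are illustrative; nothing here is taken from the
    original setup.
    """
    conninfo = "host={} port={} user={}".format(primary_host, port, user)
    lines = [
        "standby_mode = 'on'",
        "primary_conninfo = '{}'".format(conninfo),
        "primary_slot_name = '{}'".format(slot_name),
        # Assumption: follow the timeline created by promoting the other server.
        "recovery_target_timeline = 'latest'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names: A would write this file before starting as a standby of B.
    print(render_recovery_conf("server-b.example.com", "server_a_slot"), end="")
```

Note that a physical replication slot is local to the server it was created on, so the slot named here would have to exist on B before A connects.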

My understanding from my initial research, however, is that this shouldn't be possible - I should need to perform a new basebackup from B to A after promoting B to master before I can restart A as a slave. Is the observed behavior then just a "lucky fluke" that I shouldn't rely on? Or is it expected behavior and my understanding about the need for a new basebackup is simply off? Does the new pg_rewind feature of 9.5 change things? If so, how?

Thanks for your time!

-----------------------------------------------

Israel Brewster

Systems Analyst II

Ravn Alaska

5245 Airport Industrial Rd

Fairbanks, AK 99709

(907) 450-7293

-----------------------------------------------

Re: Streaming replication failover/failback

From: Adrian Klaver
Date: 17 November 2016, 01:24:49

On 11/16/2016 04:51 PM, Israel Brewster wrote:
> I've been playing around with streaming replication, and discovered that
> the following series of steps *appears* to work without complaint:
>
> - Start with master on server A, slave on server B, replicating via
> streaming replication with replication slots.
> - Shut down master on A
> - Promote slave on B to master
> - Create recovery.conf on A pointing to B
> - Start (as slave) on A, streaming from B
>
> After those steps, A comes up as a streaming replica of B, and works as
> expected. In my testing I can go back and forth between the two servers
> all day using the above steps.
>
> My understanding from my initial research, however, is that this
> shouldn't be possible - I should need to perform a new basebackup from B
> to A after promoting B to master before I can restart A as a slave. Is
> the observed behavior then just a "lucky fluke" that I shouldn't rely

You don't say how active the database is, but I'm going to say it is not
active enough for the WAL files on B to go out of scope for A in the
time it takes you to do the switchover.

> on? Or is it expected behavior and my understanding about the need for a
> new basebackup is simply off? Does the new pg_rewind feature of 9.5
> change things? If so, how?

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Streaming replication failover/failback

From: Jehan-Guillaume de Rorthais
Date: 17 November 2016, 08:39:19

On Wed, 16 Nov 2016 15:51:26 -0900
Israel Brewster <israel@ravnalaska.net> wrote:

> I've been playing around with streaming replication, and discovered that the
> following series of steps *appears* to work without complaint:
>
> - Start with master on server A, slave on server B, replicating via streaming
> replication with replication slots.
> - Shut down master on A
> - Promote slave on B to master
> - Create recovery.conf on A pointing to B
> - Start (as slave) on A, streaming from B
>
> After those steps, A comes up as a streaming replica of B, and works as
> expected. In my testing I can go back and forth between the two servers all
> day using the above steps.
>
> My understanding from my initial research, however, is that this shouldn't be
> possible - I should need to perform a new basebackup from B to A after
> promoting B to master before I can restart A as a slave. Is the observed
> behavior then just a "lucky fluke" that I shouldn't rely on?

No, it's not a "lucky fluke".

See
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459

The only thing you really need to pay attention to is that the standby was in
streaming replication when you instructed the master to shut down, and that it
stayed connected until the master had fully stopped.

If you really want to check everything, use pg_xlogdump on the standby and make
sure the standby received the "shutdown checkpoint" from the master and wrote
it in its WAL.
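As a rough sketch of that check, the following filters text produced by pg_xlogdump for shutdown checkpoint records. It assumes the record description contains either "CHECKPOINT_SHUTDOWN" (9.5 and later) or "checkpoint: ... shutdown" (9.4); the exact output format varies between versions, so treat the pattern as an assumption to verify against your own output. The SAMPLE text is hand-written and abbreviated, not captured from a real server.

```python
import re

# Assumed description styles for a shutdown checkpoint record:
#   9.5+: "desc: CHECKPOINT_SHUTDOWN redo 0/5000028; ...; shutdown"
#   9.4:  "desc: checkpoint: redo 0/5000028; ...; shutdown"
_SHUTDOWN = re.compile(r"desc: (CHECKPOINT_SHUTDOWN\b|checkpoint: .*\bshutdown\s*$)")


def find_shutdown_checkpoints(dump_text):
    """Return the pg_xlogdump lines that describe a shutdown checkpoint."""
    return [
        line
        for line in dump_text.splitlines()
        if "rmgr: XLOG" in line and _SHUTDOWN.search(line)
    ]


# Hand-written, abbreviated sample in the general shape of pg_xlogdump output.
SAMPLE = """\
rmgr: Heap        lsn: 0/04FFFF60, desc: INSERT off 3
rmgr: XLOG        lsn: 0/04FFFFB8, desc: CHECKPOINT_ONLINE redo 0/4FFFF80; online
rmgr: XLOG        lsn: 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000028; shutdown
"""

if __name__ == "__main__":
    for line in find_shutdown_checkpoints(SAMPLE):
        print(line)
```

In practice the input would be the output of pg_xlogdump run against the last WAL segment(s) in the standby's pg_xlog directory; its `-r XLOG` option restricts the dump to that resource manager and keeps the output short.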

> Or is it expected behavior and my understanding about the need for a new
> basebackup is simply off?

This is expected, but taking a new basebackup was a requirement for some time.

> Does the new pg_rewind feature of 9.5 change things? If so, how?

pg_rewind helps if your standby was not connected when you lost/stopped your
master. It reverts the last transactions that the master received but that were
not streamed to the promoted standby.
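For illustration, a pg_rewind invocation for that case could be assembled as in the sketch below. The data directory and connection string are hypothetical. As far as I know, pg_rewind needs the old master to be cleanly shut down, and the cluster must have been running with wal_log_hints = on or with data checksums enabled, so this is a sketch under those assumptions rather than a recipe.

```python
def pg_rewind_command(target_pgdata, source_conninfo, dry_run=True):
    """Build the argument list for rewinding the old master against the new one.

    target_pgdata:   data directory of the old master (must be shut down).
    source_conninfo: libpq connection string for the promoted standby.
    """
    cmd = [
        "pg_rewind",
        "--target-pgdata=" + target_pgdata,
        "--source-server=" + source_conninfo,
    ]
    if dry_run:
        # Go through the motions without modifying the target directory.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Hypothetical path and host; pass the list to subprocess.run() to execute it.
    print(" ".join(pg_rewind_command("/var/lib/pgsql/data",
                                     "host=server-b.example.com user=postgres")))
```

After a successful rewind, the old master still needs a recovery.conf pointing at the new master before it is started as a standby.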

Regards,

Re: Streaming replication failover/failback

From: Israel Brewster
Date: 17 November 2016, 17:27:08

On Nov 16, 2016, at 4:24 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> You don't say how active the database is, but I'm going to say it is not active enough for the WAL files on B to go out of scope for A in the time it takes you to do the switchover.

Yeah, not very - this was just in testing, so essentially no activity. So between your response and the one from Jehan-Guillaume de Rorthais, what I'm hearing is that my information about the basebackup being needed was obsoleted with the patch he linked to, and as long as I do a clean shutdown of the master, and don't do too much activity on the *new* master before bringing the old master up as a slave (such that WAL files are lost), then the above failover/failback procedure is perfectly fine to rely on in production - I don't have to worry about there being any hidden gotchas like the new slave not *really* replicating or something.
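One way to check that the new slave really is replicating is to compare WAL positions on the two servers. On 9.x, pg_current_xlog_location() on the master and pg_last_xlog_replay_location() on the slave return positions of the form 'X/Y' (two hexadecimal numbers). The sketch below only does the arithmetic on two such strings; fetching them from the servers is left out, and the example values are made up.

```python
def lsn_to_int(lsn):
    """Convert a WAL location such as '0/3000060' to an absolute byte position."""
    hi, lo = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after the low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def replay_lag_bytes(master_lsn, standby_lsn):
    """Bytes of WAL the standby still has to replay (0 means fully caught up)."""
    return lsn_to_int(master_lsn) - lsn_to_int(standby_lsn)


if __name__ == "__main__":
    # Made-up positions: the standby is 0x60 bytes behind the master.
    print(replay_lag_bytes("0/3000060", "0/3000000"))
```

A lag that stays at or near zero while the master is taking writes suggests the slave is keeping up; a lag that only grows suggests it is not.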

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Streaming replication failover/failback

From: Israel Brewster
Date: 17 November 2016, 17:32:14

On Nov 16, 2016, at 11:39 PM, Jehan-Guillaume de Rorthais <ioguix@free.fr> wrote:

> pg_rewind helps if your standby was not connected when you lost/stopped your
> master. It reverts the last transactions that the master received but that were
> not streamed to the promoted standby.

Ah, ok. So kinda an emergency recovery tool then? One step before resorting to backups? In any case, it sounds like it's not something I should need in a *normal* failover scenario, where the master goes down and the slave gets promoted.

Thanks for the information!

Re: Streaming replication failover/failback

From: Jehan-Guillaume de Rorthais
Date: 18 November 2016, 14:48:17

On Thu, 17 Nov 2016 08:26:59 -0900
Israel Brewster <israel@ravnalaska.net> wrote:

> > On Nov 16, 2016, at 4:24 PM, Adrian Klaver <adrian.klaver@aklaver.com>
> > wrote:
> > You don't say how active the database is, but I going to say it is not
> > active enough for the WAL files on B to go out for scope for A in the time
> > it takes you to do the switch over.
>
> Yeah, not very - this was just in testing, so essentially no activity. So
> between your response and the one from Jehan-Guillaume de Rorthais, what I'm
> hearing is that my information about the basebackup being needed was
> obsoleted with the patch he linked to, and as long as I do a clean shutdown
> of the master, and don't do too much activity on the *new* master before
> bringing the old master up as a slave (such that WAL files are lost)

Just set up WAL archiving to avoid this (and get PITR backups as a side effect).
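As a minimal sketch of what that involves, the function below renders the settings for a simple cp-based archive, modelled on the example commands in the PostgreSQL documentation. The archive directory is hypothetical and is assumed to be reachable from both servers (for example a shared mount); a production setup would more likely use a dedicated archiving tool.

```python
def archive_settings(archive_dir):
    """Return (postgresql.conf lines, recovery.conf line) for a cp-based WAL archive.

    archive_dir is a hypothetical directory visible to both servers.
    """
    # Refuse to overwrite a segment that is already in the archive.
    archive_command = "test ! -f {d}/%f && cp %p {d}/%f".format(d=archive_dir)
    restore_command = "cp {d}/%f %p".format(d=archive_dir)
    master_lines = [
        "wal_level = hot_standby",  # spelled 'replica' on 9.6 and later
        "archive_mode = on",
        "archive_command = '{}'".format(archive_command),
    ]
    standby_line = "restore_command = '{}'".format(restore_command)
    return master_lines, standby_line


if __name__ == "__main__":
    master_lines, standby_line = archive_settings("/mnt/wal_archive")
    print("\n".join(master_lines))
    print(standby_line)
```

With restore_command in its recovery.conf, a standby that finds a segment is no longer available over streaming can fetch it from the archive instead, which covers the case where the new master has already recycled the WAL the old master needs.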

Re: Streaming replication failover/failback

From: Israel Brewster
Date: 22 November 2016, 16:57:23

On Nov 18, 2016, at 5:48 AM, Jehan-Guillaume de Rorthais <ioguix@free.fr> wrote:

> Just set up WAL archiving to avoid this (and get PITR backups as a side effect).

Good point. Streaming replication may not *need* WAL archiving to work, but having it can provide other benefits than just replication. I'll have to look more into the PITR backup though - that's something that sounds great to have, but I have no clue, beyond the concept, how it works. :-)
