Thread: Failover with a tertiary read-only secondary

Failover with a tertiary read-only secondary

From

"Hammerman, Joseph"

Date:

31 March 2017, 19:24:58

Hi Postgres Admin users list,

We have a PostgreSQL 9.3 master instance. We have a secondary set up as a hot streaming replica. There is a third machine, a read-only slave, that replicates through a CNAME service DNS record. If the primary is demoted, the secondary is promoted, and the CNAME is redirected to the secondary, the read-only slave will not pick up replication. The error messages in the log are:

2017-03-31 11:08:03 EDT [31062]: [799-1] ( - ) LOG: restarted WAL streaming at 0/9000000 on timeline 1

2017-03-31 11:08:03 EDT [31062]: [800-1] ( - ) LOG: replication terminated by primary server

2017-03-31 11:08:03 EDT [31062]: [801-1] ( - ) DETAIL: End of WAL reached on timeline 1 at 0/9000090

I believe that these messages are being generated because the WAL checkpoints on the promoted secondary are not the same as they were on the demoted primary. Is there a method to remediate this? Or do I need to perform a full resync of both the read-only secondary and the demoted primary?

Thanks in advance for any assistance anyone can provide,

Joseph Hammerman

Re: Failover with a tertiary read-only secondary

From

bricklen

Date:

31 March 2017, 21:31:37

On Fri, Mar 31, 2017 at 9:24 AM, Hammerman, Joseph <JosephHammerman@iheartmedia.com> wrote:

We have a PostgreSQL 9.3 master instance. We have a secondary set up as a hot streaming replica. There is a third machine, a read-only slave, that replicates through a CNAME service DNS record. If the primary is demoted, the secondary is promoted, and the CNAME is redirected to the secondary, the read-only slave will not pick up replication. The error messages in the log are:

2017-03-31 11:08:03 EDT [31062]: [799-1] ( - ) LOG: restarted WAL streaming at 0/9000000 on timeline 1
2017-03-31 11:08:03 EDT [31062]: [800-1] ( - ) LOG: replication terminated by primary server
2017-03-31 11:08:03 EDT [31062]: [801-1] ( - ) DETAIL: End of WAL reached on timeline 1 at 0/9000090

Two questions:

1). After the DNS change, did you restart the downstream replica? (after the upstream was promoted)

2). You might be able to sidestep the issue if you set up cascading replicas, as in the first replica is streaming from the master, and the second replica is streaming from the upstream replica.
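For reference, a cascading setup along these lines might be sketched as follows for 9.3. The host name replica1.example.com and the replicator role are placeholders, not values from this thread, and the upstream replica would also need a matching replication entry in pg_hba.conf.

```conf
# postgresql.conf on the first (upstream) replica:
# it must accept replication connections from the downstream replica.
hot_standby = on
max_wal_senders = 3        # any value > 0; 3 is an arbitrary example

# recovery.conf on the second (downstream) replica:
# stream from the first replica rather than from the master.
standby_mode = 'on'
primary_conninfo = 'host=replica1.example.com port=5432 user=replicator'
```

With this layout, the downstream replica keeps the same upstream when the first replica is promoted, so no DNS change is involved in that failover.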

Re: Failover with a tertiary read-only secondary

From

"Hammerman, Joseph"

Date:

31 March 2017, 21:35:06

Hi bricklen,

Thanks for the assistance! To answer your questions,

1). Yes, the behavior persisted following a restart.

2). Not a bad idea, but that only delays the necessity of a resync until the next failover… unless I’m missing something?

Thanks,

Joseph Hammerman

Re: Failover with a tertiary read-only secondary

From

bricklen

Date:

31 March 2017, 21:40:12

On Fri, Mar 31, 2017 at 11:35 AM, Hammerman, Joseph <JosephHammerman@iheartmedia.com> wrote:

Not a bad idea, but that only delays the necessity of a resync until the next failover…. Unless I’m missing something?

I've used cascading replication extensively over the past few years and rarely had to resync a downstream replica. The several thousand Postgres clusters I'm administering now are almost exclusively set up with the primary replica streaming from the master and the master shipping WALs to the secondary replica in the DR data centre, so I can't test any cascading replication promotions at the moment. My suggestion is to test your replication setup in a cascade and see what happens; I don't expect you'll need to resync. If you do, report back with how you've set up your replication settings.

Re: Failover with a tertiary read-only secondary

From

Jerry Sievers

Date:

01 April 2017, 01:45:58

bricklen <bricklen@gmail.com> writes:

> On Fri, Mar 31, 2017 at 11:35 AM, Hammerman, Joseph <JosephHammerman@iheartmedia.com> wrote:
>
>      1. Not a bad idea, but that only delays the necessity of a resync until the next failover…. Unless I’m missing something?
>
> I've used cascading replication extensively over the past few years and rarely had to resync a downstream replica. The several thousand Postgres clusters I'm
> administering now are almost exclusively set up with the primary replica streaming from the master and the master shipping WALs to the secondary replica in DR data
> centre, so I can't test any cascading replication promotions at the moment. My suggestion is to test your replication setup in a cascade and see what happens, I don't
> expect you'll need to resync. If you do, report back with how you've set up your replication settings.

Repointing a tertiary standby to a promoted peer should not be difficult
if that tertiary standby was at or trailing the log position at which the
new master was promoted, and recovery.conf has recovery_target_timeline=latest.
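As a rough sketch, the tertiary standby's recovery.conf on 9.3 might look something like the following. The host name db-service.example.com stands in for the CNAME and the replicator role is likewise a placeholder; neither comes from this thread.

```conf
# recovery.conf on the tertiary standby (PostgreSQL 9.3 syntax)
standby_mode = 'on'

# Placeholder connection string; the host would be the CNAME service record.
primary_conninfo = 'host=db-service.example.com port=5432 user=replicator'

# Follow the newest timeline found instead of staying on the timeline
# that was current when the base backup was taken.
recovery_target_timeline = 'latest'
```

On 9.3, recovery.conf is only read at server start, so the standby has to be restarted after this file is changed.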

Also, the newly promoted master needs to have been configured for
archiving already, with wal_level set to something other than minimal,
before it is promoted.

In other words, there must be no break in the WAL stream with regard to
an adequate wal_level, which depends on the tertiary standby's needs.
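To make that concrete, the secondary that may later be promoted could carry settings along these lines in its postgresql.conf before any failover happens. The archive path and the sender count are illustrative assumptions only.

```conf
# postgresql.conf on the secondary that may be promoted (PostgreSQL 9.3)

# hot_standby-level WAL lets downstream standbys serve read-only queries.
wal_level = hot_standby

# Archiving is configured ahead of time so it is already in place
# once the server is promoted.
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'   # hypothetical archive location

# Allow downstream standbys to connect for streaming.
max_wal_senders = 5                             # arbitrary example value
```

wal_level, archive_mode and max_wal_senders all require a server restart to change, which is why they need to be in place before promotion rather than adjusted afterwards.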

HTH

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800