Streaming Replication Failover - Mailing list pgsql-general

From ning chan
Subject Streaming Replication Failover
Date
Msg-id CAG0k5vDu=qkKBWWa=jiSDxhXk6jww3-vPKHLQYq=aTzq9NcF8w@mail.gmail.com
List pgsql-general
Hi,
I have a cluster of 3 nodes: Standby A streams from the Primary, and Standby B streams from Standby A (cascading replication).
I failed over the cluster as follows:
1) stopped the Primary
2) promoted Standby A

Now I see from the syslog on Standby B that it is complaining about a timeline mismatch.

Replication Status from Primary
=============================================
|Parameters           |        Value        |
=============================================
|backend_start        | 2013-01-16 23:05:48 |
|pid                  |        17851        |
|usesysid             |          10         |
|usename              |       postgres      |
|application_name     |       StandbyA      |
|client_addr          |     10.89.94.31     |
|client_hostname      |                     |
|client_port          |        43558        |
|state                |      streaming      |
|sent_location        |      0/1EAC3E68     |
|write_location       |      0/1EAC3E68     |
|flush_location       |      0/1EAC3E68     |
|replay_location      |      0/1EAC3E68     |
|sync_priority        |          0          |
|sync_state           |        async        |
=============================================

Replication Status from Standby A
=============================================
|Parameters           |        Value        |
=============================================
|backend_start        | 2013-01-16 23:06:56 |
|pid                  |        12320        |
|usesysid             |          10         |
|usename              |       postgres      |
|application_name     |       StandByB      |
|client_addr          |     10.89.94.29     |
|client_hostname      |                     |
|client_port          |        48214        |
|state                |      streaming      |
|sent_location        |      0/1EAC3E68     |
|write_location       |      0/1EAC3E68     |
|flush_location       |      0/1EAC3E68     |
|replay_location      |      0/1EAC3E68     |
|sync_priority        |          0          |
|sync_state           |        async        |
=============================================

Now I fail over the Primary.
Standby A syslog:
Jan 16 23:08:12 se032c-94-31 postgres[12316]: [3-1] 12316FATAL:  replication terminated by primary server
Jan 16 23:08:12 se032c-94-31 postgres[12312]: [5-1] 12312LOG:  redo starts at 0/1EAC3E68

Standby B syslog:
Jan 16 23:09:48 localhost postgres[3932]: [5-1] LOG:  redo starts at 0/1EAC3E68

As soon as I promoted Standby A, replication between A and B broke. The Standby B syslog shows the following:
Jan 16 23:11:28 localhost postgres[3945]: [2-1] FATAL:  timeline 15 of the primary does not match recovery target timeline 14
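
For anyone who wants to detect this condition from a script, here is a minimal sketch that pulls the two timeline IDs out of a log line like the one above. The function name timeline_mismatch is made up for illustration, and the pattern only assumes the message wording shown in this post.

```shell
# Hypothetical helper (not part of PostgreSQL): extract the timeline IDs
# from a "does not match recovery target timeline" log line.
# Prints "<primary_timeline> <target_timeline>", or nothing if the line
# is not a timeline-mismatch message.
timeline_mismatch() {
  printf '%s\n' "$1" | sed -n \
    's/.*timeline \([0-9][0-9]*\) of the primary does not match recovery target timeline \([0-9][0-9]*\).*/\1 \2/p'
}

# The log line reported on Standby B in this post.
line='Jan 16 23:11:28 localhost postgres[3945]: [2-1] FATAL:  timeline 15 of the primary does not match recovery target timeline 14'
timeline_mismatch "$line"
```

For the line above this prints "15 14": the promoted Standby A is on timeline 15 while Standby B is still targeting timeline 14.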

Now my question is: since A and B were in sync, why does promoting A break the replication?

To resolve the problem, I need to stop the engine on B, rsync from A, and start the engine on B again:
rsync -a --progress --exclude postgresql.conf --exclude recovery.done --exclude pg_hba.conf root@10.89.94.31:/opt/postgres/9.2/data/* /opt/postgres/9.2/data
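
Put together, the three steps could be sketched as a small script run on Standby B. The data directory, source host, and rsync excludes come from the command above. The use of pg_ctl to stop and start the engine, the DRY_RUN switch, and the run helper are assumptions added for illustration. The sketch uses a trailing slash on the source instead of the `/*` glob, so top-level dot files are copied too. It defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch only: resync Standby B from Standby A, as described in the post.
# Assumptions: pg_ctl is on PATH and manages this data directory;
# DRY_RUN and run() are illustrative additions, not from the post.
PGDATA=${PGDATA:-/opt/postgres/9.2/data}
SOURCE=${SOURCE:-root@10.89.94.31:/opt/postgres/9.2/data}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

resync_standby() {
  # 1) stop the engine on B
  run pg_ctl stop -D "$PGDATA" -m fast &&
  # 2) copy the data directory from A, keeping B's own config files
  run rsync -a --progress \
      --exclude postgresql.conf --exclude recovery.done --exclude pg_hba.conf \
      "$SOURCE/" "$PGDATA" &&
  # 3) start the engine on B again
  run pg_ctl start -D "$PGDATA"
}

resync_standby
```

Setting DRY_RUN=0 would execute the commands for real. Note that copying the data directory of a running server is normally done between pg_start_backup() and pg_stop_backup() on the source; that step is left out here because the command above does not use it.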

Do I need to sync the whole data directory from A? I have a small DB now (2 tables with only a few rows), but this may take a long time with a much larger DB. Is there any shortcut? Why do I need to do the rsync when A and B were originally in sync?

Thanks~
Ning
