From: johnlumby@hotmail.com
To: pgsql-general-owner@postgresql.org
Subject: how to switch old replication Master to new Standby after promoting old Standby
Date: Mon, 14 Mar 2016 13:23:29 -0400
In a scenario involving replication where no failure occurs but I want to interchange Master <->Standby,
*and* want to avoid making another full base backup and rsyncing it across,
I have found it easy to promote old Standby to new Master with pg_ctl promote,
but very difficult to restart the old Primary as a new Standby *without* performing new base backup.
Assume current 9.5.1 and using streaming replication with a named replication slot if relevant
Second assumption - I am able to temporarily prevent any relational updates to the database before I start the
switchover
e.g. the
default_transaction_read_only = on
setting together with being able to control what transactions do if anything
So I reach a point where both systems have postgresql running without any replication,
both have identical content in all databases (that I can control),
and I am willing to tolerate short restarts if need be and also to scp/rsync the contents of pg_xlog
and other small files but *not* the entire cluster directory or any database base directories.
What do I do next?
Here is what I have found seems to work but I am not sure it is robust:
1. shut down both new Master and intended-to-be-new-Standby
2. on intended-to-be-new-Standby, remove the entire content of pg_xlog and the global/pg_control
3. from new Master , tar + scp the entire content of pg_xlog and the global/pg_control to intended-to-be-new-Standby
4. create intended-to-be-new-Standby's recovery.conf,
specifying recovery_target_timeline = 'latest'
(but I think it works with this setting omitted)
5. start new Master
6. start new Standby and up it comes
in its postgres log(**) I see
LOG: 00000: database system was shut down at 2016-03-14 16:41:24 GMT
LOCATION: StartupXLOG, xlog.c:5936
LOG: 00000: entering standby mode
LOCATION: StartupXLOG, xlog.c:6016
LOG: 00000: consistent recovery state reached at C/480000D0
LOCATION: CheckRecoveryConsistency, xlog.c:7493
LOG: 00000: invalid record length at C/480000D0
LOCATION: ReadRecord, xlog.c:3960
LOG: 00000: started streaming WAL from primary at C/48000000 on timeline 3
LOCATION: WalReceiverMain, walreceiver.c:358
LOG: 00000: redo starts at C/480000D0
LOCATION: StartupXLOG, xlog.c:6700
Is the invalid record length msg anything to worry about?
But this method is purely empirical. Is it robust? Anyone have any better recommendations?
Cheers, John Lumby
(**Note this log was from a pre-release 9.5, 9.5alpha2
I don't have 9.5.1 to hand at present