Thread: Timeline switch problem with streaming replication with 3 nodes
From
Mads.Tandrup@schneider-electric.com
Date:
Hi all,

I've set up 3 PostgreSQL nodes: 1 master and 2 slaves. They have been configured for streaming replication with synchronous replication on. I've set up a virtual IP that points to the current master node.

When I kill the master node, the slave that was synchronous gets promoted to master and gets the shared virtual IP. But sometimes the other slave doesn't accept the switch, and instead its log says:

  2012-09-24 10:45:06 GMT 4663 FATAL: replication terminated by primary server
  2012-09-24 10:45:06 GMT 4662 LOG: record with zero length at 0/200009E8
  2012-09-24 10:45:06 GMT 10209 FATAL: could not connect to the primary server: could not connect to server: Connection refused
      Is the server running on host "10.216.73.60" and accepting
      TCP/IP connections on port 5432?
  2012-09-24 10:45:11 GMT 10272 FATAL: could not connect to the primary server: FATAL: recovery is still in progress, can't accept WAL streaming connections
  2012-09-24 10:45:16 GMT 10326 FATAL: timeline 10 of the primary does not match recovery target timeline 9
  2012-09-24 10:45:21 GMT 10388 FATAL: timeline 10 of the primary does not match recovery target timeline 9
  2012-09-24 10:45:26 GMT 10451 FATAL: timeline 10 of the primary does not match recovery target timeline 9
  ...

It keeps repeating the last line.

The new master says:

  2012-09-24 10:45:06 GMT 8394 FATAL: replication terminated by primary server
  2012-09-24 10:45:06 GMT 8393 LOG: record with zero length at 0/200009E8
  2012-09-24 10:45:11 GMT 8393 LOG: trigger file found: /tmp/postgresql_trigger
  2012-09-24 10:45:11 GMT 8393 LOG: redo done at 0/20000990
  2012-09-24 10:45:11 GMT 8393 LOG: last completed transaction was at log time 2012-09-24 10:45:01.917175+00
  2012-09-24 10:45:11 GMT 8393 LOG: selected new timeline ID: 10
  2012-09-24 10:45:11 GMT 10741 [unknown] FATAL: recovery is still in progress, can't accept WAL streaming connections
  2012-09-24 10:45:12 GMT 8393 LOG: archive recovery complete
  2012-09-24 10:45:12 GMT 8391 LOG: database system is ready to accept connections
  2012-09-24 10:45:12 GMT 10743 LOG: autovacuum launcher started

The recovery.conf is:

  standby_mode = 'on'
  primary_conninfo = 'host=10.216.73.60 port=5432 user=root password=onyx application_name=10.216.73.195'
  recovery_target_timeline = 'latest'
  trigger_file = '/tmp/postgresql_trigger'

I've found a discussion (http://archives.postgresql.org/pgsql-general/2011-12/msg00553.php) on a similar issue a while back. They talk about sharing WAL files as the solution. But I thought that the idea with streaming replication was that I would not need a shared storage.

Is that the only solution or is there another way?

Best regards,
Mads
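
To detect this state automatically, note that the repeating FATAL line in the second slave's log carries both timeline IDs. Below is a minimal Python sketch, assuming the message is worded exactly as in the log above. The function name `parse_timeline_mismatch` is made up for illustration and is not part of PostgreSQL.

```python
import re

# Pattern for the message text as it appears in the log above; other
# PostgreSQL versions may word the message differently.
MISMATCH_RE = re.compile(
    r"timeline (\d+) of the primary does not match recovery target timeline (\d+)"
)


def parse_timeline_mismatch(line):
    """Return (primary_timeline, target_timeline), or None if the line does not match."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("2012-09-24 10:45:16 GMT 10326 FATAL: timeline 10 of the primary "
        "does not match recovery target timeline 9")
print(parse_timeline_mismatch(line))
```

For the line shown, the function returns the primary's timeline (10) and the slave's recovery target timeline (9). A monitoring script could alert when the first is larger than the second.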

On Mon, Sep 24, 2012 at 7:37 PM, <Mads.Tandrup@schneider-electric.com> wrote:
> I've found a discussion
> (http://archives.postgresql.org/pgsql-general/2011-12/msg00553.php) on a
> similar issue a while back. They talk about sharing WAL files as the
> solution. But I thought that the idea with streaming replication was that I
> would not need a shared storage.
>
> Is that the only solution or is there another way?

Things should work if you manually copy the 0000000A.history file (the history file for timeline 10; the file name is the timeline ID in hexadecimal) from the new master's pg_xlog directory to the slave's. This method isn't documented, but seems to work. I believe the problem is being fixed by letting the history files be shipped along with the WAL files.

http://archives.postgresql.org/pgsql-general/2011-12/msg00456.php

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
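
The manual copy suggested above can be sketched in Python. PostgreSQL names timeline history files after the timeline ID written as eight hexadecimal digits, so timeline 10 corresponds to 0000000A.history. The helper names are hypothetical, and the sketch assumes the new master's pg_xlog directory is reachable as a local path (for example through a mount, or after fetching the file with scp). It is an illustration of the workaround, not a documented procedure.

```python
import shutil
from pathlib import Path


def history_file_name(timeline_id):
    """Name of the timeline history file: the ID as 8 uppercase hex digits."""
    return f"{timeline_id:08X}.history"


def copy_history_file(timeline_id, master_xlog_dir, slave_xlog_dir):
    """Copy one history file between pg_xlog directories (hypothetical helper)."""
    name = history_file_name(timeline_id)
    source = Path(master_xlog_dir) / name
    target = Path(slave_xlog_dir) / name
    shutil.copy2(source, target)  # raises FileNotFoundError if the source is missing
    return target


print(history_file_name(10))
```

The copied file must be readable by the operating system user that runs PostgreSQL on the slave, so ownership and permissions may need adjusting after the copy.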