Thread: Timeline switch problem with streaming replication with 3 nodes

Timeline switch problem with streaming replication with 3 nodes

From
Mads.Tandrup@schneider-electric.com
Date:
Hi All

I've set up 3 PostgreSQL nodes: 1 master and 2 slaves. They have been
configured for streaming replication with synchronous replication enabled.
I've set up a virtual IP that points to the current master node.

When I kill the master node, the slave that was synchronous gets promoted
to master and gets the shared virtual IP.

But sometimes the other slave doesn't accept the switch, and instead the log
on that slave says:

2012-09-24 10:45:06 GMT 4663  FATAL:  replication terminated by primary
server
2012-09-24 10:45:06 GMT 4662  LOG:  record with zero length at 0/200009E8
2012-09-24 10:45:06 GMT 10209  FATAL:  could not connect to the primary
server: could not connect to server: Connection refused
                Is the server running on host "10.216.73.60" and accepting
                TCP/IP connections on port 5432?

2012-09-24 10:45:11 GMT 10272  FATAL:  could not connect to the primary
server: FATAL:  recovery is still in progress, can't accept WAL streaming
connections

2012-09-24 10:45:16 GMT 10326  FATAL:  timeline 10 of the primary does not
match recovery target timeline 9
2012-09-24 10:45:21 GMT 10388  FATAL:  timeline 10 of the primary does not
match recovery target timeline 9
2012-09-24 10:45:26 GMT 10451  FATAL:  timeline 10 of the primary does not
match recovery target timeline 9
...

And it continues to repeat the last line.
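
The repeating FATAL line carries both timeline IDs, so a monitoring script could spot this stuck state. A minimal sketch in Python, assuming the standby's log text has already been read into a string; the function name find_timeline_mismatch is hypothetical, and the pattern is based only on the message wording shown above:

```python
import re

# Matches the message wording from the standby log above. Using \s+ between
# words tolerates the line wrapping that mail clients introduce.
MISMATCH_RE = re.compile(
    r"timeline\s+(\d+)\s+of\s+the\s+primary\s+does\s+not\s+match\s+"
    r"recovery\s+target\s+timeline\s+(\d+)"
)


def find_timeline_mismatch(log_text):
    """Return (primary_timeline, standby_target_timeline) taken from the
    last mismatch message in log_text, or None if there is no such message."""
    matches = MISMATCH_RE.findall(log_text)
    if not matches:
        return None
    primary, target = matches[-1]
    return int(primary), int(target)


if __name__ == "__main__":
    sample = (
        "2012-09-24 10:45:16 GMT 10326  FATAL:  timeline 10 of the primary does not\n"
        "match recovery target timeline 9\n"
    )
    print(find_timeline_mismatch(sample))
```

A result where the first number is larger than the second would correspond to the situation described here: the primary has moved to a new timeline that the standby has not followed.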

The new master says:
2012-09-24 10:45:06 GMT 8394  FATAL:  replication terminated by primary
server
2012-09-24 10:45:06 GMT 8393  LOG:  record with zero length at 0/200009E8
2012-09-24 10:45:11 GMT 8393  LOG:  trigger file
found: /tmp/postgresql_trigger
2012-09-24 10:45:11 GMT 8393  LOG:  redo done at 0/20000990
2012-09-24 10:45:11 GMT 8393  LOG:  last completed transaction was at log
time 2012-09-24 10:45:01.917175+00
2012-09-24 10:45:11 GMT 8393  LOG:  selected new timeline ID: 10
2012-09-24 10:45:11 GMT 10741 [unknown] FATAL:  recovery is still in
progress, can't accept WAL streaming connections
2012-09-24 10:45:12 GMT 8393  LOG:  archive recovery complete
2012-09-24 10:45:12 GMT 8391  LOG:  database system is ready to accept
connections
2012-09-24 10:45:12 GMT 10743  LOG:  autovacuum launcher started

The recovery.conf is:
standby_mode = 'on'
primary_conninfo = 'host=10.216.73.60  port=5432 user=root password=onyx
application_name=10.216.73.195'
recovery_target_timeline = 'latest'
trigger_file = '/tmp/postgresql_trigger'
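
For checking a file like this from a script, here is a small Python sketch. It assumes each setting sits on one line in the quoted name = 'value' form used above; the function names parse_recovery_conf and follows_latest_timeline are hypothetical, and the sample conninfo is shortened for illustration:

```python
import re

# One "name = 'value'" setting per line; '' inside a value is an escaped quote.
# An optional trailing "# comment" is allowed after the closing quote.
SETTING_RE = re.compile(r"^\s*(\w+)\s*=\s*'((?:[^']|'')*)'\s*(?:#.*)?$")


def parse_recovery_conf(text):
    """Parse quoted settings into a dict. Blank lines, comment lines and
    anything not in the quoted form are skipped."""
    settings = {}
    for line in text.splitlines():
        match = SETTING_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2).replace("''", "'")
    return settings


def follows_latest_timeline(settings):
    """True if the standby is configured to follow the newest timeline."""
    return settings.get("recovery_target_timeline") == "latest"


if __name__ == "__main__":
    sample = (
        "standby_mode = 'on'\n"
        "primary_conninfo = 'host=10.216.73.60 port=5432'\n"
        "recovery_target_timeline = 'latest'\n"
        "trigger_file = '/tmp/postgresql_trigger'\n"
    )
    conf = parse_recovery_conf(sample)
    print(conf["trigger_file"], follows_latest_timeline(conf))
```

Passing this check only confirms the setting is present. As the log above shows, the standby can still get stuck even with recovery_target_timeline set to 'latest'.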

I've found a discussion
(http://archives.postgresql.org/pgsql-general/2011-12/msg00553.php) on a
similar issue a while back. They talk about sharing WAL files as the
solution. But I thought that the idea with streaming replication was that I
would not need shared storage.

Is that the only solution or is there another way?

Best regards,
Mads



Re: Timeline switch problem with streaming replication with 3 nodes

From
Stuart Bishop
Date:
On Mon, Sep 24, 2012 at 7:37 PM,  <Mads.Tandrup@schneider-electric.com> wrote:

> I've found a discussion
> (http://archives.postgresql.org/pgsql-general/2011-12/msg00553.php) on a
> similar issue a while back. They talk about sharing WAL files as the
> solution. But I thought that the idea with streaming replication was that I
> would not need shared storage.
>
> Is that the only solution or is there another way?

Things should work if you manually copy the 0000000A.history file (the
history file for timeline 10, named with the ID as eight hex digits) from
the new master's pg_xlog directory to the slave's.
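
A rough Python sketch of that manual copy, in case it helps. The history file name is the timeline ID written as eight hexadecimal digits, so timeline 10 gives 0000000A.history. The function names are hypothetical, and the sketch assumes both pg_xlog directories are reachable as local paths (for example over a mount); in practice the file would more likely be fetched with scp or rsync:

```python
import shutil
from pathlib import Path


def history_filename(timeline_id):
    """Name of the timeline history file: the ID as 8 uppercase hex digits."""
    return f"{timeline_id:08X}.history"


def copy_history_file(timeline_id, master_xlog_dir, standby_xlog_dir):
    """Copy the history file for timeline_id from the new master's pg_xlog
    directory into the standby's pg_xlog directory.

    Both arguments are assumed to be locally reachable directory paths.
    Returns the destination path.
    """
    name = history_filename(timeline_id)
    src = Path(master_xlog_dir) / name
    dst = Path(standby_xlog_dir) / name
    shutil.copy2(src, dst)  # copy2 also preserves timestamps where possible
    return dst


if __name__ == "__main__":
    # The new master's log above reports "selected new timeline ID: 10".
    print(history_filename(10))
```

The copied file needs to be readable by the user running the standby's postgres process.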

This method isn't documented, but it seems to work. I believe the problem
is being fixed by letting the history files be shipped along with the
WAL files.

http://archives.postgresql.org/pgsql-general/2011-12/msg00456.php

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/