Testing Cascading Replication - Mailing list pgsql-hackers

From	Josh Berkus
Subject	Testing Cascading Replication
Date	June 26, 2013 22:43:04
Msg-id	51CB6E70.7040201@agliodbs.com
List	pgsql-hackers
Folks,

Wanted to give you the below testing emails from DHAVAL JAISWAL.  He's
been testing 9.3's streaming-only cascading replication, and so far it
works as advertised.  What he found in his tests was:

a) he could not remaster to a former replica which was behind the replica
he was trying to remaster

b) when servers were correctly caught up, remastering worked correctly

So, all good so far.

Text follows

======================

TEST 1: remastering failure due to picking the wrong replica
I tested the following cascading replication scenario on the PostgreSQL 9.3
beta version.
             A
  B.....................E
C...D
 1)   A is the master,
      B & E are pointing to A,
      C & D are pointing to B.


Tested scenarios are as follows:

a) When (A) failed, we were able to promote either B or E as the master. If
B is promoted, C & D continue to talk to B as usual. If E is promoted, I
changed recovery.conf on C & D, replacing the IP and port so that they point
to E. After restarting C & D, they started to talk to E.

b) When (B) failed, I changed recovery.conf on C & D, replacing the IP and
port so that they point to E. After restarting C & D, they started to talk
to E. In the end A is still the master, E points to A, and C & D point to E.
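
The repointing step in a) and b) amounts to rewriting recovery.conf on each
standby and restarting it. A minimal sketch of generating that file follows.
The host, port, and the helper name render_recovery_conf are hypothetical;
recovery_target_timeline = 'latest' and the cp-based restore_command are
assumptions inferred from the logs in this test, not settings confirmed by
the tester.

```python
def render_recovery_conf(host, port, archive_dir=None):
    """Return recovery.conf text for a 9.3 streaming standby.

    host/port identify the upstream server to stream from.
    archive_dir is optional; when given, a cp-based restore_command is added.
    """
    lines = [
        "standby_mode = 'on'",
        f"primary_conninfo = 'host={host} port={port}'",
        # Assumption: follow the newest timeline found after a promotion.
        "recovery_target_timeline = 'latest'",
    ]
    if archive_dir is not None:
        lines.append(f"restore_command = 'cp {archive_dir}/%f %p'")
    return "\n".join(lines) + "\n"


# Hypothetical address for node E; substitute the real IP and port.
print(render_recovery_conf("192.0.2.5", 5432, "/usr/local/arch"), end="")
```

The generated text would replace recovery.conf in the data directory of C
and D before they are restarted.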



Now, in scenario a), when we promote B as the master after A fails, C & D
continue to talk to B. However, when I change recovery.conf on E, replacing
the IP and port with those of B, E throws the following errors.

cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
cp: cannot stat `/usr/local/arch/00000003.history': No such file or directory
LOG: entering standby mode
cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
cp: cannot stat `/usr/local/arch/000000020000000000000027': No such file or directory
cp: cannot stat `/usr/local/arch/000000010000000000000027': No such file or directory
cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/272DE57C on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/272DC548
LOG: startup process (PID 6155) exited with exit code 1
LOG: aborting startup due to startup process failure
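
The DETAIL line above carries the reason for the failure: E's latest
checkpoint on timeline 1 lies past the point where B forked off to timeline
2. A rough sketch of that comparison follows, using the two locations from
the log. The helper names parse_lsn and can_follow are hypothetical, and the
check is a simplification of what the server does at startup.

```python
def parse_lsn(text):
    """Convert an 'X/Y' WAL location (two hex fields) into one integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def can_follow(standby_lsn, fork_lsn):
    """Simplified rule: a standby can join the new timeline only if it has
    not already moved past the fork point on the old timeline."""
    return parse_lsn(standby_lsn) <= parse_lsn(fork_lsn)


e_checkpoint = "0/272DE57C"  # E's latest checkpoint, from the DETAIL line
b_fork = "0/272DC548"        # where timeline 2 forked off timeline 1

print(can_follow(e_checkpoint, b_fork))             # False
print(parse_lsn(e_checkpoint) - parse_lsn(b_fork))  # 8244
```

Under this reading, E was 8244 bytes of WAL ahead of B at promotion time,
which is why it cannot be repointed at B.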

======================

TEST 2: Remastering success
The structure is:

                          A (Master)
(Slave1) B........................................E (Slave2)
(Slave3) C.....D (Slave4)

(1)  Stopped node (A).

(2)  Following is the output on slave1 and slave2 after stopping node (A):

slave 1

postgres=# select pg_last_xact_replay_timestamp();
  pg_last_xact_replay_timestamp
----------------------------------
 2013-06-26 12:13:54.056954+05:30     ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 0/3E000084                           ---------------> received wal
(1 row)



slave 2

postgres=# select pg_last_xact_replay_timestamp();
  pg_last_xact_replay_timestamp
----------------------------------
 2013-06-26 12:13:54.056954+05:30     ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 0/3E000084                           ---------------> received wal
(1 row)




(3)  Following are the logs on slave1 while node (A) was stopped:

FATAL:  could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "127.0.0.1" and accepting
        TCP/IP connections on port 5432?



(4)  Following are the logs on slave2 while node (A) was stopped:

FATAL:  could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "127.0.0.1" and accepting
        TCP/IP connections on port 5432?




(5)  Below are the logs of slave1 when it was promoted to master:

LOG:  received promote request
LOG:  redo done at 0/3E000024
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started



(6)  Below are the logs of slave2 after its recovery.conf was changed to
point to slave1 and it was restarted:

LOG:  database system was shut down in recovery at 2013-06-26 12:28:49 IST
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/3E000084
LOG:  invalid record length at 0/3E000084
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/3E000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/3E000084
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/3E000000 on timeline 2
LOG:  redo starts at 0/3E000084



At this point slave2 has successfully connected to the new master (slave1)
and replication is working again.
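
The check in step (2), that both standbys report the same receive location
before one is promoted, can be sketched as below. The helper names parse_lsn
and most_advanced are hypothetical; the locations are the ones reported by
slave1 and slave2 in this test.

```python
def parse_lsn(text):
    """Convert an 'X/Y' WAL location (two hex fields) into one integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced(receive_locations):
    """Return the names of the standbys holding the highest receive location.

    receive_locations maps a standby name to the text returned by
    pg_last_xlog_receive_location() on that standby.
    """
    best = max(parse_lsn(loc) for loc in receive_locations.values())
    return sorted(
        name for name, loc in receive_locations.items() if parse_lsn(loc) == best
    )


reported = {"slave1": "0/3E000084", "slave2": "0/3E000084"}
print(most_advanced(reported))  # ['slave1', 'slave2']
```

Because both standbys are equally caught up here, either one is a safe
promotion target and the other can follow it onto timeline 2.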


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com