Testing Cascading Replication - Mailing list pgsql-hackers
From | Josh Berkus |
---|---|
Subject | Testing Cascading Replication |
Date | |
Msg-id | 51CB6E70.7040201@agliodbs.com |
List | pgsql-hackers |
Folks,

Wanted to give you the below testing emails from DHAVAL JAISWAL. He's been testing 9.3's streaming-only cascading replication, and so far it works as advertised. What he found in his tests was:

a) he could not remaster to a former replica which was behind the replica he was trying to remaster

b) when servers were correctly caught up, remastering worked correctly

So, all good so far. Text follows.

======================

TEST 1: remastering failure due to picking the wrong replica

I have tested the scenario below for cascading replication on the PostgreSQL 9.3 beta.

                A
    B.....................E
  C...D

1) A is the master, B & E are pointing to A, and C & D are pointing to B.

Tested scenarios are as follows:

a) When (A) failed, we were able to promote B or E as the master. If B was promoted, C & D continued to talk to B as usual. If E was promoted, I changed recovery.conf on C & D, replacing the port and IP so that they point to E. After restarting C & D, they started to talk to E.

b) When (B) failed, I changed recovery.conf on C & D, replacing the port and IP so that they point to E. After restarting C & D, they started to talk to E. In the end A is the master, E is pointing to A, and C & D are pointing to E.

Now, in scenario a), when we promote B as the master on failure of A, C & D continue to talk to B. However, when I change recovery.conf on E to use the port and IP of B, it throws the following errors.

  cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
  cp: cannot stat `/usr/local/arch/00000003.history': No such file or directory
  LOG: entering standby mode
  cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
  cp: cannot stat `/usr/local/arch/000000020000000000000027': No such file or directory
  cp: cannot stat `/usr/local/arch/000000010000000000000027': No such file or directory
  cp: cannot stat `/usr/local/arch/00000002.history': No such file or directory
  FATAL: requested timeline 2 is not a child of this server's history
  DETAIL: Latest checkpoint is at 0/272DE57C on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/272DC548
  LOG: startup process (PID 6155) exited with exit code 1
  LOG: aborting startup due to startup process failure

======================

TEST 2: Remastering success

The structure is:

                             A (Master)
  (Slave1) B........................................E (Slave2)
  (Slave3) C.....D (Slave4)

(1) Stopped node (A).

(2) Following are the snapshots of slave1 & slave2 after stopping node (A).

slave 1

  postgres=# select pg_last_xact_replay_timestamp();
     pg_last_xact_replay_timestamp
  ----------------------------------
   2013-06-26 12:13:54.056954+05:30   ---------------> timing
  (1 row)

  postgres=# select pg_last_xlog_receive_location();
   pg_last_xlog_receive_location
  -------------------------------
   0/3E000084   ----------------> received wal
  (1 row)

slave 2

  postgres=# select pg_last_xact_replay_timestamp();
     pg_last_xact_replay_timestamp
  ----------------------------------
   2013-06-26 12:13:54.056954+05:30   ---------------> timing
  (1 row)

  postgres=# select pg_last_xlog_receive_location();
   pg_last_xlog_receive_location
  -------------------------------
   0/3E000084   ----------------> received wal
  (1 row)

(3) Following are the logs on slave1 while node (A) was stopped.

  FATAL: could not connect to the primary server: could not connect to server: Connection refused
         Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432?

(4) Following are the logs on slave2 while node (A) was stopped.

  FATAL: could not connect to the primary server: could not connect to server: Connection refused
         Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 5432?

(5) Below are the logs of slave1 when it was promoted to master.

  LOG: received promote request
  LOG: redo done at 0/3E000024
  LOG: selected new timeline ID: 2
  LOG: archive recovery complete
  LOG: database system is ready to accept connections
  LOG: autovacuum launcher started

(6) Below are the logs after changing recovery.conf on slave2 so that it points to slave1, and restarting it.

  LOG: database system was shut down in recovery at 2013-06-26 12:28:49 IST
  LOG: entering standby mode
  LOG: consistent recovery state reached at 0/3E000084
  LOG: invalid record length at 0/3E000084
  LOG: database system is ready to accept read only connections
  LOG: fetching timeline history file for timeline 2 from primary server
  LOG: started streaming WAL from primary at 0/3E000000 on timeline 1
  LOG: replication terminated by primary server
  DETAIL: End of WAL reached on timeline 1 at 0/3E000084
  LOG: new target timeline is 2
  LOG: restarted WAL streaming at 0/3E000000 on timeline 2
  LOG: redo starts at 0/3E000084

Now, at this point it has successfully connected to the master and started working again.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
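
The TEST 1 failure can be read directly off the two WAL locations in the DETAIL line: E's latest checkpoint (0/272DE57C) lies past the point where B's new timeline forked (0/272DC548), so E cannot follow B. A minimal sketch of that comparison follows, assuming the usual reading of a WAL location as two hexadecimal halves of one 64-bit position. The helper names `parse_lsn` and `ahead_of_fork` are hypothetical and are not part of PostgreSQL.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a WAL location such as '0/272DE57C' into one integer.

    Assumption: the text before the slash is the high 32 bits and the
    text after it is the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def ahead_of_fork(standby_lsn: str, fork_lsn: str) -> bool:
    """Return True if the standby is already past the timeline fork point."""
    return parse_lsn(standby_lsn) > parse_lsn(fork_lsn)


# Values copied from the TEST 1 error message.
e_checkpoint = "0/272DE57C"   # latest checkpoint on E, timeline 1
b_fork_point = "0/272DC548"   # where timeline 2 forked off on B

is_ahead = ahead_of_fork(e_checkpoint, b_fork_point)
bytes_ahead = parse_lsn(e_checkpoint) - parse_lsn(b_fork_point)
print(is_ahead, bytes_ahead)
```

In TEST 2 both standbys reported the same receive location (0/3E000084) before the promotion, so the same check returns False for either one against the other.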