Re: Switching timeline over streaming replication - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Switching timeline over streaming replication |
Date | |
Msg-id | 50AA6B1B.7020501@vmware.com |
In response to | Re: Switching timeline over streaming replication (Amit Kapila <amit.kapila@huawei.com>) |
Responses | Re: Switching timeline over streaming replication |
List | pgsql-hackers |
On 10.10.2012 17:54, Thom Brown wrote:
> Hmm... I get something different. When I promote standby B, standby
> C's log shows:
>
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: re-handshaking at position 0/4000000 on tli 1
> LOG: fetching timeline history file for timeline 2 from primary server
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: new target timeline is 2
>
> Then when I stop then start standby C I get:
>
> FATAL: timeline history was not contiguous
> LOG: startup process (PID 22986) exited with exit code 1
> LOG: aborting startup due to startup process failure

Found & fixed this one. A paren was misplaced in the tliOfPointInHistory() function.

On 16.11.2012 16:01, Amit Kapila wrote:
> The following problems are observed while testing of the patch.
> Defect-1:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. Promote standby B.
> 5. After successful time line switch in cascade standby C, stop C.
> 6. Restart C, startup is failing with the following error.
>
> LOG: database system was shut down in recovery at 2012-11-16
> 16:26:29 IST
> FATAL: requested timeline 2 does not contain minimum recovery point
> 0/30143A0 on timeline 1
> LOG: startup process (PID 415) exited with exit code 1
> LOG: aborting startup due to startup process failure
>
> The above defect is already discussed in the following link.
> http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@ka
> pila@huawei.com

Fixed now, sorry for neglecting this earlier. The problem was that if the primary switched to a new timeline at position X, and the standby followed that switch, on restart it would set minRecoveryPoint to X, and the new

> Defect-2:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B with 'recovery_target_timeline'
> option in
> recovery.conf is disabled.
> 4. Promote standby B.
> 5. Cascade Standby C is not able to follow the new master B because of
> timeline difference.
> 6. Try to stop the cascade standby C (which is failing and the
> server is not stopping,
> observations are as WAL Receiver process is still running and
> clients are not allowing to connect).
>
> The defect-2 is happened only once in my test environment, I will try to
> reproduce it.

Found it. When restarting the streaming, I reused the WALRCV_STARTING state. But if you then exited recovery, WalRcvRunning() would think that the walreceiver was stuck starting up, because it had been longer than 10 seconds since it was launched and it was still in WALRCV_STARTING state, so it put it into WALRCV_STOPPED state. And walreceiver didn't expect to be put into STOPPED state after having started up successfully already. I added a new explicit WALRCV_RESTARTING state to handle that.

In addition to the above bug fixes, there are some small changes since the last patch version:

* I changed the LOG messages printed in various stages a bit, hopefully making it easier to follow what's happening. Feedback is welcome on when and how we should log, and whether some error messages need clarification.

* 'ps' display is updated when the walreceiver enters and exits idle mode

* Updated pg_controldata and pg_resetxlog to handle the new minRecoveryPointTLI field I added to the control file.

* startup process wakes up walsenders at the end of recovery, so that cascading standbys are notified immediately when the timeline changes. That removes some of the delay in the process.

- Heikki
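
To illustrate the WALRCV_STARTING problem described above, here is a minimal stand-alone sketch in C. It is not the PostgreSQL source: the struct, the helper names, the WALRCV_ACTIVE state and the plain integer timestamps are all assumptions made for illustration. Only WALRCV_STARTING, WALRCV_STOPPED, WALRCV_RESTARTING and the 10-second limit come from the message. The sketch assumes the launch time is not reset when streaming is restarted, which is what makes the reused STARTING state look like a stuck startup.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the walreceiver states (illustrative only). */
typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,
    WALRCV_ACTIVE,      /* assumed name for "started up and streaming" */
    WALRCV_RESTARTING   /* the new explicit state mentioned in the message */
} WalRcvStateSketch;

#define STARTUP_TIMEOUT_SECS 10     /* the 10-second limit from the message */

typedef struct
{
    WalRcvStateSketch state;
    long        launch_time;        /* seconds; set once when launched */
} WalRcvSketch;

/*
 * Mimics the check described for WalRcvRunning(): a receiver that is still
 * in STARTING state long after launch is assumed to be stuck and is marked
 * STOPPED. Returns true if the receiver is still considered running.
 */
static bool
walrcv_running_sketch(WalRcvSketch *rcv, long now)
{
    assert(now >= rcv->launch_time);

    if (rcv->state == WALRCV_STARTING &&
        now - rcv->launch_time > STARTUP_TIMEOUT_SECS)
        rcv->state = WALRCV_STOPPED;

    return rcv->state != WALRCV_STOPPED;
}

/*
 * Simulates a receiver launched at t=0 that starts up successfully, is later
 * asked to restart streaming using 'restart_state', and is then checked at
 * t=60. Returns the state the receiver ends up in.
 */
static WalRcvStateSketch
state_after_restart(WalRcvStateSketch restart_state)
{
    WalRcvSketch rcv = {WALRCV_STARTING, 0};

    rcv.state = WALRCV_ACTIVE;      /* startup completed */
    rcv.state = restart_state;      /* restart streaming; launch_time unchanged */
    (void) walrcv_running_sketch(&rcv, 60);

    return rcv.state;
}
```

In this sketch, state_after_restart(WALRCV_STARTING) ends in WALRCV_STOPPED, while state_after_restart(WALRCV_RESTARTING) is left alone by the timeout check, which is the distinction the new state is meant to provide.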