Re: Switching timeline over streaming replication - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Switching timeline over streaming replication |
Date | |
Msg-id | 00e101cdd3b7$2ff195d0$8fd4c170$@kapila@huawei.com |
In response to | Re: Switching timeline over streaming replication (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Switching timeline over streaming replication
|
List | pgsql-hackers |
On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
> On 05.12.2012 14:32, Amit Kapila wrote:
> > On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
> >> After some diversions to fix bugs and refactor existing code, I've
> >> committed a couple of small parts of this patch, which just add some
> >> sanity checks to notice incorrect PITR scenarios. Here's a new
> >> version of the main patch based on current HEAD.
> >
> > After testing with the new patch, the following problems are observed.
> >
> > Defect - 1:
> >
> > 1. start primary A
> > 2. start standby B following A
> > 3. start cascade standby C following B.
> > 4. start another standby D following C.
> > 5. Promote standby B.
> > 6. After successful time line switch in cascade standby C & D, stop D.
> > 7. Restart D, Startup is successful and connecting to standby C.
> > 8. Stop C.
> > 9. Restart C, startup is failing.
>
> Ok, the error I get in that scenario is:
>
> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure
>
>
> That mismatch causes the error. I'd like to fix this by always treating
> the checkpoint record to be part of the new timeline. That feels more
> correct. The most straightforward way to implement that would be to peek
> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
> it's a checkpoint record that changes TLI, set replayEndTLI to the new
> timeline before calling the redo-function. But it's a bit of a
> modularity violation to peek into the record like that.
>
> Or we could just revert the sanity check at beginning of recovery that
> throws the "requested timeline 2 does not contain minimum recovery point
> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
> record that says "unexpected timeline ID %u in checkpoint record, before
> reaching minimum recovery point %X/%X on timeline %u" checks basically
> the same thing, but at a later stage. However, the way
> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
> fix that.
>
> I'm thinking of something like the attached (with some more comments
> before committing). Thoughts?

This has fixed the problem reported.

However, I cannot work out whether there would be any problem if we removed the check "requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1" at the beginning of recovery and just updated replayEndTLI with ThisTimeLineID.

With Regards,
Amit Kapila.
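
For readers following the thread, the first option quoted above (peek at the record, and if it is a checkpoint that changes TLI, use the new timeline before calling the redo function) could be sketched roughly as below. This is only an illustration and not the attached patch: `SketchRecord` and `sketch_replay_end_tli` are hypothetical names, and the struct is a simplified stand-in rather than PostgreSQL's actual WAL record layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical, simplified stand-in for a WAL record (not the real layout). */
typedef struct SketchRecord
{
    bool        is_checkpoint;   /* true if this is a checkpoint record */
    TimeLineID  checkpoint_tli;  /* timeline stored in the checkpoint, if any */
} SketchRecord;

/*
 * Return the timeline to record as the "replay end" timeline for a record
 * that is about to be replayed.
 *
 * A checkpoint record that carries a different timeline than the one
 * currently being replayed is treated as already belonging to the new
 * timeline. Every other record stays on the current timeline.
 */
TimeLineID
sketch_replay_end_tli(const SketchRecord *rec, TimeLineID current_tli)
{
    assert(rec != NULL);

    if (rec->is_checkpoint && rec->checkpoint_tli != current_tli)
        return rec->checkpoint_tli;

    return current_tli;
}
```

In this sketch the caller would store the returned value before invoking the redo function, so the timeline-switching checkpoint is already accounted to the new timeline while it is replayed. It deliberately leaves out how minRecoveryPointTLI is updated, which the quoted message treats as a separate fix.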