Re: Switching timeline over streaming replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Switching timeline over streaming replication
Msg-id 00e101cdd3b7$2ff195d0$8fd4c170$@kapila@huawei.com
In response to Re: Switching timeline over streaming replication  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Switching timeline over streaming replication
List pgsql-hackers
On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
> On 05.12.2012 14:32, Amit Kapila wrote:
> > On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
> >> After some diversions to fix bugs and refactor existing code, I've
> >> committed a couple of small parts of this patch, which just add some
> >> sanity checks to notice incorrect PITR scenarios. Here's a new
> >> version of the main patch based on current HEAD.
> >
> > After testing with the new patch, the following problems are observed.
> >
> > Defect - 1:
> >
> >      1. start primary A
> >      2. start standby B following A
> >      3. start cascade standby C following B.
> >      4. start another standby D following C.
> >      5. Promote standby B.
> >      6. After a successful timeline switch in cascade standbys C & D, stop D.
> >      7. Restart D; startup succeeds and D connects to standby C.
> >      8. Stop C.
> >      9. Restart C; startup fails.
> 
> Ok, the error I get in that scenario is:
> 
> C 2012-12-05 19:55:43.840 EET 9283 FATAL:  requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG:  startup process (PID 9283) exited with exit code 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG:  aborting startup due to startup process failure
> 

> 
> That mismatch causes the error. I'd like to fix this by always treating
> the checkpoint record as part of the new timeline. That feels more
> correct. The most straightforward way to implement that would be to peek
> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
> it's a checkpoint record that changes TLI, set replayEndTLI to the new
> timeline before calling the redo-function. But it's a bit of a
> modularity violation to peek into the record like that.
> 
> Or we could just revert the sanity check at beginning of recovery that
> throws the "requested timeline 2 does not contain minimum recovery point
> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
> record that says "unexpected timeline ID %u in checkpoint record, before
> reaching minimum recovery point %X/%X on timeline %u" checks basically
> the same thing, but at a later stage. However, the way
> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
> fix that.
> 
> I'm thinking of something like the attached (with some more comments
> before committing). Thoughts?
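
For reference, the first option above (peeking at the record before updating replayEndRecPtr and replayEndTLI) could be sketched roughly as below. This is only an illustration under stated assumptions, not the attached patch and not the actual recovery code: WalRecord, ReplayState, currentTLI and begin_replay_of_record() are hypothetical names, and only replayEndRecPtr and replayEndTLI come from the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* WAL position, simplified to one integer */
typedef uint32_t TimeLineID;

/* Hypothetical, heavily simplified view of a WAL record. */
typedef struct
{
    bool       is_checkpoint;   /* true if this is a checkpoint record */
    TimeLineID checkpoint_tli;  /* TLI stored in it; valid only if is_checkpoint */
    XLogRecPtr end_ptr;         /* position just past the end of the record */
} WalRecord;

/* Hypothetical replay bookkeeping. */
typedef struct
{
    XLogRecPtr replayEndRecPtr;
    TimeLineID replayEndTLI;
    TimeLineID currentTLI;      /* stand-in for ThisTimeLineID */
} ReplayState;

/*
 * Called before the redo function for 'rec' runs.  If the record is a
 * checkpoint that changes the timeline, the record is already accounted
 * to the new timeline; otherwise the current timeline is used.
 * currentTLI itself is left alone, since the redo function would
 * normally be the one to switch it.
 */
void
begin_replay_of_record(ReplayState *st, const WalRecord *rec)
{
    TimeLineID tli = st->currentTLI;

    /* The "peek": look inside the record before replaying it. */
    if (rec->is_checkpoint && rec->checkpoint_tli != st->currentTLI)
        tli = rec->checkpoint_tli;

    st->replayEndRecPtr = rec->end_ptr;
    st->replayEndTLI = tli;
}
```

With this shape, a timeline-changing checkpoint record ending at 0/3023F08 would leave replayEndTLI at 2 rather than 1, so a minimum recovery point derived from it would be tagged with the new timeline. The modularity concern is visible in the sketch: the caller has to know what a checkpoint record looks like.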

This fixes the reported problem.
However, I cannot work out whether there would be any problem if we removed
the check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and just updated
replayEndTLI with ThisTimeLineID?
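
To make clear which check I mean, here is a rough sketch of how I understand it. This is not the real code: HistoryEntry, tli_of_point() and history_contains_min_recovery_point() are hypothetical names, and I am assuming the mismatch is that the minimum recovery point lies past the switch point but is still tagged with the old timeline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* WAL position, simplified to one integer */
typedef uint32_t TimeLineID;

/*
 * One entry of a hypothetical timeline history: timeline 'tli' covers
 * WAL positions in [begin, end).  end == 0 marks the newest timeline,
 * which is still open.
 */
typedef struct
{
    TimeLineID tli;
    XLogRecPtr begin;
    XLogRecPtr end;
} HistoryEntry;

/* Return the timeline that covers 'ptr', or 0 if no entry covers it. */
TimeLineID
tli_of_point(const HistoryEntry *history, size_t n, XLogRecPtr ptr)
{
    for (size_t i = 0; i < n; i++)
    {
        if (ptr >= history[i].begin &&
            (history[i].end == 0 || ptr < history[i].end))
            return history[i].tli;
    }
    return 0;
}

/*
 * Startup-time sanity check: the minimum recovery point is an end
 * position, so the last byte it covers is minRecoveryPoint - 1.  That
 * byte must belong to the timeline the point was tagged with.
 */
bool
history_contains_min_recovery_point(const HistoryEntry *history, size_t n,
                                    XLogRecPtr minRecoveryPoint,
                                    TimeLineID minRecoveryPointTLI)
{
    if (minRecoveryPoint == 0)
        return true;            /* nothing recorded yet, nothing to check */

    return tli_of_point(history, n, minRecoveryPoint - 1) == minRecoveryPointTLI;
}
```

For example, with a made-up switch point of 0/3023EA8 from timeline 1 to timeline 2, a minimum recovery point of 0/3023F08 tagged with timeline 1 fails this check, while the same point tagged with timeline 2 passes.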

With Regards,
Amit Kapila.



