Thread: Use pg_rewind when target timeline was switched

Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
Hackers,

attached patch allows pg_rewind to work when target timeline was switched. Actually, this patch fixes TODO from pg_rewind comments.

  /*
   * Trace the history backwards, until we hit the target timeline.
   *
   * TODO: This assumes that there are no timeline switches on the target
   * cluster after the fork.
   */

This patch allows pg_rewind to handle data directory synchronization is much more general way. For instance, user can return promoted standby to old master.

In this patch target timeline history is exposed as global variable. Index in target timeline history is used in function interfaces instead of specifying TLI directly. Thus, SimpleXLogPageRead() can easily start reading XLOGs from next timeline when current timeline ends.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> attached patch allows pg_rewind to work when target timeline was switched.
> Actually, this patch fixes TODO from pg_rewind comments.
>
>   /*
>    * Trace the history backwards, until we hit the target timeline.
>    *
>    * TODO: This assumes that there are no timeline switches on the target
>    * cluster after the fork.
>    */
>
> This patch allows pg_rewind to handle data directory synchronization is much
> more general way. For instance, user can return promoted standby to old
> master.

Yes. That would be useful.

> In this patch target timeline history is exposed as global variable. Index
> in target timeline history is used in function interfaces instead of
> specifying TLI directly. Thus, SimpleXLogPageRead() can easily start reading
> XLOGs from next timeline when current timeline ends.

The implementation looks rather neat by having a first look at it, but
the attached patch fails to compile:
pg_rewind.c:488:4: error: use of undeclared identifier 'targetEntry'                       targetEntry =
&targetHistory[i];                      ^
 
Nice to see as well that this has been added to the CF of September.
Regards,
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> attached patch allows pg_rewind to work when target timeline was switched.
> Actually, this patch fixes TODO from pg_rewind comments.
>
>   /*
>    * Trace the history backwards, until we hit the target timeline.
>    *
>    * TODO: This assumes that there are no timeline switches on the target
>    * cluster after the fork.
>    */
>
> This patch allows pg_rewind to handle data directory synchronization is much
> more general way. For instance, user can return promoted standby to old
> master.

Yes. That would be useful.

> In this patch target timeline history is exposed as global variable. Index
> in target timeline history is used in function interfaces instead of
> specifying TLI directly. Thus, SimpleXLogPageRead() can easily start reading
> XLOGs from next timeline when current timeline ends.

The implementation looks rather neat by having a first look at it, but
the attached patch fails to compile:
pg_rewind.c:488:4: error: use of undeclared identifier 'targetEntry'
                        targetEntry = &targetHistory[i];
                        ^
Nice to see as well that this has been added to the CF of September.

Uh, sorry. Fixed version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:


On Wed, Jul 22, 2015 at 4:28 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote
On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> attached patch allows pg_rewind to work when target timeline was switched.
> Actually, this patch fixes TODO from pg_rewind comments.
>
>   /*
>    * Trace the history backwards, until we hit the target timeline.
>    *
>    * TODO: This assumes that there are no timeline switches on the target
>    * cluster after the fork.
>    */
>
> This patch allows pg_rewind to handle data directory synchronization is much
> more general way.
For instance, user can return promoted standby to old master.

+               /*
+                * Since incomplete segments are copied into next timelines, find the
+                * lastest timeline holding required segment.
+                */
+               while (private->tliIndex < targetNentries - 1 &&
+                          targetHistory[private->tliIndex].end < targetSegEnd)
+               {
+                       private->tliIndex++;
+                       tli_index++;
+               }
It seems to me that the patch is able to handle timeline switches onwards, but not backwards and this is what would be required to return a promoted standby, that got switched to let's say timeline 2 when promoted, to an old master, that is still on timeline 1. This code actually fails when scanning for the last checkpoint before WAL forked as it will be on the timeline 1 of the old master. Imagine for example that the WAL has forked at  0/30XXXXX which is saved in segment 000000020000000000000003 (say 2/0/3) on the promoted standby, and that the last checkpoint record is on 0/20XXXXX, which is part of 000000010000000000000002 (1/0/2). I think that we should scan 2/0/3 (not the partial segment 1/0/3), and then 1/0/2 when looking for the last checkpoint record. Hence the startup index TLI should be set to the highest timeline and should be decremented depending on what is in the history file.
The code above looks correct to me when scanning the WAL history onwards though, which is what is done when extracting the page map, but not backwards when we try to find the last common checkpoint record. This code actually fails trying to open 2/0/2 that does not exist in the promoted standby's pg_xlog in my test case.

Attached is a small script I have used to reproduce the failure.

I think that the documentation needs a brush up as well to outline the fact that pg_rewind would be able to put back as well a standby in a cluster, after for example an operator mistake when promoting a node that should not be.
Thoughts?
--
Michael
Attachment

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Jul 22, 2015 at 4:28 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Jul 22, 2015 at 8:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote
On Mon, Jul 20, 2015 at 9:18 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> attached patch allows pg_rewind to work when target timeline was switched.
> Actually, this patch fixes TODO from pg_rewind comments.
>
>   /*
>    * Trace the history backwards, until we hit the target timeline.
>    *
>    * TODO: This assumes that there are no timeline switches on the target
>    * cluster after the fork.
>    */
>
> This patch allows pg_rewind to handle data directory synchronization is much
> more general way.
For instance, user can return promoted standby to old master.

+               /*
+                * Since incomplete segments are copied into next timelines, find the
+                * lastest timeline holding required segment.
+                */
+               while (private->tliIndex < targetNentries - 1 &&
+                          targetHistory[private->tliIndex].end < targetSegEnd)
+               {
+                       private->tliIndex++;
+                       tli_index++;
+               }
It seems to me that the patch is able to handle timeline switches onwards, but not backwards and this is what would be required to return a promoted standby, that got switched to let's say timeline 2 when promoted, to an old master, that is still on timeline 1. This code actually fails when scanning for the last checkpoint before WAL forked as it will be on the timeline 1 of the old master. Imagine for example that the WAL has forked at  0/30XXXXX which is saved in segment 000000020000000000000003 (say 2/0/3) on the promoted standby, and that the last checkpoint record is on 0/20XXXXX, which is part of 000000010000000000000002 (1/0/2). I think that we should scan 2/0/3 (not the partial segment 1/0/3), and then 1/0/2 when looking for the last checkpoint record. Hence the startup index TLI should be set to the highest timeline and should be decremented depending on what is in the history file.
The code above looks correct to me when scanning the WAL history onwards though, which is what is done when extracting the page map, but not backwards when we try to find the last common checkpoint record. This code actually fails trying to open 2/0/2 that does not exist in the promoted standby's pg_xlog in my test case.

Attached is a small script I have used to reproduce the failure.
 
Right, thanks! It should be fixed in the attached version of patch.

I think that the documentation needs a brush up as well to outline the fact that pg_rewind would be able to put back as well a standby in a cluster, after for example an operator mistake when promoting a node that should not be.

Yes. But from current pg_rewind documentation I can't figure out that it can't do so. However, we can add an explicit statement which claims that it can.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Attachment

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Tue, Sep 8, 2015 at 1:14 AM, Alexander Korotkov wrote:
> On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier wrote:
>> The code above looks correct to me when scanning the WAL history onwards
>> though, which is what is done when extracting the page map, but not
>> backwards when we try to find the last common checkpoint record. This code
>> actually fails trying to open 2/0/2 that does not exist in the promoted
>> standby's pg_xlog in my test case.
>>
>> Attached is a small script I have used to reproduce the failure.
>
>
> Right, thanks! It should be fixed in the attached version of patch.

So, instead of a code review, I have been torturing your patch and did
advanced tests on it with the attached script, that creates a cluster
as follows:
 master (5432)
  /        \
 1 (5433)   2 (5434)
 |
 3 (5435)
Once cluster is created, nodes are promoted in a certain order giving
them different timeline jump properties:
- master, stays on tli 1
- standby 1, tli 1->2
- standby 2, tli 1->3
- standby 3, tli 1->2->4
And data is inserted on each of them to make WAL fork properly.
Finally the script tries to rewind one node using another node as
source, and then tries to link this target node back to the source
node via streaming replication.

I have tested all the one-one permutations possible in the structure
above (see commented portions at the bottom of my script), and all of
them worked. I have to say that from the testing prospective this
patch looks in great shape, and will greatly improve the use cases of
pg_rewind!

I am planning to do as well a detailed code review rather soon.
Regards,
--
Michael

Attachment

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Sep 8, 2015 at 1:14 AM, Alexander Korotkov wrote:
> On Thu, Aug 20, 2015 at 9:57 AM, Michael Paquier wrote:
>> The code above looks correct to me when scanning the WAL history onwards
>> though, which is what is done when extracting the page map, but not
>> backwards when we try to find the last common checkpoint record. This code
>> actually fails trying to open 2/0/2 that does not exist in the promoted
>> standby's pg_xlog in my test case.
>>
>> Attached is a small script I have used to reproduce the failure.
>
>
> Right, thanks! It should be fixed in the attached version of patch.

So, instead of a code review, I have been torturing your patch and did
advanced tests on it with the attached script, that creates a cluster
as follows:
 master (5432)
  /        \
 1 (5433)   2 (5434)
 |
 3 (5435)
Once cluster is created, nodes are promoted in a certain order giving
them different timeline jump properties:
- master, stays on tli 1
- standby 1, tli 1->2
- standby 2, tli 1->3
- standby 3, tli 1->2->4
And data is inserted on each of them to make WAL fork properly.
Finally the script tries to rewind one node using another node as
source, and then tries to link this target node back to the source
node via streaming replication.

I have tested all the one-one permutations possible in the structure
above (see commented portions at the bottom of my script), and all of
them worked. I have to say that from the testing prospective this
patch looks in great shape, and will greatly improve the use cases of
pg_rewind!

Great! Many thanks for your testing! 

I am planning to do as well a detailed code review rather soon.

Good, thanks.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:


On Wed, Sep 9, 2015 at 3:27 AM, Alexander Korotkov wrote:
On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier wrote:
I am planning to do as well a detailed code review rather soon.

Good, thanks.

When testing a bit more complex structures, it occurred to me that we may want as well to use as a source node a standby. For example based on the case of my cluster above:
 master (5432)
  /              \
 1 (5433)   2 (5434)
 |
 3 (5435)
If I try for example to rebuild the cluster as follows there will be failures:
1) Rewind with source = 3, target = 2
2) Start 3 and 2
3) Shutdown 2
3) Rewind source = 2, target = 1, failure with:
source data directory must be shut down cleanly

It seems to me that we should allow a node that has been shutdowned in recovery to be used as a source for rewind as well, as follows:
-       if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED)
+       if (datadir_source &&
+               ControlFile_source.state != DB_SHUTDOWNED &&
+               ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY)
                pg_fatal("source data directory must be shut down cleanly\n");
At least your patch justifies in my eyes such a change.

 /*
+ * Find minimum from two xlog pointers assuming invalid pointer is greatest
+ * possible pointer.
+ */
+static XLogRecPtr
+xlPtrMin(XLogRecPtr a, XLogRecPtr b)
+{
+       if (XLogRecPtrIsInvalid(a))
+               return b;
+       else if (XLogRecPtrIsInvalid(b))
+               return a;
+       else
+               return Min(a, b);
+}
This is true as timeline.h tells so, still I think that it would be better to mention that this is this assumption is held in this header file, or simply that timeline history entries at the top have their end position set as InvalidXLogRecPtr which is a synonym of infinity.

The code building the target history file is a duplicate of what is done with the source. Perhaps we could refactor that as a single routine in pg_rewind.c?

Except that, the patch looks pretty neat to me. I was wondering as well: what tests did you run up to now with this patch? I am attaching an updated version of my test script I used for some more complex scenarios. Feel free to play with it.
--
Michael
Attachment

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier <michael.paquier@gmail.com> wrote:


On Wed, Sep 9, 2015 at 3:27 AM, Alexander Korotkov wrote:
On Tue, Sep 8, 2015 at 10:28 AM, Michael Paquier wrote:
I am planning to do as well a detailed code review rather soon.

Good, thanks.

When testing a bit more complex structures, it occurred to me that we may want as well to use as a source node a standby. For example based on the case of my cluster above:
 master (5432)
  /              \
 1 (5433)   2 (5434)
 |
 3 (5435)
If I try for example to rebuild the cluster as follows there will be failures:
1) Rewind with source = 3, target = 2
2) Start 3 and 2
3) Shutdown 2
3) Rewind source = 2, target = 1, failure with:
source data directory must be shut down cleanly

It seems to me that we should allow a node that has been shutdowned in recovery to be used as a source for rewind as well, as follows:
-       if (datadir_source && ControlFile_source.state != DB_SHUTDOWNED)
+       if (datadir_source &&
+               ControlFile_source.state != DB_SHUTDOWNED &&
+               ControlFile_source.state != DB_SHUTDOWNED_IN_RECOVERY)
                pg_fatal("source data directory must be shut down cleanly\n");
At least your patch justifies in my eyes such a change.
 
It's source is not required to be √ in recovery if we specify it by connection string. This is why I think DB_SHUTDOWNED_IN_RECOVERY is OK when we specify source by data directory. Included in attached patch.

 /*
+ * Find minimum from two xlog pointers assuming invalid pointer is greatest
+ * possible pointer.
+ */
+static XLogRecPtr
+xlPtrMin(XLogRecPtr a, XLogRecPtr b)
+{
+       if (XLogRecPtrIsInvalid(a))
+               return b;
+       else if (XLogRecPtrIsInvalid(b))
+               return a;
+       else
+               return Min(a, b);
+}
This is true as timeline.h tells so, still I think that it would be better to mention that this is this assumption is held in this header file, or simply that timeline history entries at the top have their end position set as InvalidXLogRecPtr which is a synonym of infinity.

Agree. Comment is adjusted.
 
The code building the target history file is a duplicate of what is done with the source. Perhaps we could refactor that as a single routine in pg_rewind.c?
 
Do you mean duplication between rewind_parseTimeLineHistory() and readTimeLineHistory(). We could put readTimeLineHistory() into common with some refactoring. This is the subject of separate patch, though. 

Except that, the patch looks pretty neat to me. I was wondering as well: what tests did you run up to now with this patch? I am attaching an updated version of my test script I used for some more complex scenarios. Feel free to play with it.

I was running manual tests like a noob :) When you attached you bash script, I've switched to it.

A also added additional check in findCommonAncestorTimeline(). Two standbys could be independently promoted and get the same new timeline ID. Now, it's checked that timelines, that we assume to be the same, have equal begins. Begins could still match by coincidence. But the same risk exists in original pg_rewind, though.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier wrote:
>> The code building the target history file is a duplicate of what is done
>> with the source. Perhaps we could refactor that as a single routine in
>> pg_rewind.c?
>
>
> Do you mean duplication between rewind_parseTimeLineHistory() and
> readTimeLineHistory(). We could put readTimeLineHistory() into common with
> some refactoring. This is the subject of separate patch, though.

Well, there is that :) But I was referring to this part (beware of
useless newlines in your code btw):
+       if (targettli == 1)       {
-               TimeLineHistoryEntry *entry = &sourceHistory[i];
+               targetHistory = (TimeLineHistoryEntry *)
pg_malloc(sizeof(TimeLineHistoryEntry));
+               targetHistory->tli = targettli;
+               targetHistory->begin = targetHistory->end = InvalidXLogRecPtr;
+               targetNentries = 1;
+
+       }
Your patch generates a timeline history array for the target node, and
HEAD does exactly the same for the source node, so IMO your patch is
duplicating code, the only difference being that fetchFile is used for
the source history file and slurpFile is used for the target history
file. Up to now the code duplication caused by the difference that the
target is always a local instance, and the source may be either local
or remote has created the interface that we have now, but I think that
we should refactor fetch.c such as the routines it contains do not
rely anymore on datadir_source, and use instead a datadir argument. If
datadir is NULL, the remote code path is used. If it is not NULL,
local code path is used. This way we can make the target node use
fetchFile only, and remove the duplication your patch is adding, as
well as future ones like that. Thoughts?

>> Except that, the patch looks pretty neat to me. I was wondering as well:
>> what tests did you run up to now with this patch? I am attaching an updated
>> version of my test script I used for some more complex scenarios. Feel free
>> to play with it.
>
> I was running manual tests like a noob :) When you attached you bash script,
> I've switched to it.

[product_placement]The patch to improve the recovery test suite
submitted in this CF would have eased the need of bash-ing test cases
here, and we could have test cases directly included in your
patch.[/product_placement]

> A also added additional check in findCommonAncestorTimeline(). Two standbys
> could be independently promoted and get the same new timeline ID. Now, it's
> checked that timelines, that we assume to be the same, have equal begins.
> Begins could still match by coincidence. But the same risk exists in
> original pg_rewind, though.

Really? pg_rewind blocks attempts to rewind two nodes that are already
on the same timeline before checking if their timeline history map at
some point or not:       /*        * If both clusters are already on the same timeline, there's nothing to        * do.
      */       if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
 
ControlFile_source.checkPointCopy.ThisTimeLineID)               pg_fatal("source and target cluster are on the same
timeline\n");
And this seems really justified to me. Imagine that you have one
master, with two standbys linked to it. If both standbys are promoted
to the same timeline, you could easily replug them to the master, but
I fail to see the point to be able to replug one promoted standby with
the other in this case: those nodes have segment and history files
that map with each other, an operator could easily mess up things in
such a configuration.
Regards,
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Alvaro Herrera
Date:
Michael Paquier wrote:
> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:

> > I was running manual tests like a noob :) When you attached you bash script,
> > I've switched to it.
> 
> [product_placement]The patch to improve the recovery test suite
> submitted in this CF would have eased the need of bash-ing test cases
> here, and we could have test cases directly included in your
> patch.[/product_placement]

The problem of bash test scripts is that they don't run on Windows,
which is why we settled on Perl-based TAP tests.  I think having tests
for various aspects of recovery is useful, because being relatively new
it's still being hacked in various ways and it's good to ensure that
things that were previously working continue to work.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> On Wed, Sep 9, 2015 at 9:01 AM, Michael Paquier wrote:
>> The code building the target history file is a duplicate of what is done
>> with the source. Perhaps we could refactor that as a single routine in
>> pg_rewind.c?
>
>
> Do you mean duplication between rewind_parseTimeLineHistory() and
> readTimeLineHistory(). We could put readTimeLineHistory() into common with
> some refactoring. This is the subject of separate patch, though.

Well, there is that :) But I was referring to this part (beware of
useless newlines in your code btw):
+       if (targettli == 1)
        {
-               TimeLineHistoryEntry *entry = &sourceHistory[i];
+               targetHistory = (TimeLineHistoryEntry *)
pg_malloc(sizeof(TimeLineHistoryEntry));
+               targetHistory->tli = targettli;
+               targetHistory->begin = targetHistory->end = InvalidXLogRecPtr;
+               targetNentries = 1;
+
+       }
Your patch generates a timeline history array for the target node, and
HEAD does exactly the same for the source node, so IMO your patch is
duplicating code, the only difference being that fetchFile is used for
the source history file and slurpFile is used for the target history
file. Up to now the code duplication caused by the difference that the
target is always a local instance, and the source may be either local
or remote has created the interface that we have now, but I think that
we should refactor fetch.c such as the routines it contains do not
rely anymore on datadir_source, and use instead a datadir argument. If
datadir is NULL, the remote code path is used. If it is not NULL,
local code path is used. This way we can make the target node use
fetchFile only, and remove the duplication your patch is adding, as
well as future ones like that. Thoughts?

OK, I get it. I tried to get rid of code duplication in the attached patch. There is getTimelineHistory() function which gets control file as argument and fetches history from appropriate path.
 
>> Except that, the patch looks pretty neat to me. I was wondering as well:
>> what tests did you run up to now with this patch? I am attaching an updated
>> version of my test script I used for some more complex scenarios. Feel free
>> to play with it.
>
> I was running manual tests like a noob :) When you attached you bash script,
> I've switched to it.

[product_placement]The patch to improve the recovery test suite
submitted in this CF would have eased the need of bash-ing test cases
here, and we could have test cases directly included in your
patch.[/product_placement]

> A also added additional check in findCommonAncestorTimeline(). Two standbys
> could be independently promoted and get the same new timeline ID. Now, it's
> checked that timelines, that we assume to be the same, have equal begins.
> Begins could still match by coincidence. But the same risk exists in
> original pg_rewind, though.

Really? pg_rewind blocks attempts to rewind two nodes that are already
on the same timeline before checking if their timeline history map at
some point or not:
        /*
         * If both clusters are already on the same timeline, there's nothing to
         * do.
         */
        if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
ControlFile_source.checkPointCopy.ThisTimeLineID)
                pg_fatal("source and target cluster are on the same
timeline\n");
And this seems really justified to me. Imagine that you have one
master, with two standbys linked to it. If both standbys are promoted
to the same timeline, you could easily replug them to the master, but
I fail to see the point to be able to replug one promoted standby with
the other in this case: those nodes have segment and history files
that map with each other, an operator could easily mess up things in
such a configuration.

Imagine following configuration of server.
  1
 / \
2   3
    |
    4

Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4 is cascaded standby for node 3.
Then node 2 and node 3 are promoted. They both got timeline number 2. Then node 3 is promoted and gets timeline number 3.
Then we could try to rewind node 4 with node 2 as source. How pg_rewind could know that timeline number 2 for those nodes is not the same?
We can only check that those timelines are forked from timeline 1 at the same place. But coincidence is still possible.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Wed, Sep 16, 2015 at 7:47 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
> A also added additional check in findCommonAncestorTimeline(). Two standbys
> could be independently promoted and get the same new timeline ID. Now, it's
> checked that timelines, that we assume to be the same, have equal begins.
> Begins could still match by coincidence. But the same risk exists in
> original pg_rewind, though.

Really? pg_rewind blocks attempts to rewind two nodes that are already
on the same timeline before checking if their timeline history map at
some point or not:
        /*
         * If both clusters are already on the same timeline, there's nothing to
         * do.
         */
        if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
ControlFile_source.checkPointCopy.ThisTimeLineID)
                pg_fatal("source and target cluster are on the same
timeline\n");
And this seems really justified to me. Imagine that you have one
master, with two standbys linked to it. If both standbys are promoted
to the same timeline, you could easily replug them to the master, but
I fail to see the point to be able to replug one promoted standby with
the other in this case: those nodes have segment and history files
that map with each other, an operator could easily mess up things in
such a configuration.

Imagine following configuration of server.
  1
 / \
2   3
    |
    4

Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4 is cascaded standby for node 3.
Then node 2 and node 3 are promoted. They both got timeline number 2. Then node 3 is promoted and gets timeline number 3.
Then we could try to rewind node 4 with node 2 as source. How pg_rewind could know that timeline number 2 for those nodes is not the same?
We can only check that those timelines are forked from timeline 1 at the same place. But coincidence is still possible.

BTW, it would be an option to generate system_identifier to each new timeline, by analogy of initdb do for the whole WAL.
Having such system_identifiers we can distinguish different timeline which have assigned same ids.
Any thoughts?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
> BTW, it would be an option to generate system_identifier to each new
> timeline, by analogy of initdb do for the whole WAL.
> Having such system_identifiers we can distinguish different timeline which
> have assigned same ids.
> Any thoughts?

If you mean a new field incorporated in XLogLongPageHeader and
ControlFile to ensure that a new timeline generated is unique across
the same installation identified with system_identifier, then I'm not
against it for something up to 2 bytes (even 1 byte would be fine),
though it seems that there is a very narrow need for it. Do you have
cases in mind that could use it? Even in the case of pg_rewind, it
seems to me we would not need this extra timeline-based ID. Per your
case above, it is possible to rewind node 4 on timeline 3 using node 2
on timeline 2 when both timelines have forked exactly at the same
point from timeline 1, by having node 4 beginning to replay from
timeline 1 (last checkpoint record before WAL forked), and if I am
reading your patch correctly that's what you do.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Wed, Sep 16, 2015 at 9:47 AM, Alexander Korotkov wrote:
> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier wrote:
> OK, I get it. I tried to get rid of code duplication in the attached patch.
> There is getTimelineHistory() function which gets control file as argument
> and fetches history from appropriate path.

Cool, thanks!

> Imagine following configuration of server.
>   1
>  / \
> 2   3
>     |
>     4
>
> Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4
> is cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both got timeline number 2. Then
> node 3 is promoted and gets timeline number 3.

Typo. You mean node 4 getting timeline 3 here.

> Then we could try to rewind node 4 with node 2 as source. How pg_rewind
> could know that timeline number 2 for those nodes is not the same?
> We can only check that those timelines are forked from timeline 1 at the
> same place. But coincidence is still possible.

OK, I see your point and you are right. This additional check allows
pg_rewind to switch one timeline back and make the scan of blocks
begin at the real origin of both timelines. I had in mind the case
where you tried to actually rewind node 2 to node 3 actually which
would not be possible with your patch, and by thinking more about that
I think that it could be possible to remove the check I am listing
below and rely on the checks in the history files, basically what is
in findCommonAncestorTimeline:       if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
ControlFile_source.checkPointCopy.ThisTimeLineID)              pg_fatal("source and target cluster are on the same
 
timeline\n");
Alexander, what do you think about that? I think that we should be
able to rewind with for example node 2 as target and node 3 as source,
and vice-versa as per the example you gave even if both nodes are on
the same timeline, just that they forked at a different point. Could
you test this particular case? Using my script, simply be sure to
switch archive_mode to on/off depending on the node, aka only 3 and 4
do archiving.

+               /*
+                * Since incomplete segments are copied into next
timelines, find the
+                * lastest timeline holding required segment. Assuming
we can move
+                * in xlog both forward and backward, consider also
switching timeline
+                * back.
+                */
s/lastest/correct? "lastest" does sound very English to me even if
that's grammatically correct.

+ * Find minimum from two xlog pointers assuming invalid pointer (0) means
+ * infinity as timeline.h states.
This needs an update, now timeline.h points out correctly that this is
not 0, but InvalidXLogRecPtr.

+       /* Retreive timelines for both source and target */
s/Retreive/Retrieve.

The refactoring of getTimelineHistory as you propose looks like a good
idea to me, I tried to remove by myself the difference between source
and target in copy_fetch.c and friends but this gets uglier,
particularly because of datadir_source in copy_file_range. Not worth
it.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Fri, Sep 18, 2015 at 12:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
>> BTW, it would be an option to generate system_identifier to each new
>> timeline, by analogy of initdb do for the whole WAL.
>> Having such system_identifiers we can distinguish different timeline which
>> have assigned same ids.
>> Any thoughts?
>
> If you mean a new field incorporated in XLogLongPageHeader and
> ControlFile to ensure that a new timeline generated is unique across
> the same installation identified with system_identifier, then I'm not
> against it for something up to 2 bytes (even 1 byte would be fine),

Er, 2 bytes may be a bad idea as well, 1/16k% chance of collision
looks dangerous when rewinding a node... It could cause silent data
corruption on the standby being rewounded.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
> The refactoring of getTimelineHistory as you propose looks like a good
> idea to me, I tried to remove by myself the difference between source
> and target in copy_fetch.c and friends but this gets uglier,
> particularly because of datadir_source in copy_file_range. Not worth
> it.

Forgot that:   if (ControlFile_target.state != DB_SHUTDOWNED)       pg_fatal("target server must be shut down
cleanly\n");
We may want to allow a target node shutdowned in recovery as well here.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Sat, Sep 19, 2015 at 8:27 AM, Michael Paquier wrote:
> On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
>> The refactoring of getTimelineHistory as you propose looks like a good
>> idea to me, I tried to remove by myself the difference between source
>> and target in copy_fetch.c and friends but this gets uglier,
>> particularly because of datadir_source in copy_file_range. Not worth
>> it.
>
> Forgot that:
>     if (ControlFile_target.state != DB_SHUTDOWNED)
>         pg_fatal("target server must be shut down cleanly\n");
> We may want to allow a target node shutdowned in recovery as well here.

So, attached is a more polished version of this patch, cleaned up of
its typos with as well other things. I have noticed for example that
it would be more useful to add the debug information of a timeline
file fetched from the source or a target server directly in
getTimelineHistory. I have as well updated a couple of comments in the
code regarding the fact that we do not necessarily use a master as a
target node, and mentioned in findCommonAncestorTimeline that we check
as well the start position of a timeline to cover the case where both
target and source node forked at the same timeline number but with a
different WAL fork position.
I am marking this patch as ready for committer. It would be cool in
the future to use the recovery test suite to have more advanced
scenarios tested, but it seems a shame to block this patch because of
that.
Regards,
--
Michael

Attachment

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Sat, Sep 19, 2015 at 2:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
OK, I see your point and you are right. This additional check allows
pg_rewind to switch one timeline back and make the scan of blocks
begin at the real origin of both timelines. I had in mind the case
where you tried to actually rewind node 2 to node 3 actually which
would not be possible with your patch, and by thinking more about that
I think that it could be possible to remove the check I am listing
below and rely on the checks in the history files, basically what is
in findCommonAncestorTimeline:
        if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
            ControlFile_source.checkPointCopy.ThisTimeLineID)
                pg_fatal("source and target cluster are on the same
timeline\n");
Alexander, what do you think about that? I think that we should be
able to rewind with for example node 2 as target and node 3 as source,
and vice-versa as per the example you gave even if both nodes are on
the same timeline, just that they forked at a different point. Could
you test this particular case? Using my script, simply be sure to
switch archive_mode to on/off depending on the node, aka only 3 and 4
do archiving.

I think relying on different fork point is not safe enough. Imagine more complex case.

  1
 / \
2   3
|   |
4   5

At first, nodes 2 and 3 are promoted at the same point and they both get timeline 2.
Then nodes 4 and 5 are promoted at different points and they both get timeline 3.
Then we can try to rewind node 4 with node 5 as the source or vice versa. In this case we can't find collision of timeline 2.

The same collision could happen even when source and target are on the different timeline number. However, having the on the same timeline numbers is suspicious enough to disallow it until we have additional checks.

I could propose following plan:
  1. Commit this patch without allowing rewind when target and source are on the same timelines.
  2. Make additional checks for distinguish different timelines with the same numbers.
  3. Allow rewind when target and source are on the same timelines.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Wed, Sep 30, 2015 at 9:46 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Sep 19, 2015 at 8:27 AM, Michael Paquier wrote:
> On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
>> The refactoring of getTimelineHistory as you propose looks like a good
>> idea to me, I tried to remove by myself the difference between source
>> and target in copy_fetch.c and friends but this gets uglier,
>> particularly because of datadir_source in copy_file_range. Not worth
>> it.
>
> Forgot that:
>     if (ControlFile_target.state != DB_SHUTDOWNED)
>         pg_fatal("target server must be shut down cleanly\n");
> We may want to allow a target node shutdowned in recovery as well here.

So, attached is a more polished version of this patch, cleaned up of
its typos with as well other things. I have noticed for example that
it would be more useful to add the debug information of a timeline
file fetched from the source or a target server directly in
getTimelineHistory. I have as well updated a couple of comments in the
code regarding the fact that we do not necessarily use a master as a
target node, and mentioned in findCommonAncestorTimeline that we check
as well the start position of a timeline to cover the case where both
target and source node forked at the same timeline number but with a
different WAL fork position.
I am marking this patch as ready for committer. It would be cool in
the future to use the recovery test suite to have more advanced
scenarios tested, but it seems a shame to block this patch because of
that.

Thanks a lot. Now patch looks much better.
I found that it doesn't applies cleanly on the current master. Rebased version is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Wed, Oct 14, 2015 at 6:00 PM, Alexander Korotkov wrote:
> On Sat, Sep 19, 2015 at 2:25 AM, Michael Paquier wrote:
>> Alexander, what do you think about that? I think that we should be
>> able to rewind with for example node 2 as target and node 3 as source,
>> and vice-versa as per the example you gave even if both nodes are on
>> the same timeline, just that they forked at a different point. Could
>> you test this particular case? Using my script, simply be sure to
>> switch archive_mode to on/off depending on the node, aka only 3 and 4
>> do archiving.
>
> I think relying on different fork point is not safe enough. Imagine more
> complex case.
>
>   1
>  / \
> 2   3
> |   |
> 4   5
>
> At first, nodes 2 and 3 are promoted at the same point and they both get
> timeline 2.
> Then nodes 4 and 5 are promoted at different points and they both get
> timeline 3.

You mean that those nodes get the history file generated at timeline 2
by either 2 or 3 here I guess.

> Then we can try to rewind node 4 with node 5 as the source or vice versa. In
> this case we can't find collision of timeline 2.
> The same collision could happen even when source and target are on the
> different timeline number. However, having the on the same timeline numbers
> is suspicious enough to disallow it until we have additional checks.

Check.

> I could propose following plan:
> Commit this patch without allowing rewind when target and source are on the
> same timelines.
> Make additional checks for distinguish different timelines with the same
> numbers.

That's the addition of the 8-byte timeline ID in the XLOG segment
header. We would need to store it in the control data file as well.

> Allow rewind when target and source are on the same timelines.

Check. This relies on the fact that a portion of the nodes select
their new timeline without knowing the previous history. I believe
that this does not exist much in a perfect world, but I've seen enough
setups messed up to conclude that this would surely exist on the
field. In short this plan looks fine to me. Your patch is very
valuable even with this same-timeline restriction in place.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Teodor Sigaev
Date:
Seems, patch is ready to commit. But it needs some documentation.

Alexander Korotkov wrote:
> On Wed, Sep 30, 2015 at 9:46 AM, Michael Paquier <michael.paquier@gmail.com
> <mailto:michael.paquier@gmail.com>> wrote:
>
>     On Sat, Sep 19, 2015 at 8:27 AM, Michael Paquier wrote:
>     > On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
>     >> The refactoring of getTimelineHistory as you propose looks like a good
>     >> idea to me, I tried to remove by myself the difference between source
>     >> and target in copy_fetch.c and friends but this gets uglier,
>     >> particularly because of datadir_source in copy_file_range. Not worth
>     >> it.
>     >
>     > Forgot that:
>     >     if (ControlFile_target.state != DB_SHUTDOWNED)
>     >         pg_fatal("target server must be shut down cleanly\n");
>     > We may want to allow a target node shutdowned in recovery as well here.
>
>     So, attached is a more polished version of this patch, cleaned up of
>     its typos with as well other things. I have noticed for example that
>     it would be more useful to add the debug information of a timeline
>     file fetched from the source or a target server directly in
>     getTimelineHistory. I have as well updated a couple of comments in the
>     code regarding the fact that we do not necessarily use a master as a
>     target node, and mentioned in findCommonAncestorTimeline that we check
>     as well the start position of a timeline to cover the case where both
>     target and source node forked at the same timeline number but with a
>     different WAL fork position.
>     I am marking this patch as ready for committer. It would be cool in
>     the future to use the recovery test suite to have more advanced
>     scenarios tested, but it seems a shame to block this patch because of
>     that.
>
>
> Thanks a lot. Now patch looks much better.
> I found that it doesn't applies cleanly on the current master. Rebased version
> is attached.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
> The Russian Postgres Company
>
>
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Seems, patch is ready to commit. But it needs some documentation.

Of what kind? The documentation of pg_rewind is rather explicit on the
subject and looks fine as-is, and that's what Alexander and I agreed
on upthread. If there is something forgotten though, this may be the
fact that we do not mention in 9.5's documentation that pg_rewind can
*not* handle timeline switches.
-- 
Michael



Re: Use pg_rewind when target timeline was switched

From
Alexander Korotkov
Date:
On Sat, Nov 28, 2015 at 2:27 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Seems, patch is ready to commit. But it needs some documentation.

Of what kind? The documentation of pg_rewind is rather explicit on the
subject and looks fine as-is, and that's what Alexander and I agreed
on upthread. If there is something forgotten though, this may be the
fact that we do not mention in 9.5's documentation that pg_rewind can
*not* handle timeline switches.

​However, we can add some explicit statements​ about new pg_rewind capabilities. Please, check attached patch for those statements.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: Use pg_rewind when target timeline was switched

From
Teodor Sigaev
Date:
Thank you, committed.

Alexander Korotkov wrote:
> On Sat, Nov 28, 2015 at 2:27 PM, Michael Paquier <michael.paquier@gmail.com
> <mailto:michael.paquier@gmail.com>> wrote:
>
>     On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev <teodor@sigaev.ru
>     <mailto:teodor@sigaev.ru>> wrote:
>     > Seems, patch is ready to commit. But it needs some documentation.
>
>     Of what kind? The documentation of pg_rewind is rather explicit on the
>     subject and looks fine as-is, and that's what Alexander and I agreed
>     on upthread. If there is something forgotten though, this may be the
>     fact that we do not mention in 9.5's documentation that pg_rewind can
>     *not* handle timeline switches.
>
>
> ​However, we can add some explicit statements​ about new pg_rewind capabilities.
> Please, check attached patch for those statements.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
> The Russian Postgres Company
>
>
>

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Use pg_rewind when target timeline was switched

From
Michael Paquier
Date:
On Wed, Dec 2, 2015 at 12:57 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, committed.

Youhou, thanks!
I still think that we should have a test case for this patch close to
the script I sent on this thread. I'll get into it once the
infrastructure patch for recovery regression tests gets in.
-- 
Michael