Thread: Corner-case bug in pg_rewind
Hi

Take the following cluster with:

- node1 (initial primary)
- node2 (standby)
- node3 (standby)

Following activity takes place (greatly simplified from a real-world situation):

1. node1 is shut down.
2. node3 is promoted.
3. node2 is attached to node3.
4. node1 is attached to node3.
5. node1 is then promoted (creating a split-brain situation with node1 and
   node3 as primaries).
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach to node1 as a
   standby. pg_rewind claims:

       pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
       pg_rewind: no rewind required

8. Based on that assurance, node2 is restarted with replication configuration
   pointing to node1 - but it is unable to attach, with node2's log reporting
   something like:

       new timeline 3 forked off current database system timeline 2 before current recovery point X/XXXXXXX

The cause is that pg_rewind assumes that if the node's last checkpoint matches
the divergence point, no rewind is needed:

    if (chkptendrec == divergerec)
        rewind_needed = false;

but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.

The attached patch addresses this. It includes a test, which doesn't make use
of the RewindTest module, as that hard-codes a primary and a standby, while
here three nodes are needed (I can't come up with a situation where this can
be reproduced with only two nodes). The test sets "wal_keep_size", so it would
need modification for Pg12 and earlier.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
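To make the proposed check concrete, here is a minimal sketch of the idea, assuming the variables chkptendrec, divergerec, rewind_needed and ControlFile_target from the pg_rewind code quoted above; it illustrates the reasoning and is not necessarily the exact code in the attached patch:

    /*
     * Sketch only: even if the target's last checkpoint ends exactly at the
     * divergence point, the target may have replayed WAL beyond it while
     * running as a standby, which the control file records as
     * minRecoveryPoint.  In that case a rewind is still required.
     */
    if (chkptendrec == divergerec &&
        ControlFile_target.minRecoveryPoint <= divergerec)
        rewind_needed = false;  /* nothing was replayed past the fork point */
    else
        rewind_needed = true;

The point is that minRecoveryPoint is the marker of how far WAL replay progressed on a standby, so it, rather than the last checkpoint alone, tells whether the target's history extends past the divergence point.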
On 11/09/2020 09:42, Ian Barwick wrote:
> Take the following cluster with:
> - node1 (initial primary)
> - node2 (standby)
> - node3 (standby)
>
> Following activity takes place (greatly simplified from a real-world situation):
>
> 1. node1 is shut down.
> 2. node3 is promoted.
> 3. node2 is attached to node3.
> 4. node1 is attached to node3.
> 5. node1 is then promoted (creating a split-brain situation with node1 and
>    node3 as primaries).
> 6. node2 and node3 are shut down (in that order).
> 7. pg_rewind is executed to reset node2 so it can reattach to node1 as a
>    standby. pg_rewind claims:
>
>        pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
>        pg_rewind: no rewind required
>
> 8. Based on that assurance, node2 is restarted with replication configuration
>    pointing to node1 - but it is unable to attach, with node2's log reporting
>    something like:
>
>        new timeline 3 forked off current database system timeline 2 before current recovery point X/XXXXXXX
>
> The cause is that pg_rewind assumes that if the node's last checkpoint matches
> the divergence point, no rewind is needed:
>
>     if (chkptendrec == divergerec)
>         rewind_needed = false;
>
> but in this case there *are* records beyond the last checkpoint, which can be
> inferred from "minRecoveryPoint" - but this is not checked.

Yep, I think you're right.

> The attached patch addresses this. It includes a test, which doesn't make use
> of the RewindTest module, as that hard-codes a primary and a standby, while
> here three nodes are needed (I can't come up with a situation where this can
> be reproduced with only two nodes). The test sets "wal_keep_size", so it would
> need modification for Pg12 and earlier.

I think we also need to change the extractPageMap() call:

>     /*
>      * Read the target WAL from last checkpoint before the point of fork, to
>      * extract all the pages that were modified on the target cluster after
>      * the fork. We can stop reading after reaching the final shutdown record.
>      * XXX: If we supported rewinding a server that was not shut down cleanly,
>      * we would need to replay until the end of WAL here.
>      */
>     if (showprogress)
>         pg_log_info("reading WAL in target");
>     extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
>                    ControlFile_target.checkPoint, restore_command);
>     filemap_finalize();

so that it scans all the way up to minRecoveryPoint, instead of stopping at
ControlFile_target.checkPoint. Otherwise, we will fail to rewind changes that
happened after the last checkpoint. And we need to do that regardless of the
"no rewind required" bug, any time we rewind a server that's in
DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the existing tests have
caught that. Am I missing something?

- Heikki
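As a rough illustration of the adjustment being suggested, the scan could be driven by whichever of the last checkpoint and minRecoveryPoint lies further ahead. This sketch reuses the identifiers from the excerpt above; "target_wal_endrec" is a hypothetical local variable introduced here for illustration, not necessarily what a later patch uses:

    /*
     * Sketch only: read the target's WAL up to minRecoveryPoint rather than
     * stopping at the last checkpoint, so that pages modified while the
     * target replayed WAL as a standby (DB_SHUTDOWNED_IN_RECOVERY) are also
     * picked up.
     */
    XLogRecPtr  target_wal_endrec;

    if (ControlFile_target.minRecoveryPoint > ControlFile_target.checkPoint)
        target_wal_endrec = ControlFile_target.minRecoveryPoint;
    else
        target_wal_endrec = ControlFile_target.checkPoint;

    if (showprogress)
        pg_log_info("reading WAL in target");
    extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
                   target_wal_endrec, restore_command);
    filemap_finalize();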
I put some effort into reviewing your patch, which seems legitimate to me.
I think some minor things could be improved, i.e.:
1. The comment regarding:
------
347  * Check for the possibility that the target is in fact a direct
348  * ancestor of the source. In that case, there is no divergent history
349  * in the target that needs rewinding.
------
would be better reformulated, as the block's contents mostly cover the opposite case, of the target NOT being a direct ancestor of the source. Maybe: "We need a rewind in cases when ... and don't need one only if the target is a direct ancestor of the source." I think it would be more understandable either as a comment describing all the cases handled in the block, or with no comment before the block at all and instead a comment on each if/else case inside it.
2. When I run the test without patching pg_rewind.c, I get the following output:
-----
# Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from last common checkpoint at)/'
# at t/007_min_recovery_point.pl line 107.
# 'pg_rewind: servers diverged at WAL location 0/3000180 on timeline 2
# pg_rewind: no rewind required
# '
# doesn't match '(?^:rewinding from last common checkpoint at)'
# Looks like you failed 1 test of 2.
t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests
Test Summary Report
-------------------
t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1
-------
Maybe it can just give "failed" without so many details?
Also, I think Heikki's notion could be fulfilled.
Apart from this, I consider the patch clean and clear, the tests pass, and I'd recommend it for commit after the minor improvements described above.
Thank you!
--
On Tue, 10 Nov 2020 at 18:07, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> I put some effort into reviewing your patch, which seems legitimate to me.

Thanks for the review and feedback.

> I think some minor things could be improved, i.e.:
>
> 1. The comment regarding:
> ------
> 347  * Check for the possibility that the target is in fact a direct
> 348  * ancestor of the source. In that case, there is no divergent history
> 349  * in the target that needs rewinding.
> ------
> would be better reformulated, as the block's contents mostly cover the
> opposite case, of the target NOT being a direct ancestor of the source.
> Maybe: "We need a rewind in cases when ... and don't need one only if the
> target is a direct ancestor of the source." I think it would be more
> understandable either as a comment describing all the cases handled in the
> block, or with no comment before the block at all and instead a comment on
> each if/else case inside it.

The comment you cite is not part of this patch. I suspect there might be some
scope for improving the comments in general, but that would need to be done
separately.

> 2. When I run the test without patching pg_rewind.c, I get the following output:
> -----
> # Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from last common checkpoint at)/'
> # at t/007_min_recovery_point.pl line 107.
> # 'pg_rewind: servers diverged at WAL location 0/3000180 on timeline 2
> # pg_rewind: no rewind required
> # '
> # doesn't match '(?^:rewinding from last common checkpoint at)'
> # Looks like you failed 1 test of 2.
> t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/2 subtests
>
> Test Summary Report
> -------------------
> t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
>   Failed test: 2
>   Non-zero exit status: 1
> -------
> Maybe it can just give "failed" without so many details?

That's just the way the tests work - an individual test has no influence over
that output.

> Also, I think Heikki's notion could be fulfilled.

I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.

Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.

Regards

Ian Barwick

--
EnterpriseDB: https://www.enterprisedb.com
>> 1. The comment regarding:
>> ------
>> 347  * Check for the possibility that the target is in fact a direct
>> 348  * ancestor of the source. In that case, there is no divergent history
>> 349  * in the target that needs rewinding.
>> ------
>> would be better reformulated, as the block's contents mostly cover the
>> opposite case, of the target NOT being a direct ancestor of the source.
>> Maybe: "We need a rewind in cases when ... and don't need one only if the
>> target is a direct ancestor of the source." I think it would be more
>> understandable either as a comment describing all the cases handled in the
>> block, or with no comment before the block at all and instead a comment on
>> each if/else case inside it.
>
> The comment you cite is not part of this patch. I suspect there might be some
> scope for improving the comments in general, but that would need to be done
> separately.
Sure, but it is the comment describing the whole block that the patch changes. If it could be made more relevant, and since you are planning to revise the patch anyway, it would be great to change it as well. But I don't insist.
> Note that the patch may require reworking for HEAD due to changes in
> commit 9c4f5192f6. I'll try to take another look this week.
Thank you!
--
On 16.11.2020 05:49, Ian Lawrence Barwick wrote:
> Note that the patch may require reworking for HEAD due to changes in
> commit 9c4f5192f6. I'll try to take another look this week.
>
> Regards
>
> Ian Barwick
>
> --
> EnterpriseDB: https://www.enterprisedb.com

Status update for a commitfest entry.

The patch has been Waiting on Author for some time. As this is a bug fix, I am
moving it to the next CF.

Ian, are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Status update for a commitfest entry.
>
> The patch has been Waiting on Author for some time. As this is a bug fix,
> I am moving it to the next CF.
>
> Ian, are you planning to continue working on it?
As a reviewer, I consider the patch useful and good overall. The comments I left were purely cosmetic. It's a pity that this bugfix has been delayed for such a small reason and has become outdated as a result. It would be nice to complete it in the next CF.
--
On 01/12/2020 16:52, Pavel Borisov wrote:
>> Status update for a commitfest entry.
>>
>> The patch has been Waiting on Author for some time. As this is a bug fix,
>> I am moving it to the next CF.
>>
>> Ian, are you planning to continue working on it?
>
> As a reviewer, I consider the patch useful and good overall. The comments
> I left were purely cosmetic. It's a pity that this bugfix has been delayed
> for such a small reason and has become outdated as a result. It would be
> nice to complete it in the next CF.

Yeah, we really should fix this..

On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
>> Also, I think Heikki's notion could be fulfilled.
>
> I spent a bit of time looking at that suggestion but couldn't actually
> verify it was an issue which needed fixing.

Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.

- Heikki
On 02/12/2020 20:13, Heikki Linnakangas wrote:
> On 01/12/2020 16:52, Pavel Borisov wrote:
>>> Status update for a commitfest entry.
>>>
>>> The patch has been Waiting on Author for some time. As this is a bug fix,
>>> I am moving it to the next CF.
>>>
>>> Ian, are you planning to continue working on it?
>>
>> As a reviewer, I consider the patch useful and good overall. The comments
>> I left were purely cosmetic. It's a pity that this bugfix has been delayed
>> for such a small reason and has become outdated as a result. It would be
>> nice to complete it in the next CF.
>
> Yeah, we really should fix this..
>
> On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
>>> Also, I think Heikki's notion could be fulfilled.
>>
>> I spent a bit of time looking at that suggestion but couldn't actually
>> verify it was an issue which needed fixing.
>
> Attached are two patches. The first patch is your original patch, unmodified
> (except for a cosmetic rename of the test file). The second patch builds on
> that, demonstrating and fixing the issue I mentioned. It took me a while to
> create a repro for it, it's easily masked by incidental full-page writes or
> because rows created by XIDs that are not marked as committed on the other
> timeline are invisible, but succeeded at last.

Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/12/2020 15:26, Ian Barwick wrote:
> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>> Attached are two patches. The first patch is your original patch, unmodified
>> (except for a cosmetic rename of the test file). The second patch builds on
>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>> create a repro for it, it's easily masked by incidental full-page writes or
>> because rows created by XIDs that are not marked as committed on the other
>> timeline are invisible, but succeeded at last.
>
> Aha, many thanks. I wasn't entirely sure what I was looking for there and
> recently haven't had the time or energy to dig any further.

Ok, pushed and backpatched this now. Thanks!

- Heikki
> Ok, pushed and backpatched this now.
Very nice!
Thanks to you all!
On 03/12/2020 16:49, Pavel Borisov wrote:
>> Ok, pushed and backpatched this now.
>
> Very nice!
> Thanks to you all!

Thanks for the review, Pavel! I just realized that I forgot to credit you in
the commit message. I'm sorry.

- Heikki
On Thu, 3 Dec 2020 at 19:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 03/12/2020 16:49, Pavel Borisov wrote:
>>> Ok, pushed and backpatched this now.
>>
>> Very nice!
>> Thanks to you all!
>
> Thanks for the review, Pavel! I just realized that I forgot to credit
> you in the commit message. I'm sorry.
Don't worry, Heikki. No problem.
--
On 03/12/2020 16:10, Heikki Linnakangas wrote:
> On 02/12/2020 15:26, Ian Barwick wrote:
>> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>>> Attached are two patches. The first patch is your original patch, unmodified
>>> (except for a cosmetic rename of the test file). The second patch builds on
>>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>>> create a repro for it, it's easily masked by incidental full-page writes or
>>> because rows created by XIDs that are not marked as committed on the other
>>> timeline are invisible, but succeeded at last.
>>
>> Aha, many thanks. I wasn't entirely sure what I was looking for there and
>> recently haven't had the time or energy to dig any further.
>
> Ok, pushed and backpatched this now.

The buildfarm is reporting sporadic failures in the new regression test.
I suspect it's because of timing issues, where a server is promoted or
shut down before some data has been replicated. I'll fix that tomorrow
morning.

- Heikki
On 04/12/2020 00:16, Heikki Linnakangas wrote:
> On 03/12/2020 16:10, Heikki Linnakangas wrote:
>> On 02/12/2020 15:26, Ian Barwick wrote:
>>> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>>>> Attached are two patches. The first patch is your original patch, unmodified
>>>> (except for a cosmetic rename of the test file). The second patch builds on
>>>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>>>> create a repro for it, it's easily masked by incidental full-page writes or
>>>> because rows created by XIDs that are not marked as committed on the other
>>>> timeline are invisible, but succeeded at last.
>>>
>>> Aha, many thanks. I wasn't entirely sure what I was looking for there and
>>> recently haven't had the time or energy to dig any further.
>>
>> Ok, pushed and backpatched this now.
>
> The buildfarm is reporting sporadic failures in the new regression test.
> I suspect it's because of timing issues, where a server is promoted or
> shut down before some data has been replicated. I'll fix that tomorrow
> morning.

Fixed, I hope. It took me a while to backpatch, because small differences were
needed in almost all versions, as some helpful TAP test helpers, like waiting
for a standby to catch up, are not available in backbranches.

There was one curious difference between versions 9.6 and 10. In v10, you can
perform a "clean switchover" like this:

1. Shut down the primary (A) with "pg_ctl -m fast".
2. Promote the standby (B) with "pg_ctl promote".
3. Reconfigure the old primary (A) as a standby, by creating a recovery.conf
   that points to the promoted server, and start it up.

But on 9.6, that leads to an error on the repurposed primary server (A):

    LOG:  primary server contains no more WAL on requested timeline 1
    LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/30000A0

It's not clear to me why that is. It seems that the primary generates some WAL
at shutdown that doesn't get replicated before the shutdown completes, or the
standby doesn't replay that WAL before it's promoted. But we have supported
"clean switchover" since 9.1, see commit 985bd7d497: when you shut down the
primary, it should wait until all the WAL has been replicated, including the
shutdown checkpoint.

Perhaps I was just doing it wrong in the test, or maybe there's a genuine bug
there that was fixed in v10. I worked around it in the test by re-initializing
the old primary from a backup instead of just reconfiguring it as a standby.
That's good enough for this particular test, so I'm not planning to dig deeper
into that myself.

- Heikki