Thread: Corner-case bug in pg_rewind
Hi

Take the following cluster with:

- node1 (initial primary)
- node2 (standby)
- node3 (standby)

Following activity takes place (greatly simplified from a real-world situation):

1. node1 is shut down.
2. node3 is promoted.
3. node2 is attached to node3.
4. node1 is attached to node3.
5. node1 is then promoted (creating a split-brain situation with node1 and
   node3 as primaries).
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach to node1 as a
   standby. pg_rewind claims:

       pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
       pg_rewind: no rewind required

8. Based on that assurance, node2 is restarted with replication configuration
   pointing to node1 - but it is unable to attach, with node2's log reporting
   something like:

       new timeline 3 forked off current database system timeline 2 before current recovery point X/XXXXXXX

The cause is that pg_rewind assumes that if the node's last checkpoint matches
the divergence point, no rewind is needed:

    if (chkptendrec == divergerec)
        rewind_needed = false;

but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.

The attached patch addresses this. It includes a test, which doesn't make use
of the RewindTest module, as that hard-codes a primary and a standby, while
here three nodes are needed (I can't come up with a situation where this can
be reproduced with only two nodes). The test sets "wal_keep_size", so it would
need modification for Pg12 and earlier.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
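To make the proposed check concrete, here is a minimal sketch of the idea, assuming the variables chkptendrec, divergerec, rewind_needed and ControlFile_target from the pg_rewind code quoted above; it illustrates the reasoning and is not necessarily the exact code in the attached patch:

    /*
     * Sketch only: even if the target's last checkpoint ends exactly at the
     * divergence point, the target may have replayed WAL beyond it while
     * running as a standby, which the control file records as
     * minRecoveryPoint.  In that case a rewind is still required.
     */
    if (chkptendrec == divergerec &&
        ControlFile_target.minRecoveryPoint <= divergerec)
        rewind_needed = false;  /* nothing was replayed past the fork point */
    else
        rewind_needed = true;

The point is that minRecoveryPoint is the marker of how far WAL replay progressed on a standby, so it, rather than the last checkpoint alone, tells whether the target's history extends past the divergence point.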
On 11/09/2020 09:42, Ian Barwick wrote:
> Take the following cluster with:
> - node1 (initial primary)
> - node2 (standby)
> - node3 (standby)
>
> Following activity takes place (greatly simplified from a real-world situation):
>
> 1. node1 is shut down.
> 2. node3 is promoted.
> 3. node2 is attached to node3.
> 4. node1 is attached to node3.
> 5. node1 is then promoted (creating a split-brain situation with node1 and
>    node3 as primaries).
> 6. node2 and node3 are shut down (in that order).
> 7. pg_rewind is executed to reset node2 so it can reattach to node1 as a
>    standby. pg_rewind claims:
>
>        pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
>        pg_rewind: no rewind required
>
> 8. Based on that assurance, node2 is restarted with replication configuration
>    pointing to node1 - but it is unable to attach, with node2's log reporting
>    something like:
>
>        new timeline 3 forked off current database system timeline 2 before current recovery point X/XXXXXXX
>
> The cause is that pg_rewind assumes that if the node's last checkpoint matches
> the divergence point, no rewind is needed:
>
>     if (chkptendrec == divergerec)
>         rewind_needed = false;
>
> but in this case there *are* records beyond the last checkpoint, which can be
> inferred from "minRecoveryPoint" - but this is not checked.

Yep, I think you're right.

> The attached patch addresses this. It includes a test, which doesn't make use
> of the RewindTest module, as that hard-codes a primary and a standby, while
> here three nodes are needed (I can't come up with a situation where this can
> be reproduced with only two nodes). The test sets "wal_keep_size", so it would
> need modification for Pg12 and earlier.

I think we also need to change the extractPageMap() call:

>     /*
>      * Read the target WAL from last checkpoint before the point of fork, to
>      * extract all the pages that were modified on the target cluster after
>      * the fork. We can stop reading after reaching the final shutdown record.
>      * XXX: If we supported rewinding a server that was not shut down cleanly,
>      * we would need to replay until the end of WAL here.
>      */
>     if (showprogress)
>         pg_log_info("reading WAL in target");
>     extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
>                    ControlFile_target.checkPoint, restore_command);
>     filemap_finalize();

so that it scans all the way up to minRecoveryPoint, instead of stopping at
ControlFile_target.checkPoint. Otherwise, we will fail to rewind changes that
happened after the last checkpoint. And we need to do that regardless of the
"no rewind required" bug, any time we rewind a server that's in
DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the existing tests have
caught that. Am I missing something?

- Heikki
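As a rough illustration of the adjustment being suggested, the scan could be driven by whichever of the last checkpoint and minRecoveryPoint lies further ahead. This sketch reuses the identifiers from the excerpt above; "target_wal_endrec" is a hypothetical local variable introduced here for illustration, not necessarily what a later patch uses:

    /*
     * Sketch only: read the target's WAL up to minRecoveryPoint rather than
     * stopping at the last checkpoint, so that pages modified while the
     * target replayed WAL as a standby (DB_SHUTDOWNED_IN_RECOVERY) are also
     * picked up.
     */
    XLogRecPtr  target_wal_endrec;

    if (ControlFile_target.minRecoveryPoint > ControlFile_target.checkPoint)
        target_wal_endrec = ControlFile_target.minRecoveryPoint;
    else
        target_wal_endrec = ControlFile_target.checkPoint;

    if (showprogress)
        pg_log_info("reading WAL in target");
    extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
                   target_wal_endrec, restore_command);
    filemap_finalize();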
I put some effort into reviewing your patch, which seems legitimate to me.
I think some minor things could be improved, i.e.:
1. The comment regarding:
------
347  * Check for the possibility that the target is in fact a direct
348  * ancestor of the source. In that case, there is no divergent history
349  * in the target that needs rewinding.
------
would be better reformulated, as the block's contents mostly cover the opposite case, of the target NOT being a direct ancestor of the source. Maybe: "We need a rewind in cases when ... and don't need one only if the target is a direct ancestor of the source." I think it would be more understandable either as a comment describing all the cases handled in the block, or with no comment before the block at all and instead a comment on each if/else case inside it.
2. When I run the test without patching pg_rewind.c, I get the following output:
-----
# Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from last common checkpoint at)/'
# at t/007_min_recovery_point.pl line 107.
# 'pg_rewind: servers diverged at WAL location 0/3000180 on timeline 2
# pg_rewind: no rewind required
# '
# doesn't match '(?^:rewinding from last common checkpoint at)'
# Looks like you failed 1 test of 2.
t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests
Test Summary Report
-------------------
t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
Failed test: 2
Non-zero exit status: 1
-------
Maybe it can just give "failed" without so many details?
Also, I think Heikki's notion could be fulfilled.
Apart from this, I consider the patch clean and clear, the tests pass, and I'd recommend it for commit after the minor improvements described above.
Thank you!
--
On Tue, 10 Nov 2020 at 18:07, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> I put some effort into reviewing your patch, which seems legitimate to me.

Thanks for the review and feedback.

> I think some minor things could be improved, i.e.:
>
> 1. The comment regarding:
> ------
> 347  * Check for the possibility that the target is in fact a direct
> 348  * ancestor of the source. In that case, there is no divergent history
> 349  * in the target that needs rewinding.
> ------
> would be better reformulated, as the block's contents mostly cover the
> opposite case, of the target NOT being a direct ancestor of the source.
> Maybe: "We need a rewind in cases when ... and don't need one only if the
> target is a direct ancestor of the source." I think it would be more
> understandable either as a comment describing all the cases handled in the
> block, or with no comment before the block at all and instead a comment on
> each if/else case inside it.

The comment you cite is not part of this patch. I suspect there might be some
scope for improving the comments in general, but that would need to be done
separately.

> 2. When I run the test without patching pg_rewind.c, I get the following output:
> -----
> # Failed test 'pg_rewind detects rewind needed stderr /(?^:rewinding from last common checkpoint at)/'
> # at t/007_min_recovery_point.pl line 107.
> # 'pg_rewind: servers diverged at WAL location 0/3000180 on timeline 2
> # pg_rewind: no rewind required
> # '
> # doesn't match '(?^:rewinding from last common checkpoint at)'
> # Looks like you failed 1 test of 2.
> t/007_min_recovery_point.pl .. Dubious, test returned 1 (wstat 256, 0x100)
> Failed 1/2 subtests
>
> Test Summary Report
> -------------------
> t/007_min_recovery_point.pl (Wstat: 256 Tests: 2 Failed: 1)
>   Failed test: 2
>   Non-zero exit status: 1
> -------
> Maybe it can just give "failed" without so many details?

That's just the way the tests work - an individual test has no influence over
that output.

> Also, I think Heikki's notion could be fulfilled.

I spent a bit of time looking at that suggestion but couldn't actually
verify it was an issue which needed fixing.

Note that the patch may require reworking for HEAD due to changes in
commit 9c4f5192f6. I'll try to take another look this week.

Regards

Ian Barwick

--
EnterpriseDB: https://www.enterprisedb.com
>> 1. The comment regarding:
>> ------
>> 347  * Check for the possibility that the target is in fact a direct
>> 348  * ancestor of the source. In that case, there is no divergent history
>> 349  * in the target that needs rewinding.
>> ------
>> would be better reformulated, as the block's contents mostly cover the
>> opposite case, of the target NOT being a direct ancestor of the source.
>> Maybe: "We need a rewind in cases when ... and don't need one only if the
>> target is a direct ancestor of the source." I think it would be more
>> understandable either as a comment describing all the cases handled in the
>> block, or with no comment before the block at all and instead a comment on
>> each if/else case inside it.
>
> The comment you cite is not part of this patch. I suspect there might be some
> scope for improving the comments in general, but that would need to be done
> separately.
Sure, but it is the comment describing the whole block that the patch changes. If it could be made more relevant, and since you are planning to revise the patch anyway, it would be great to change it as well. But I don't insist.
> Note that the patch may require reworking for HEAD due to changes in
> commit 9c4f5192f6. I'll try to take another look this week.
Thank you!
--
On 16.11.2020 05:49, Ian Lawrence Barwick wrote:
> Note that the patch may require reworking for HEAD due to changes in
> commit 9c4f5192f6. I'll try to take another look this week.
>
> Regards
>
> Ian Barwick
>
> --
> EnterpriseDB: https://www.enterprisedb.com

Status update for a commitfest entry.

The patch has been Waiting on Author for some time. As this is a bug fix, I am
moving it to the next CF.

Ian, are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Status update for a commitfest entry.
>
> The patch has been Waiting on Author for some time. As this is a bug fix,
> I am moving it to the next CF.
>
> Ian, are you planning to continue working on it?
As a reviewer, I consider the patch useful and good overall. The comments I left were purely cosmetic. It's a pity that this bugfix has been delayed for such a small reason and has become outdated as a result. It would be nice to complete it in the next CF.
--
On 01/12/2020 16:52, Pavel Borisov wrote:
>> Status update for a commitfest entry.
>>
>> The patch has been Waiting on Author for some time. As this is a bug fix,
>> I am moving it to the next CF.
>>
>> Ian, are you planning to continue working on it?
>
> As a reviewer, I consider the patch useful and good overall. The comments
> I left were purely cosmetic. It's a pity that this bugfix has been delayed
> for such a small reason and has become outdated as a result. It would be
> nice to complete it in the next CF.

Yeah, we really should fix this..

On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
>> Also, I think Heikki's notion could be fulfilled.
>
> I spent a bit of time looking at that suggestion but couldn't actually
> verify it was an issue which needed fixing.

Attached are two patches. The first patch is your original patch, unmodified
(except for a cosmetic rename of the test file). The second patch builds on
that, demonstrating and fixing the issue I mentioned. It took me a while to
create a repro for it, it's easily masked by incidental full-page writes or
because rows created by XIDs that are not marked as committed on the other
timeline are invisible, but succeeded at last.

- Heikki
On 02/12/2020 20:13, Heikki Linnakangas wrote:
> On 01/12/2020 16:52, Pavel Borisov wrote:
>>> Status update for a commitfest entry.
>>>
>>> The patch has been Waiting on Author for some time. As this is a bug fix,
>>> I am moving it to the next CF.
>>>
>>> Ian, are you planning to continue working on it?
>>
>> As a reviewer, I consider the patch useful and good overall. The comments
>> I left were purely cosmetic. It's a pity that this bugfix has been delayed
>> for such a small reason and has become outdated as a result. It would be
>> nice to complete it in the next CF.
>
> Yeah, we really should fix this..
>
> On 16/11/2020 04:49, Ian Lawrence Barwick wrote:
>>> Also, I think Heikki's notion could be fulfilled.
>>
>> I spent a bit of time looking at that suggestion but couldn't actually
>> verify it was an issue which needed fixing.
>
> Attached are two patches. The first patch is your original patch, unmodified
> (except for a cosmetic rename of the test file). The second patch builds on
> that, demonstrating and fixing the issue I mentioned. It took me a while to
> create a repro for it, it's easily masked by incidental full-page writes or
> because rows created by XIDs that are not marked as committed on the other
> timeline are invisible, but succeeded at last.

Aha, many thanks. I wasn't entirely sure what I was looking for there and
recently haven't had the time or energy to dig any further.

Regards

Ian Barwick

--
Ian Barwick                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/12/2020 15:26, Ian Barwick wrote:
> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>> Attached are two patches. The first patch is your original patch, unmodified
>> (except for a cosmetic rename of the test file). The second patch builds on
>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>> create a repro for it, it's easily masked by incidental full-page writes or
>> because rows created by XIDs that are not marked as committed on the other
>> timeline are invisible, but succeeded at last.
>
> Aha, many thanks. I wasn't entirely sure what I was looking for there and
> recently haven't had the time or energy to dig any further.

Ok, pushed and backpatched this now. Thanks!

- Heikki
> Ok, pushed and backpatched this now.
Very nice!
Thanks to you all!
On 03/12/2020 16:49, Pavel Borisov wrote:
>> Ok, pushed and backpatched this now.
>
> Very nice!
> Thanks to you all!

Thanks for the review, Pavel! I just realized that I forgot to credit you in
the commit message. I'm sorry.

- Heikki
On Thu, 3 Dec 2020 at 19:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 03/12/2020 16:49, Pavel Borisov wrote:
>>> Ok, pushed and backpatched this now.
>>
>> Very nice!
>> Thanks to you all!
>
> Thanks for the review, Pavel! I just realized that I forgot to credit
> you in the commit message. I'm sorry.
Don't worry, Heikki. No problem.
--
On 03/12/2020 16:10, Heikki Linnakangas wrote:
> On 02/12/2020 15:26, Ian Barwick wrote:
>> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>>> Attached are two patches. The first patch is your original patch, unmodified
>>> (except for a cosmetic rename of the test file). The second patch builds on
>>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>>> create a repro for it, it's easily masked by incidental full-page writes or
>>> because rows created by XIDs that are not marked as committed on the other
>>> timeline are invisible, but succeeded at last.
>>
>> Aha, many thanks. I wasn't entirely sure what I was looking for there and
>> recently haven't had the time or energy to dig any further.
>
> Ok, pushed and backpatched this now.

The buildfarm is reporting sporadic failures in the new regression test.
I suspect it's because of timing issues, where a server is promoted or
shut down before some data has been replicated. I'll fix that tomorrow
morning.

- Heikki
On 04/12/2020 00:16, Heikki Linnakangas wrote:
> On 03/12/2020 16:10, Heikki Linnakangas wrote:
>> On 02/12/2020 15:26, Ian Barwick wrote:
>>> On 02/12/2020 20:13, Heikki Linnakangas wrote:
>>>> Attached are two patches. The first patch is your original patch, unmodified
>>>> (except for a cosmetic rename of the test file). The second patch builds on
>>>> that, demonstrating and fixing the issue I mentioned. It took me a while to
>>>> create a repro for it, it's easily masked by incidental full-page writes or
>>>> because rows created by XIDs that are not marked as committed on the other
>>>> timeline are invisible, but succeeded at last.
>>>
>>> Aha, many thanks. I wasn't entirely sure what I was looking for there and
>>> recently haven't had the time or energy to dig any further.
>>
>> Ok, pushed and backpatched this now.
>
> The buildfarm is reporting sporadic failures in the new regression test.
> I suspect it's because of timing issues, where a server is promoted or
> shut down before some data has been replicated. I'll fix that tomorrow
> morning.

Fixed, I hope. It took me a while to backpatch, because small differences were
needed in almost all versions, as some helpful TAP test helpers, like waiting
for a standby to catch up, are not available in backbranches.

There was one curious difference between versions 9.6 and 10. In v10, you can
perform a "clean switchover" like this:

1. Shut down the primary (A) with "pg_ctl -m fast".
2. Promote the standby (B) with "pg_ctl promote".
3. Reconfigure the old primary (A) as a standby, by creating a recovery.conf
   that points to the promoted server, and start it up.

But on 9.6, that leads to an error on the repurposed primary server (A):

    LOG:  primary server contains no more WAL on requested timeline 1
    LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/30000A0

It's not clear to me why that is. It seems that the primary generates some WAL
at shutdown that doesn't get replicated before the shutdown completes, or the
standby doesn't replay that WAL before it's promoted. But we have supported
"clean switchover" since 9.1, see commit 985bd7d497: when you shut down the
primary, it should wait until all the WAL has been replicated, including the
shutdown checkpoint.

Perhaps I was just doing it wrong in the test, or maybe there's a genuine bug
there that was fixed in v10. I worked around it in the test by re-initializing
the old primary from a backup instead of just reconfiguring it as a standby.
That's good enough for this particular test, so I'm not planning to dig deeper
into that myself.

- Heikki