Re: Corner-case bug in pg_rewind - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Corner-case bug in pg_rewind |
Date | |
Msg-id | d5824e45-4ad2-19c7-4c29-ceeb720efded@iki.fi Whole thread Raw |
In response to | Corner-case bug in pg_rewind (Ian Barwick <ian.barwick@2ndquadrant.com>) |
Responses |
Re: Corner-case bug in pg_rewind
|
List | pgsql-hackers |
On 11/09/2020 09:42, Ian Barwick wrote: > Take the following cluster with: > - node1 (initial primary) > - node2 (standby) > - node3 (standby) > > Following activity takes place (greatly simplified from a real-world situation): > > 1. node1 is shut down. > 2. node3 is promoted > 3. node2 is attached to node3. > 4. node1 is attached to node3 > 5. node1 is then promoted (creating a split brain situation with > node1 and node3 as primaries) > 6. node2 and node3 are shut down (in that order). > 7. pg_rewind is executed to reset node2 so it can reattach > to node1 as a standby. pg_rewind claims: > > pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2 > pg_rewind: no rewind required > > 8. based off that assurance, node2 is restarted with replication configuration > pointing to node1 - but it is unable to attach, with node2's log reporting > something like: > > new timeline 3 forked off current database system timeline 2 > before current recovery point X/XXXXXXX > > The cause is that pg_rewind is assuming that if the node's last > checkpoint matches the > divergence point, no rewind is needed: > > if (chkptendrec == divergerec) > rewind_needed = false; > > but in this case there *are* records beyond the last checkpoint, which can be > inferred from "minRecoveryPoint" - but this is not checked. Yep, I think you're right. > Attached patch addresses this. It includes a test, which doesn't make use of > the RewindTest module, as that hard-codes a primary and a standby, while here > three nodes are needed (I can't come up with a situation where this can be > reproduced with only two nodes). The test sets "wal_keep_size" so would need > modification for Pg12 and earlier. I think we also need to change the extractPageMap() call: > /* > * Read the target WAL from last checkpoint before the point of fork, to > * extract all the pages that were modified on the target cluster after > * the fork. We can stop reading after reaching the final shutdown record. > * XXX: If we supported rewinding a server that was not shut down cleanly, > * we would need to replay until the end of WAL here. > */ > if (showprogress) > pg_log_info("reading WAL in target"); > extractPageMap(datadir_target, chkptrec, lastcommontliIndex, > ControlFile_target.checkPoint, restore_command); > filemap_finalize(); so that it scans all the way up to minRecoveryPoint, instead of stopping at ControlFile_target.checkPoint. Otherwise, we will fail to rewind changes that happened after the last checkpoint. And we need to do that regardless of the "no rewind required" bug, any time we rewind a server that's in DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the existing tests have caught that. Am I missing something? - Heikki
pgsql-hackers by date: