Re: Corner-case bug in pg_rewind - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Corner-case bug in pg_rewind
Date
Msg-id d5824e45-4ad2-19c7-4c29-ceeb720efded@iki.fi
Whole thread Raw
In response to Corner-case bug in pg_rewind  (Ian Barwick <ian.barwick@2ndquadrant.com>)
Responses Re: Corner-case bug in pg_rewind
List pgsql-hackers
On 11/09/2020 09:42, Ian Barwick wrote:
> Take the following cluster with:
>    - node1 (initial primary)
>    - node2 (standby)
>    - node3 (standby)
> 
> Following activity takes place (greatly simplified from a real-world situation):
> 
> 1. node1 is shut down.
> 2. node3 is promoted
> 3. node2 is attached to node3.
> 4. node1 is attached to node3
> 5. node1 is then promoted (creating a split brain situation with
>     node1 and node3 as primaries)
> 6. node2 and node3 are shut down (in that order).
> 7. pg_rewind is executed to reset node2 so it can reattach
>     to node1 as a standby. pg_rewind claims:
> 
>      pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
>      pg_rewind: no rewind required
> 
> 8. based off that assurance, node2 is restarted with replication configuration
>     pointing to node1 - but it is unable to attach, with node2's log reporting
>     something like:
> 
>        new timeline 3 forked off current database system timeline 2
> before current recovery point X/XXXXXXX
> 
> The cause is that pg_rewind is assuming that if the node's last
> checkpoint matches the
> divergence point, no rewind is needed:
> 
>      if (chkptendrec == divergerec)
>          rewind_needed = false;
> 
> but in this case there *are* records beyond the last checkpoint, which can be
> inferred from "minRecoveryPoint" - but this is not checked.

Yep, I think you're right.

> Attached patch addresses this. It includes a test, which doesn't make use of
> the RewindTest module, as that hard-codes a primary and a standby, while here
> three nodes are needed (I can't come up with a situation where this can be
> reproduced with only two nodes). The test sets "wal_keep_size" so would need
> modification for Pg12 and earlier.

I think we also need to change the extractPageMap() call:

>     /*
>      * Read the target WAL from last checkpoint before the point of fork, to
>      * extract all the pages that were modified on the target cluster after
>      * the fork. We can stop reading after reaching the final shutdown record.
>      * XXX: If we supported rewinding a server that was not shut down cleanly,
>      * we would need to replay until the end of WAL here.
>      */
>     if (showprogress)
>         pg_log_info("reading WAL in target");
>     extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
>                    ControlFile_target.checkPoint, restore_command);
>     filemap_finalize();

so that it scans all the way up to minRecoveryPoint, instead of stopping 
at ControlFile_target.checkPoint. Otherwise, we will fail to rewind 
changes that happened after the last checkpoint. And we need to do that 
regardless of the "no rewind required" bug, any time we rewind a server 
that's in DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the 
existing tests have caught that. Am I missing something?

- Heikki



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Skip ExecCheckRTPerms in CTAS with no data
Next
From: Ashutosh Bapat
Date:
Subject: Re: Disable WAL logging to speed up data loading