Re: pg_rewind does not rewind diverging timelines - Mailing list pgsql-hackers

From Mats Kindahl
Subject Re: pg_rewind does not rewind diverging timelines
Msg-id CAN305gC0VE8zB=guccMj-7cJTW4oOAmTYCktaUKSzyOup=HHEw@mail.gmail.com
In response to pg_rewind does not rewind diverging timelines  (Mats Kindahl <mats.kindahl@gmail.com>)
List pgsql-hackers
On Thu, Apr 30, 2026 at 10:19 AM Mats Kindahl <mats.kindahl@gmail.com> wrote:
Hi all,

I have been playing around with various promotion scenarios to check whether writes can be lost in more complicated setups involving promotions and synchronous_standby_names. To do that, I created a TLA+ model for streaming replication with promotions and checked it with TLC. You can find the models at [1] if you're interested.

TLC found one scenario that I assume is known but that does not seem to be fixed. It is a relatively rare case, but since the fix is quite easy, I thought I'd share it with you and get feedback.

The scenario can occur if you're unlucky and have more than one crash when promoting standbys to be primaries, and goes like this:

You have three servers, S1, S2, and S3. S1 is primary and S2 and S3 are standbys. All are on timeline (TLI) 1.

1. S1 crashes
2. S1 recovers and starts promotion. It writes XLOG_END_OF_RECOVERY (EOR) for TLI 2 to the WAL.
3. S1 manages to write some records W1 to the WAL.
4. Before the EOR is replicated to any standby, S1 crashes again. It is now on TLI 2 and has some changes that exist nowhere else.
5. S2 is promoted. It writes an EOR for TLI 2 (since it is not aware of any other timeline) to the WAL.
6. S2 writes some records W2 to the WAL. S1 is now on one version of TLI 2 (call it TLI 2.1) and S2 is on another (TLI 2.2).
7. S1 recovers and wants to join as a standby. You run pg_rewind to get rid of the extra data, but since S2 is also on TLI 2, pg_rewind will happily assume that both are on the same timeline.
8. S1 is now a standby but still has the extra records W1 both in the WAL and in the database.
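
To make the failure concrete, here is a small Python model of steps 1 to 8. It is only a sketch and not pg_rewind's actual code: `HistoryEntry`, `last_common_index` and the LSN values are made up for illustration, and it assumes that S2 had replicated everything up to S1's switch point, so both servers leave TLI 1 at the same LSN.

```python
from typing import List, NamedTuple, Optional


class HistoryEntry(NamedTuple):
    tli: int        # timeline ID
    begin_lsn: int  # LSN at which this timeline starts (hypothetical values)


def last_common_index(a: List[HistoryEntry],
                      b: List[HistoryEntry]) -> Optional[int]:
    """Index of the last entry on which both histories agree, or None."""
    common = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            break
        common = i
    return common


# S1 and S2 both switch from TLI 1 to TLI 2 at LSN 100, independently.
s1_history = [HistoryEntry(1, 0), HistoryEntry(2, 100)]  # "TLI 2.1", holds W1
s2_history = [HistoryEntry(1, 0), HistoryEntry(2, 100)]  # "TLI 2.2", holds W2

idx = last_common_index(s1_history, s2_history)

# The last common entry is also the newest entry on both sides, so a
# comparison based on these fields alone sees a single shared timeline.
looks_like_same_timeline = (idx == len(s1_history) - 1 == len(s2_history) - 1)
print(looks_like_same_timeline)
```

In this model nothing distinguishes the two versions of TLI 2, so W1 is never identified as something that has to be rewound.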

The fix (see attached draft) is quite simple: add a UUID to the EOR and to the history file. When comparing timelines, don't only check the TLI, also check the UUID. If they do not both match, go back further until you find a timeline where both the TLI and the timeline UUID match, and do the usual fandango to find the good LSN to rewind to.


Here is an updated version of the patch. It seems that it is not necessary to extend the XLOG_END_OF_RECOVERY record with the UUID; adding it to the history files is enough. The scenario is still the same though, and can lead to diverging servers, possibly silently. I have added a test case with a divergence going back three promotions.
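
For readers who want the idea without reading the patch, the comparison can be sketched roughly as below. This is an illustrative Python model, not the patch itself: the `timeline_uuid` field, `new_timeline` and `last_common_index` are hypothetical names, the LSNs are invented, and the assumption is that a fresh UUID is generated at every promotion and stored with the history entry.

```python
import uuid
from typing import List, NamedTuple, Optional


class HistoryEntry(NamedTuple):
    tli: int                  # timeline ID
    begin_lsn: int            # LSN at which this timeline starts
    timeline_uuid: uuid.UUID  # hypothetical: generated once per promotion


def new_timeline(tli: int, begin_lsn: int) -> HistoryEntry:
    """Model a promotion: same TLI numbering rules, but a fresh UUID."""
    return HistoryEntry(tli, begin_lsn, uuid.uuid4())


def last_common_index(source: List[HistoryEntry],
                      target: List[HistoryEntry]) -> Optional[int]:
    """Walk both histories and stop at the first entry that differs in
    TLI, start LSN, or UUID. Returns the index of the last shared entry."""
    common = None
    for i, (s, t) in enumerate(zip(source, target)):
        if s != t:
            break
        common = i
    return common


# Shared ancestry: TLI 1 and TLI 2 were created once and copied to both.
shared = [new_timeline(1, 0), new_timeline(2, 100)]

# Both servers then promote independently three times, picking the same
# TLI numbers and (in this model) the same switch LSNs.
source = shared + [new_timeline(t, l) for t, l in [(3, 200), (4, 300), (5, 400)]]
target = shared + [new_timeline(t, l) for t, l in [(3, 200), (4, 300), (5, 400)]]

tli_only_equal = [e.tli for e in source] == [e.tli for e in target]
idx = last_common_index(source, target)

print(tli_only_equal)   # the TLI numbers alone cannot tell the histories apart
print(source[idx].tli)  # with UUIDs, the last shared timeline is found
```

With the UUID included, the comparison falls back to TLI 2 as the last shared timeline, and the rewind LSN would then be determined from that timeline in the usual way (not modeled here).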
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachment
