Re: Standby recovers records from wrong timeline - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Standby recovers records from wrong timeline
Date
Msg-id 20221020.172957.1000540024644914902.horikyota.ntt@gmail.com
In response to Standby recovers records from wrong timeline  (Ants Aasma <ants@cybertec.at>)
Responses Re: Standby recovers records from wrong timeline  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: Standby recovers records from wrong timeline  (Ants Aasma <ants@cybertec.at>)
List pgsql-hackers
At Wed, 19 Oct 2022 18:50:09 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> When standby is recovering to a timeline that doesn't have any segments
> archived yet it will just blindly blow past the timeline switch point and
> keeps on recovering on the old timeline. Typically that will eventually
> result in an error about incorrect prev-link, but under unhappy
> circumstances can result in standby silently having different contents.
> 
> Attached is a shell script that reproduces the issue. Goes back to at least
> v12, probably longer.
> 
> I think we should be keeping track of where the current replay timeline is
> going to end and not read any records past it on the old timeline. Maybe
> while at it, we should also track that the next record should be a
> checkpoint record for the timeline switch and error out if not. Thoughts?

primary_restored traveled back in time a bit because of
recovery_target=immediate. In other words, primary_restored and the
replica have diverged. I don't think it is legitimate to connect a
diverged standby to a primary.

So, as for the behavior in question, seemingly ignoring the history
file in the archive is the correct behavior. Recovery assumes that the
first half of the first segment of the new timeline is identical to
the same segment of the old timeline (.partial), so it is legitimate
to read the <tli=1,seg=2> file to the end, and that causes the
replica to go beyond the divergence point.
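
To make the <tli,seg> notation concrete, here is a minimal sketch of
how a timeline ID and segment number map to a WAL file name. It
assumes the default 16MB wal_segment_size; the helper name
wal_file_name is made up for illustration and is not a PostgreSQL
API.

```python
# Sketch only: builds the 24-hex-digit WAL segment file name from a
# timeline ID and a segment number. Assumes the default 16MB segment
# size; wal_file_name is a hypothetical helper, not a PostgreSQL API.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_file_name(tli: int, segno: int,
                  wal_segment_size: int = WAL_SEGMENT_SIZE) -> str:
    # Each 4GB "xlogid" is split into this many segments.
    segs_per_xlogid = 0x100000000 // wal_segment_size
    return "%08X%08X%08X" % (
        tli,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    # <tli=1,seg=2> and <tli=2,seg=2> differ only in the first 8 digits.
    print(wal_file_name(1, 2))  # 000000010000000000000002
    print(wal_file_name(2, 2))  # 000000020000000000000002
```

The two names share the same segment part, which is why the segment
containing the switch point exists under both timelines.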

As you know, when a new primary starts a diverged history, the
recommended way is to blow away (or stash) the archive, then take a
new backup from the running primary.

If you don't want to trash all the past backups, remove the archived
files at or after the divergence point before starting the standby.
They're <tli=2,seg=2,3> in this case. Also you must remove
replica/pg_wal/<tli=2,seg=2> before starting the replica. That file
causes recovery to run beyond the divergence point before fetching
from archive or stream.
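
As a rough illustration of picking those files out of an archive
listing, the sketch below parses WAL file names and returns the ones
on a given timeline at or after a given segment number. It only
lists candidates and deletes nothing. It assumes the default 16MB
wal_segment_size and plain (uncompressed) archive file names; the
function names are made up for this example, and working out the
actual divergence point is left to the operator.

```python
import re

# Plain WAL segment names, optionally with a .partial suffix.
# History files (e.g. 00000002.history), .backup files and compressed
# archives do not match and are skipped.
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})(\.partial)?$")


def parse_wal_name(name, wal_segment_size=16 * 1024 * 1024):
    """Return (tli, segno) for a WAL segment file name, else None."""
    m = WAL_NAME.match(name)
    if m is None:
        return None
    segs_per_xlogid = 0x100000000 // wal_segment_size
    tli = int(m.group(1), 16)
    segno = int(m.group(2), 16) * segs_per_xlogid + int(m.group(3), 16)
    return tli, segno


def files_at_or_after(names, tli, first_segno):
    """List segment files on timeline tli with segno >= first_segno."""
    selected = []
    for name in names:
        parsed = parse_wal_name(name)
        if parsed is not None and parsed[0] == tli and parsed[1] >= first_segno:
            selected.append(name)
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical archive listing for illustration.
    archive = [
        "000000010000000000000001",
        "000000010000000000000002",
        "000000020000000000000002",
        "000000020000000000000003",
        "00000002.history",
    ]
    for name in files_at_or_after(archive, tli=2, first_segno=2):
        print(name)
```

Review the printed list by hand before removing anything from the
archive or from pg_wal.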

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


