Thread: Standby recovers records from wrong timeline

Standby recovers records from wrong timeline

From
Ants Aasma
Date:
When a standby is recovering to a timeline that doesn't have any segments archived yet, it will just blindly blow past the timeline switch point and keep on recovering on the old timeline. Typically that will eventually result in an error about an incorrect prev-link, but under unhappy circumstances it can result in the standby silently having different contents.

Attached is a shell script that reproduces the issue. The issue goes back to at least v12, probably further.

I think we should be keeping track of where the current replay timeline is going to end and not read any records past it on the old timeline. Maybe while at it, we should also track that the next record should be a checkpoint record for the timeline switch and error out if not. Thoughts?
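
To make the proposal above concrete, here is a minimal Python sketch of the bookkeeping it implies. It is a model, not PostgreSQL code: HistoryEntry, parse_lsn, end_of_timeline and may_replay are hypothetical names, and the switch point 0/2000100 is the one reported later in this thread for the repro.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical in-memory form of one timeline history line."""
    tli: int         # timeline that was left behind
    switch_lsn: int  # LSN at which the next timeline branched off


def parse_lsn(text: str) -> int:
    """Convert the textual 'hi/lo' LSN notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def end_of_timeline(history: List[HistoryEntry], tli: int) -> Optional[int]:
    """Return the LSN where `tli` ends, or None if no switch is known."""
    for entry in history:
        if entry.tli == tli:
            return entry.switch_lsn
    return None


def may_replay(history: List[HistoryEntry], replay_tli: int, record_lsn: int) -> bool:
    """A record may be replayed only if it starts before its timeline ends."""
    end = end_of_timeline(history, replay_tli)
    return end is None or record_lsn < end


# History as seen by a standby targeting timeline 2. The record LSN
# 0/20000D8 is made up for illustration.
history = [HistoryEntry(tli=1, switch_lsn=parse_lsn("0/2000100"))]
print(may_replay(history, 1, parse_lsn("0/20000D8")))  # True
print(may_replay(history, 1, parse_lsn("0/2000100")))  # False
print(may_replay(history, 2, parse_lsn("0/2000100")))  # True
```

With a check like this, records on timeline 1 at or past the switch point would be refused instead of replayed, regardless of which segment file they were read from.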

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
At Wed, 19 Oct 2022 18:50:09 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> When standby is recovering to a timeline that doesn't have any segments
> archived yet it will just blindly blow past the timeline switch point and
> keeps on recovering on the old timeline. Typically that will eventually
> result in an error about incorrect prev-link, but under unhappy
> circumstances can result in standby silently having different contents.
> 
> Attached is a shell script that reproduces the issue. Goes back to at least
> v12, probably longer.
> 
> I think we should be keeping track of where the current replay timeline is
> going to end and not read any records past it on the old timeline. Maybe
> while at it, we should also track that the next record should be a
> checkpoint record for the timeline switch and error out if not. Thoughts?

primary_restored traveled back in time a bit because of
recovery_target=immediate. In other words, primary_restored and
the replica diverge. I don't think it is legitimate to connect a
diverged standby to a primary.

So, about the behavior in doubt, it is the correct behavior to
seemingly ignore the history file in the archive. Recovery assumes
that the first half of the first segment of the new timeline is the
same as the same segment of the old timeline (.partial), so it is
legitimate to read the <tli=1,seg=2> file to the end, and that causes
the replica to go beyond the divergence point.

As you know, when a new primary starts a diverged history, the
recommended way is to blow away (or stash) the archive, then take a
new backup from the running primary.

If you don't want to trash all the past backups, remove the archived
files at or after the divergence point before starting the
standby. They're <tli=2,seg=2,3> in this case. You must also remove
replica/pg_wal/<tli=2,seg=2> before starting the replica. That file
causes recovery to run beyond the divergence point before fetching
from the archive or the stream.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
Sorry, a correction is needed.

At Thu, 20 Oct 2022 17:29:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 19 Oct 2022 18:50:09 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> > When standby is recovering to a timeline that doesn't have any segments
> > archived yet it will just blindly blow past the timeline switch point and
> > keeps on recovering on the old timeline. Typically that will eventually
> > result in an error about incorrect prev-link, but under unhappy
> > circumstances can result in standby silently having different contents.
> > 
> > Attached is a shell script that reproduces the issue. Goes back to at least
> > v12, probably longer.
> > 
> > I think we should be keeping track of where the current replay timeline is
> > going to end and not read any records past it on the old timeline. Maybe
> > while at it, we should also track that the next record should be a
> > checkpoint record for the timeline switch and error out if not. Thoughts?
> 
> primary_restored did a time-travel to past a bit because of the
> recovery_target=immediate. In other words, the primary_restored and
> the replica diverge. I don't think it is legit to connect a diverged
> standby to a primary.
> 
> So, about the behavior in doubt, it is the correct behavior to
> seemingly ignore the history file in the archive. Recovery assumes
> that the first half of the first segment of the new timeline is the
> same with the same segment of the old timeline (.partial) so it is
> legit to read the <tli=1,seg=2> file til the end and that causes the
> replica goes beyond the divergence point.
> 
> As you know, when new primary starts a diverged history, the
> recommended way is to blow (or stash) away the archive, then take a
> new backup from the running primary.
> 
> If you don't want to trash all the past backups, remove the archived
> files equals to or after the divergence point before starting the
> standby. They're <tli=2,seg=2,3> in this case. Also you must remove

<tli=2,seg=2,3> => <tli=1,seg=2,3>

> replica/pg_wal/<tli=2,seg=2> before starting the replica. That file
> causes recovery run beyond the divergence point before fetching from
> archive or stream.

<tli=2,seg=2>  => <tli=1,seg=2> 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
Forgot a caveat.

At Thu, 20 Oct 2022 17:34:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 19 Oct 2022 18:50:09 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> > When standby is recovering to a timeline that doesn't have any segments
> > archived yet it will just blindly blow past the timeline switch point and
> > keeps on recovering on the old timeline. Typically that will eventually
> > result in an error about incorrect prev-link, but under unhappy
> > circumstances can result in standby silently having different contents.
> > 
> > Attached is a shell script that reproduces the issue. Goes back to at least
> > v12, probably longer.
> > 
> > I think we should be keeping track of where the current replay timeline is
> > going to end and not read any records past it on the old timeline. Maybe
> > while at it, we should also track that the next record should be a
> > checkpoint record for the timeline switch and error out if not. Thoughts?
> 
> primary_restored did a time-travel to past a bit because of the
> recovery_target=immediate. In other words, the primary_restored and
> the replica diverge. I don't think it is legit to connect a diverged
> standby to a primary.
> 
> So, about the behavior in doubt, it is the correct behavior to
> seemingly ignore the history file in the archive. Recovery assumes
> that the first half of the first segment of the new timeline is the
> same with the same segment of the old timeline (.partial) so it is
> legit to read the <tli=1,seg=2> file til the end and that causes the
> replica goes beyond the divergence point.
> 
> As you know, when new primary starts a diverged history, the
> recommended way is to blow (or stash) away the archive, then take a
> new backup from the running primary.
> 
> If you don't want to trash all the past backups, remove the archived
> files equals to or after the divergence point before starting the
> standby. They're <tli=1,seg=2,3> in this case. Also you must remove
> replica/pg_wal/<tli=1,seg=2> before starting the replica. That file
> causes recovery run beyond the divergence point before fetching from
> archive or stream.

The reason this is workable is (as far as I can see) that
recovery_target=immediate is used to stop replication and the two
clusters share a completely identical disk image. Otherwise these
steps result in a broken standby.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Ants Aasma
Date:
On Thu, 20 Oct 2022 at 11:30, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> primary_restored did a time-travel to past a bit because of the
> recovery_target=immediate. In other words, the primary_restored and
> the replica diverge. I don't think it is legit to connect a diverged
> standby to a primary.

primary_restored did travel back in time; since we're doing PITR on the
primary, that's the expected behavior. However, the replica has not
diverged: it's a copy of the exact same base backup. The use case is
restoring a cluster from backup using PITR and using the same backup to
create a standby. Currently this breaks when the primary has not yet
archived any segments on the new timeline.

> So, about the behavior in doubt, it is the correct behavior to
> seemingly ignore the history file in the archive. Recovery assumes
> that the first half of the first segment of the new timeline is the
> same with the same segment of the old timeline (.partial) so it is
> legit to read the <tli=1,seg=2> file til the end and that causes the
> replica goes beyond the divergence point.

What is happening is that primary_restored has a timeline switch at
tli 2, lsn 0/2000100, and the next insert record starts in the same
segment. The replica starts from the same backup on timeline 1 and tries
to find tli 2 seg 2, which is not archived yet, so it falls back to
tli 1 seg 2, replays it, and continues to tli 1 seg 3. It then connects
to the primary and starts applying WAL from tli 2 seg 4. To me that
seems completely broken.
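
The sequence above can be reproduced with a small Python model of the segment lookup. This is a simplified sketch, not the actual XLogFileReadAnyTLI() code: it assumes the default 16MB segment size, uses the usual 24-hex-digit WAL file naming, and the helper names (segment_number, wal_file_name, read_any_tli) are made up for illustration.

```python
from typing import Iterable, Optional, Set

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default wal_segment_size


def segment_number(lsn: int) -> int:
    """Segment that contains the given LSN."""
    return lsn // WAL_SEGMENT_SIZE


def wal_file_name(tli: int, segno: int) -> str:
    """WAL file name: timeline, then the high and low parts of the segment number."""
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (tli, segno // segs_per_id, segno % segs_per_id)


def read_any_tli(available: Set[str], expected_tlis: Iterable[int], segno: int) -> Optional[str]:
    """Try the expected timelines newest first and return the first file found."""
    for tli in sorted(expected_tlis, reverse=True):
        name = wal_file_name(tli, segno)
        if name in available:
            return name
    return None


# Archive state in the repro: timeline 2 has no archived segments yet.
archive = {wal_file_name(1, 2), wal_file_name(1, 3)}

print(segment_number(0x2000100))          # 2, the switch point is inside seg 2
print(read_any_tli(archive, [1, 2], 2))   # 000000010000000000000002
print(read_any_tli(archive, [1, 2], 3))   # 000000010000000000000003
print(read_any_tli(archive, [1, 2], 4))   # None, so the standby turns to streaming
```

In this model nothing stops the lookup from returning the timeline 1 file for segments 2 and 3, even though timeline 1 is known to end at 0/2000100.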

> As you know, when new primary starts a diverged history, the
> recommended way is to blow (or stash) away the archive, then take a
> new backup from the running primary.

My understanding is that backup archives are supposed to remain valid
even after PITR or equivalently a lagging standby promoting.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com



Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
At Thu, 20 Oct 2022 14:44:40 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> My understanding is that backup archives are supposed to remain valid
> even after PITR or equivalently a lagging standby promoting.

Sorry, I was dim because of maybe catching a cold:p

On second thought, everything works fine if the first segment of the
new timeline is archived in this case. So the problem here is whether
recovery should wait for a known new timeline when no segment on the
new timeline is available yet.  As you say, I think it is sensible
that recovery waits at the divergence LSN for the first segment on the
new timeline before proceeding on the same timeline.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
At Fri, 21 Oct 2022 16:45:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 20 Oct 2022 14:44:40 +0300, Ants Aasma <ants@cybertec.at> wrote in 
> > My understanding is that backup archives are supposed to remain valid
> > even after PITR or equivalently a lagging standby promoting.
> 
> Sorry, I was dim because of maybe catching a cold:p
> 
> On second thought. everything works fine if the first segment of the
> new timeline is archived in this case. So the problem here is whether
> recovery should wait for a known new timeline when no segment on the
> new timeline is available yet.  As you say, I think it is sensible
> that recovery waits at the divergence LSN for the first segment on the
> new timeline before proceeding on the same timeline.

It is simpler than anticipated.  Just not descending to older
timelines when the target timeline is latest works. It doesn't
consider the case of explicit target timelines, so it's just a PoC.
(So this doesn't work if recovery_target_timeline is set to 2 for the
"standby" in the repro.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
At Fri, 21 Oct 2022 17:12:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> latest works. It doesn't consider the case of explicit target timelines
> so it's just a PoC.  (So this doesn't work if recovery_target_timeline
> is set to 2 for the "standby" in the repro.)

So, I finally noticed that the function XLogFileReadAnyTLI is not
needed at all if we are going in this direction.

Regardless of whether recovery_target_timeline is latest, an explicit
timeline ID, or the checkpoint timeline, all we can do to reach the
target timeline is follow the history file's direction.

If segments are partly gone while reading on a timeline, a segment on
an older timeline is just crap, since it should be incompatible.

So.. I'm at a loss about what the function is for.

Can anyone please tell me why we need the behavior of
XLogFileReadAnyTLI() at all?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Kyotaro Horiguchi
Date:
At Fri, 21 Oct 2022 17:44:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 21 Oct 2022 17:12:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > latest works. It doesn't consider the case of explicit target timelines
> > so it's just a PoC.  (So this doesn't work if recovery_target_timeline
> > is set to 2 for the "standby" in the repro.)
> 
> So, finally I noticed that the function XLogFileReadAnyTLI is not
> needed at all if we are going this direction.
> 
> Regardless of recovery_target_timeline is latest or any explicit
> timeline id or checkpoint timeline, what we can do to reach the target
> timeline is just to follow the history file's direction.
> 
> If segments are partly gone while reading on a timeline, a segment on
> the older timelines is just a crap since it should be incompatible.
> 
> So.. I'm at a loss about what the function is for.
> 
> Please anyone tell me why do we need the behavior of
> XLogFileReadAnyTLI() at all?

It was introduced by 1bb2558046, and the behavior dates back to 2042b3428d.

Hmmm..  XLogFileRead() at the time did essentially the same thing as
the current XLogFileReadAnyTLI.  At that time the expectedTL*I*s
contained only timeline IDs.  Thus it seems to me that, at that time,
recovery assumed that it was fine to read the segment on the
greatest available timeline in the TLI list at every moment. (Mmm. I
cannot describe this precisely enough....)  In other words, it did not
intend to use segments on timelines older than expected as a
replacement for the segment on the correct timeline.

If this is correct (I hope the description above makes sense), now
that we can determine the exact TLI to read for the specified segno,
we don't need to descend to older timelines. In other words, the
current XLogFileReadAnyTLI() should be just XLogFileReadOnHistory(),
which reads a segment of the exact timeline calculated from the
expectedTLEs and the segno.

I'm going to work in this direction.
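
As a rough illustration of what "the exact timeline calculated from the expectedTLEs and the segno" could mean, here is a Python sketch. It is only a guess at the rule, not the planned implementation: TimelineHistoryEntry and timeline_for_segment are hypothetical names, the entries are assumed to be ordered oldest first, and the segment size is assumed to be the default 16MB. The rule used is "the newest timeline whose first segment is not beyond segno", which matches the assumption stated earlier in the thread that the first segment of a new timeline starts with a copy of the old timeline's data.

```python
from dataclasses import dataclass
from typing import List

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default wal_segment_size


@dataclass(frozen=True)
class TimelineHistoryEntry:
    """Hypothetical stand-in for one element of expectedTLEs."""
    tli: int    # timeline ID
    begin: int  # first LSN that belongs to this timeline


def timeline_for_segment(expected_tles: List[TimelineHistoryEntry], segno: int) -> int:
    """Pick the single timeline whose file should be read for `segno`.

    `expected_tles` must be ordered oldest first. The newest timeline that
    has already begun by the end of the segment wins; no fallback is tried.
    """
    for tle in reversed(expected_tles):
        if tle.begin // WAL_SEGMENT_SIZE <= segno:
            return tle.tli
    raise LookupError("no timeline in the history covers segment %d" % segno)


# History of the repro: timeline 2 branches off timeline 1 at 0/2000100.
tles = [
    TimelineHistoryEntry(tli=1, begin=0),
    TimelineHistoryEntry(tli=2, begin=0x2000100),
]
print(timeline_for_segment(tles, 1))  # 1
print(timeline_for_segment(tles, 2))  # 2, the segment containing the switch
print(timeline_for_segment(tles, 3))  # 2
```

Under this rule a missing tli 2 seg 2 would mean waiting for it rather than reading tli 1 seg 2.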

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Standby recovers records from wrong timeline

From
Ants Aasma
Date:
On Fri, 21 Oct 2022 at 11:44, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Fri, 21 Oct 2022 17:12:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > latest works. It doesn't consider the case of explicit target timelines
> > so it's just a PoC.  (So this doesn't work if recovery_target_timeline
> > is set to 2 for the "standby" in the repro.)
>
> So, finally I noticed that the function XLogFileReadAnyTLI is not
> needed at all if we are going this direction.
>
> Regardless of recovery_target_timeline is latest or any explicit
> timeline id or checkpoint timeline, what we can do to reach the target
> timeline is just to follow the history file's direction.
>
> If segments are partly gone while reading on a timeline, a segment on
> the older timelines is just a crap since it should be incompatible.

I came to the same conclusion. I adjusted XLogFileReadAnyTLI to not use any
timeline that ends within the segment (attached patch). At this point the
name of the function becomes really wrong; XLogFileReadCorrectTLI or
something to that effect would be much more descriptive, and the code could
be simplified.

However, I'm not particularly happy with this approach, as it will not use
valid WAL on the old timeline if the segment on the new timeline is not
available. Consider the scenario of a cascading failure. Node A has a hard
failure, then node B promotes and archives the history file, but doesn't see
enough traffic to archive a full segment before failing itself. While this
is happening we restore node A from backup and start it up as a standby.

If node B fails before node A has a chance to connect, then either we
continue recovery on the wrong timeline (current behavior) or we do
not try to recover the first portion of the archived WAL file (with patch).

So I think the correct approach would still be to have ReadRecord() or
ApplyWalRecord() determine that switching timelines is needed.
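
The trade-off between the three options can be laid out in a toy Python model. It only encodes the argument made above and is not derived from the patch: the Policy names and replay_end_on_old_timeline are hypothetical, a 16MB segment size is assumed, and the old timeline's segment is assumed to be complete in the archive while the new timeline's segment is missing.

```python
from enum import Enum

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default wal_segment_size


class Policy(Enum):
    CURRENT = "fall back to the old timeline's segment and read all of it"
    SKIP_ENDING_TLI = "skip any timeline that ends within the segment"
    STOP_AT_SWITCH = "read the old segment but stop at the switch point"


def replay_end_on_old_timeline(policy: Policy, segno: int, switch_lsn: int) -> int:
    """LSN up to which the old timeline's copy of `segno` gets replayed.

    Only models the case discussed in the thread: the old timeline ends
    inside `segno` and the new timeline's segment is not available.
    """
    seg_start = segno * WAL_SEGMENT_SIZE
    seg_end = seg_start + WAL_SEGMENT_SIZE
    if not (seg_start <= switch_lsn < seg_end):
        raise ValueError("switch point is not inside the segment")
    if policy is Policy.CURRENT:
        return seg_end      # runs past the switch point (wrong timeline)
    if policy is Policy.SKIP_ENDING_TLI:
        return seg_start    # valid WAL before the switch point is not used
    return switch_lsn       # valid prefix replayed, then wait for timeline 2


switch = 0x2000100
for policy in Policy:
    end = replay_end_on_old_timeline(policy, 2, switch)
    print("%-16s replays up to 0/%X" % (policy.name, end))
```

Only the last option replays exactly the WAL that is valid on both timelines, which is why the decision seems to belong at the record level rather than at the file lookup.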

-- 
Ants Aasma
www.cybertec-postgresql.com
