Thread: Unable to start replica after failover

Unable to start replica after failover

From
"Lahnov, Igor"
Date:

Hi,

We're using asynchronous streaming replication together with WAL archiving and recovery (we've set these parameters: archive_command, restore_command, and recovery_target_timeline=latest).

To manage backup and recovery, we're using pg_probackup.

We have a 3-node cluster managed by Patroni on Raft.

While running our cluster, we encounter a recurring problem: after a failover, the former leader can recover neither via the tools provided by PostgreSQL nor via the Patroni tooling (pg_rewind).

We've seen a number of different WAL reading errors. For example:

·         "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or

·         "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".

There have been some other variations but all such errors have the same cause:

1. At startup, the former leader does not immediately get access to the WAL archive. As a result, the recovery process uses the local WAL, which inevitably leads to a split brain, since in most cases there is replication lag at the time of a failover.

2. The former leader then finally manages to restore the latest segment from the archive, and that segment takes priority over the local WAL, even though recovery with the local WAL has already occurred.

As a result, we see the following:

"new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150".

When pg_rewind is run, it also uses the segment restored from the archive (the lagging copy from the new leader) instead of the local partial segment with which the former leader has already been started.

Next, pg_rewind returns errors while reading that archived segment backward, looking for the last checkpoint. This is quite reasonable: once a new leader is promoted, the point of divergence normally ends up on the next timeline, and the archived segment from the previous timeline does not contain a record at the divergence LSN.

As a result, in this case it is also impossible for pg_rewind to build the map of changed pages it needs to roll back the changes.

 

To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?
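For illustration only, a restore_command wrapper that refuses to fetch a segment when a local .partial copy of it already exists might look like the sketch below. All names here (guarded_restore, WAL_DIR, FETCH_CMD) are hypothetical, and a real setup would delegate to pg_probackup archive-get; this is not the cluster's actual configuration.

```shell
#!/bin/sh
# Sketch of a guarded restore_command (hypothetical), configured e.g. as:
#   restore_command = 'guarded_restore.sh %f %p'
# It refuses to fetch a segment from the archive when pg_wal already holds
# a local .partial copy of the same segment, so locally streamed WAL keeps
# priority over the archived copy.
guarded_restore() {
    f=$1                          # requested WAL file name (%f)
    p=$2                          # destination path (%p)
    wal_dir=${WAL_DIR:-pg_wal}
    if [ -e "$wal_dir/$f.partial" ]; then
        echo "guarded_restore: local $f.partial exists, skipping archive fetch" >&2
        return 1                  # non-zero tells Postgres the file is unavailable
    fi
    # Delegate to the real fetch command (pg_probackup archive-get in this thread).
    ${FETCH_CMD:-false} "$f" "$p"
}
```

Whether such a guard is safe in all recovery scenarios is exactly the open question here.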

 

Re: Unable to start replica after failover

From
Kyotaro Horiguchi
Date:
At Fri, 29 Jul 2022 15:01:44 +0000, "Lahnov, Igor" <Igor.Lahnov@nexign.com> wrote in 
> *         "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or
> *         "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".
 
..
> To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?
 

Perhaps almost none of us can see what is happening there, since you didn't give us sufficient information on the configuration and the exact steps.
 

But roughly it looks like shuffling/mixing of WAL files among several
systems (or WAL archives) with different histories.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Unable to start replica after failover

From
"Lahnov, Igor"
Date:
What additional information is needed?

Re: Unable to start replica after failover

From
Kyotaro Horiguchi
Date:
> What additional information is needed?

Usually server logs and the output of pg_rewind at the trouble time
are needed as the first step.

> Next, pg_rewind returns errors while reading the log from the backup
> back, looking for the last checkpoint, which is quite reasonable
> because, once a new leader starts, the point of divergence normally
> ends up in the next timeline and the previous timeline's backup log
> does not have a block with the LSN of the divergence.

That makes it sound as if pg_rewind is broken.

pg_rewind reads the timeline history files from both servers to find
the last timeline up to which the two servers share the same history,
then determines the divergence point as the latest LSN the two
servers are known to share. Then it overwrites the pages on the
former leader that were modified between the common checkpoint and
the last (shutdown) checkpoint on the *previous* timeline with the
data of the same pages *on the new leader*. No page data from the
older timeline is needed. If nothing is going wrong, pg_rewind is
not expected to face the situation of:

> could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50
> could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB,

Which means the WAL files are somehow broken.
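As an aside, the history-file lookup described above can be sketched over the standard timeline history file format (one line per timeline switch: parent timeline ID, switch LSN, reason). The helper name divergence_lsn is hypothetical; this is an illustration of the lookup, not pg_rewind's actual code:

```shell
# Sketch: find the divergence LSN from the new leader's timeline history
# file, the way pg_rewind's history comparison works. For an old leader
# still on timeline <tli>, the divergence point is the LSN at which the
# new leader's history switched away from that timeline.
divergence_lsn() {
    history_file=$1       # timeline history file from the new leader
    old_leader_tli=$2     # timeline the old leader is still on
    awk -v tli="$old_leader_tli" '$1 == tli { print $2 }' "$history_file"
}
```

With the history shown later in this thread ("1 53E/FD65D298 ...", "2 54E/FB348118 ..."), an old leader on timeline 2 diverges at 54E/FB348118, matching pg_rewind's "servers diverged at WAL location 54E/FB348118 on timeline 2".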

> When pg_rewind is run, it also uses the log from the backup (the
> lagging log from the new leader) instead of the partial log with
> which the former leader has already been started.

I don't see how the former leader could lack access to the partial
log (the latest WAL file, I suppose). It is essential for pg_rewind
to work (since it exists nowhere other than there), and it must be in
the pg_wal directory unless someone removed it.

Thus, I think we need the exact steps you and your system took with
PostgreSQL after the failover happened.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Unable to start replica after failover

From
"Lahnov, Igor"
Date:
We know what the problem is, but don't know how to solve it correctly.

After the failover, the *new leader* was promoted and read its local partial log up to LSN 54E/FB348118:

->
2022-05-23 01:47:52.124 [12088] LOG: record with incorrect prev-link 0/100 at 54E/FB348118
2022-05-23 01:47:52.124 [12088] LOG: redo done at 54E/FB3480F0 system usage: CPU: user: 321.21 s, system: 144.77 s, elapsed: 354977.44 s
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 54E/FB348118, prev 54E/FB3480F0, desc: END_OF_RECOVERY tli 3; prev tli 2; time 2022-05-23 01:47:53.342100
2022-05-23 01:47:52.900 [12088] LOG: selected new timeline ID: 3
2022-05-23 01:47:53.658 [12084] LOG: database system is ready to accept connections
<-

The *previous leader* started after the failure and read its local partial log up to LSN 54E/FB348150:

->
2022-05-23 01:50:12.945 [122891] LOG: redo starts at 54E/F97F4878
2022-05-23 01:50:13.010 [122891] LOG: invalid record length at 54E/FB348150: wanted 24, got 0
2022-05-23 01:50:13.011 [122891] LOG: consistent recovery state reached at 54E/FB348150
2022-05-23 01:50:13.012 [122883] LOG: database system is ready to accept read-only connections
<-

Our 'restore_command' on the *previous leader* then restores a partial file from the archive (originating from the *new leader*):

->
2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 000000020000054E000000FB, remote: ssh, threads: 1/1, batch: 1
<-

And this file has a higher priority than the original local partial file on the *previous leader*, which leads to the problem:

---
2022-05-23 01:50:14.448 [122891] LOG: new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150
---

And pg_rewind tries to use this file (from the *new leader*) to build a map of pages that have changed since the last checkpoint:

---
2022-05-23 01:51:32,202 INFO: Lock owner: pg51; I am pg01
2022-05-23 01:51:32,209 INFO: Local timeline=2 lsn=54E/FB348150
2022-05-23 01:51:32,252 INFO: master_timeline=3
2022-05-23 01:51:32,254 INFO: master: history=1 53E/FD65D298 no recovery target specified
2 54E/FB348118 no recovery target specified
2022-05-23 01:51:32,499 INFO: running pg_rewind from pg51
2022-05-23 01:51:32,574 INFO: running pg_rewind from dbname=postgres user=postgres host= IP port=5432 target_session_attrs=read-write
2022-05-23 01:51:32,640 INFO: pg_rewind exit code=1
2022-05-23 01:51:32,640 INFO: stdout=
2022-05-23 01:51:32,640 INFO: stderr=pg_rewind: servers diverged at WAL location 54E/FB348118 on timeline 2
pg_rewind: fatal: could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB, offset 3448832
---

We checked this log file (from the *new leader*). When reading it forward, the output is:
---
record with incorrect prev-link 0/100 at 54E/FB348118
----
and when reading it backward:
---
54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB, offset 3448832
---

For now we have disabled restore_command, but I don't think that is the best solution. What is the correct solution in this case?



Re: Unable to start replica after failover

From
Alexander Kukushkin
Date:
Hi,

On Tue, 23 Aug 2022 at 16:31, Lahnov, Igor <Igor.Lahnov@nexign.com> wrote:

Our 'restore_command' on *previous leader* restores a partial file from archive (from *new leader*)

->
2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 000000020000054E000000FB, remote: ssh, threads: 1/1, batch: 1
<-


Does it really restore the *partial* file? And why would it do that? The old primary didn't have a chance to archive 000000020000054E000000FB; the new primary archived it as 000000020000054E000000FB.partial after promotion.
Postgres is requesting the 000000020000054E000000FB file, but somehow gets a 000000020000054E000000FB.partial instead. Why?
IMO, pg_probackup does something weird.

Regards,
--
Alexander Kukushkin

RE: Unable to start replica after failover

From
"Lahnov, Igor"
Date:

Hi,

Yes, the *partial* file from the *new leader* is restored on the *former leader* and renamed to 000000020000054E000000FB, without the *partial* suffix.

 

>> Postgres asks for file 000000020000054E000000FB but somehow gets
>> 000000020000054E000000FB.partial instead. Why?

 

Yes, Postgres asks for 000000020000054E000000FB and gets the renamed 000000020000054E000000FB.partial (without the *partial* suffix).

 

And, finally, the *former leader* has a local 000000020000054E000000FB.partial, with the segments needed for recovery, and a 000000020000054E000000FB (the renamed 000000020000054E000000FB.partial from the new leader). But 000000020000054E000000FB has a higher priority than 000000020000054E000000FB.partial.

Also, the *new leader* archives a full 000000030000054E000000FB on the new timeline, and this file contains the same data as 000000020000054E000000FB.

Re: Unable to start replica after failover

From
Alexander Kukushkin
Date:
Hi,


On Wed, 24 Aug 2022 at 13:37, Lahnov, Igor <Igor.Lahnov@nexign.com> wrote:
 

Yes, Postgres asks for 000000020000054E000000FB and gets the renamed 000000020000054E000000FB.partial (without the *partial* suffix).


But why? This is totally weird and unexpected behavior. Why is pg_probackup doing this?

Regards,
--
Alexander Kukushkin

RE: Unable to start replica after failover

From
"Lahnov, Igor"
Date:

I can't answer your question. How do you think recovery from the archive should work in this case?

1. the partial file should not be restored in any case

2. the partial file should be restored as a partial and replace the local partial

3. the restore command should return a conflict: the file already exists

Re: Unable to start replica after failover

From
Alexander Kukushkin
Date:
Hi Igor,


On Fri, 26 Aug 2022 at 13:43, Lahnov, Igor <Igor.Lahnov@nexign.com> wrote:

I can't answer your question. How do you think recovery from the archive should work in this case?

1. the partial file should not be restored in any case

2. the partial file should be restored as a partial and replace the local partial

3. the restore command should return a conflict: the file already exists


The partial file should never be restored automatically.
I have no clue why the pg_probackup developers decided to do it.
Postgres never requests the partial file using the restore_command; instead, it requests a file from the next timeline.
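To illustrate the behavior described above, a restore_command that fetches only a file stored in the archive under exactly the requested name, with no fallback to a .partial sibling, could be sketched as follows. The names strict_restore and ARCHIVE_DIR are hypothetical, and a real setup would invoke the backup tool rather than cp; this is a sketch of the expected contract, not pg_probackup's code:

```shell
# Sketch: exact-name archive fetch with no .partial fallback
# (hypothetical stand-in for a well-behaved restore_command).
strict_restore() {
    f=$1                                      # requested WAL file name (%f)
    p=$2                                      # destination path (%p)
    archive=${ARCHIVE_DIR:-/var/lib/wal-archive}
    if [ ! -f "$archive/$f" ]; then
        # No exact match in the archive: fail, so Postgres falls back to
        # local pg_wal, streaming, or the next timeline -- instead of
        # being handed a renamed .partial file.
        return 1
    fi
    cp "$archive/$f" "$p"
}
```

With such a contract, a request for 000000020000054E000000FB would simply fail on an archive that only holds 000000020000054E000000FB.partial.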

Regards,
--
Alexander Kukushkin

RE: Unable to start replica after failover

From
"Lahnov, Igor"
Date:

As far as I understand from the logs, the former leader does not yet know about the new timeline, so it tries to download the full segment from the previous timeline.

It seems there should be a conflict in this case, since the partial file already exists locally when restoring, but this does not happen, and the partial file from the new leader is loaded as a full log.

Do you think it would be possible to add a check to the restore command that a partial or full file already exists?

Or is disabling the restore command a possible solution in this case?

Re: Unable to start replica after failover

From
Alexander Kukushkin
Date:


On Tue, Sep 6, 2022, 08:46 Lahnov, Igor <Igor.Lahnov@nexign.com> wrote:

What do you think it is possible to add a check to the restore command, that a partial or full file already exists?

Or is disabling the restore command a possible solution in this case?


My opinion hasn't changed: pg_probackup does a weird thing. It shouldn't restore the partial file in place of the normal file when the normal file doesn't exist.

Regards,
--
Alexander Kukushkin