Unable to start replica after failover - Mailing list pgsql-general

From Lahnov, Igor
Subject Unable to start replica after failover
Date
Msg-id c0085e0f801d4bda9e50b46c496311b3@nexign.com
Whole thread Raw
Responses Re: Unable to start replica after failover  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-general

Hi,

We're using asynchronous streaming replication together with WAL archiving and recovery (we've set these parameters: archive_commandrestore_commandrecovery_target_timeline=latest).

To manage backup and recovery, we're using pg_probackup.

We have a 3-node cluster managed by Patroni on Raft.

While running our cluster, we encounter a recurring problem: after a failover, the former leader cannot recover neither via the tools provided by PostgreSQL nor with the Patroni tools (pg_rewind).

We've seen a number of different WAL reading errors. For example:

·         "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or

·         "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".

There have been some other variations but all such errors have the same cause:

1.       At startup, the former leader does not immediately get access to the backup. As a result, the recovery process uses the local WAL, which inevitably leads to a split brain since in most cases there is a lag in replication after a failover.

2.       Next, the former leader finally manages to read the latest log from the backup, and that log has a higher priority than the local log, even though the recovery with the local log has already occurred.

As a result, we see the following:

"new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150".

When pg_rewind is run, it also uses the log from the backup (the lagging log from the new leader) instead of the partial log with which the former leader has already been started.

Next, pg_rewind returns errors while reading the log from the backup back, looking for the last checkpoint, which is quite reasonable because, once a new leader starts, the point of divergence normally ends up in the next timeline and the previous timeline's backup log does not have a block with the LSN of the divergence.

Also, in this case, it is impossible to build a map of the pages that have been changed to roll back the changes using pg_rewind.

 

To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?

 

pgsql-general by date:

Previous
From: Gianni Ceccarelli
Date:
Subject: Re: Syntax error when combining --set and --command has me stumped
Next
From: Bryn Llewellyn
Date:
Subject: Re: « The PL/pgSQL interpreter parses the function's source text and produces an internal binary instruction tree... »