RE: Unable to start replica after failover - Mailing list pgsql-general
From | Lahnov, Igor |
---|---|
Subject | RE: Unable to start replica after failover |
Date | |
Msg-id | ecfe8dcccdae4efb9262f633ad0070d6@nexign.com Whole thread Raw |
In response to | Re: Unable to start replica after failover (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: Unable to start replica after failover
|
List | pgsql-general |
We know what the problem is, but don't know how to solve it correctly. After failover, *new leader* promoted and read local partial log to LSN 54E/FB348118 -> 2022-05-23 01:47:52.124 [12088] LOG: record with incorrect prev-link 0/100 at 54E/FB348118 2022-05-23 01:47:52.124 [12088] LOG: redo done at 54E/FB3480F0 system usage: CPU: user: 321.21 s, system: 144.77 s, elapsed:354977.44 s rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 54E/FB348118, prev 54E/FB3480F0, desc: END_OF_RECOVERY tli 3; prev tli 2; time2022-05-23 01:47:53.342100 2022-05-23 01:47:52.900 [12088] LOG: selected new timeline ID: 3 2022-05-23 01:47:53.658 [12084] LOG: database system is ready to accept connections <- The *previous leader* started after a failure and read the local partial log to LSN 54E/FB348150 -> 2022-05-23 01:50:12.945 [122891] LOG: redo starts at 54E/F97F4878 2022-05-23 01:50:13.010 [122891] LOG: invalid record length at 54E/FB348150: wanted 24, got 0 2022-05-23 01:50:13.011 [122891] LOG: consistent recovery state reached at 54E/FB348150 2022-05-23 01:50:13.012 [122883] LOG: database system is ready to accept read-only connections <- Our 'restore_command' on *previous leader* restores a partial file from archive (from *new leader*) -> 2022-05-23 01:50:14 [123730]: [1-1]: INFO: pg_probackup archive-get WAL file: 000000020000054E000000FB, remote: ssh, threads:1/1, batch: 1 <- And this file has a higher priority than the original local partial file on the * previous leader*. And this leads to theproblem: --- 2022-05-23 01:50:14.448 [122891] LOG: new timeline 3 forked off current database system timeline 2 before current recoverypoint 54E/FB348150 --- And pg_rewind tries to use this file (from *new leader*) to build a map of pages that have changed since the last checkpoint --- 2022-05-23 01:51:32,202 INFO: Lock owner: pg51; I am pg01 2022-05-23 01:51:32,209 INFO: Local timeline=2 lsn=54E/FB348150 2022-05-23 01:51:32,252 INFO: master_timeline=3 2022-05-23 01:51:32,254 INFO: master: history=1 53E/FD65D298 no recovery target specified 2 54E/FB348118 no recovery target specified 2022-05-23 01:51:32,499 INFO: running pg_rewind from pg51 2022-05-23 01:51:32,574 INFO: running pg_rewind from dbname=postgres user=postgres host= IP port=5432 target_session_attrs=read-write 2022-05-23 01:51:32,640 INFO: pg_rewind exit code=1 2022-05-23 01:51:32,640 INFO: stdout= 2022-05-23 01:51:32,640 INFO: stderr=pg_rewind: servers diverged at WAL location 54E/FB348118 on timeline 2 pg_rewind: fatal: could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB,offset 3448832 --- We checked this log (from *new leader*), and when direct reading, the output like this: --- record with incorrect prev-link 0/100 at 54E/FB348118 ---- when backward reading: --- 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000020000054E000000FB, offset 3448832 --- Now we have disabled restore_command, but I think it's not best solution. What the correct solution in this case?
pgsql-general by date: