RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted - Mailing list pgsql-bugs

From egashira.yusuke@fujitsu.com
Subject RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted
Date
Msg-id OSBPR01MB5029194FC5CAA1400DF842C9FF7A9@OSBPR01MB5029.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-bugs
Thanks for replying.

> This is the same issue with what is discussed in [1] and recently
> reported by [2].
>
> [1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com
>
> [2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

By reading the suggested thread, I understood that the cause of this problem is that the standby instance was promoted
withsegment boundary WAL record only partial transferred. 
So the primary server's full WAL storage was not necessarily a factor.

Just the other day, a similar crash occurred in the same customer when an intermediate standby in a cascaded
replicationwas promoted even though the WAL was not full in primary. 
I think it's the same issue because the logged WAL position at the down point was just the segment boundary.


I applied the suggested threads patch [1] and [2] to master (055fee7eb4dcc78e58672aef146334275e1cc40d) and tested it.

However, with applying the patch [1] only, this abort issue was not solved when I performed my reproduce procedure.

On the other hand, the suggested thread [2]'s patch seems to solve this issue.
With applying the patch [2], the cascade standby server's abort no longer occurred.
The logfile also shows that it prevented the abort while same condition.
I attach the logfile of NODE-C. (debug5)

2021-04-02 19:20:27.430 JST [110914] DEBUG:  sending write 0/6000000 flush 0/6000000 apply 0/5FFFFE0 (reply requested)
2021-04-02 19:20:27.430 JST [110914] DEBUG:  sendtime 2021-04-02 19:20:27.430755+09 receipttime 2021-04-02
19:20:27.430776+09replication apply delay 91585 ms transfer latency 1 ms 
2021-04-02 19:20:35.932 JST [110914] LOG:  replication terminated by primary server
2021-04-02 19:20:35.932 JST [110914] DETAIL:  End of WAL reached on timeline 1 at 0/6000000.
2021-04-02 19:20:35.932 JST [110914] LOG:  fetching timeline history file for timeline 2 from primary server
2021-04-02 19:20:35.943 JST [110914] DEBUG:  walreceiver ended streaming and awaits new instructions
2021-04-02 19:20:35.943 JST [110910] LOG:  new target timeline is 2
2021-04-02 19:20:35.943 JST [110910] DEBUG:  switched WAL source from stream to archive after failure
2021-04-02 19:20:35.943 JST [110910] DEBUG:  could not open file "pg_wal/000000020000000000000006": No such file or
directory
2021-04-02 19:20:35.943 JST [110910] DEBUG:  switched WAL source from archive to stream after failure
2021-04-02 19:20:35.943 JST [110914] LOG:  restarted WAL streaming at 0/6000000 on timeline 2
2021-04-02 19:20:35.943 JST [110914] DEBUG:  sendtime 2021-04-02 19:20:35.943432+09 receipttime 2021-04-02
19:20:35.94346+09replication apply delay 100098 ms transfer latency 1 ms 
2021-04-02 19:20:35.943 JST [110914] DEBUG:  creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG:  done creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG:  sending write 0/6000038 flush 0/5FFFFE0 apply 0/5FFFFE0
2021-04-02 19:20:36.033 JST [110914] DEBUG:  sending write 0/6000038 flush 0/6000038 apply 0/5FFFFE0

=> After the timeline changing, re-reading of the old segment file has occurred, but it has not cause the abort.

A patch in thread [2] appears to be required to fix this issue. I hope that this patch will be committed.

Regards.
Yusuke, Egashira.




pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16951: pg_restore segfaults on custom format piped from a different version of PG
Next
From: Radoslav Nedyalkov
Date:
Subject: standby fails with out-of-order XID insertion upon start