Re: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted - Mailing list pgsql-bugs

From Kyotaro Horiguchi
Subject Re: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted
Date
Msg-id 20210329.113457.933340007488906032.horikyota.ntt@gmail.com
In response to RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted  ("egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com>)
Responses RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted  ("egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com>)
List pgsql-bugs
Hello.

(Sorry for the noise, but I have added some people to Cc:)

This is the same issue as the one discussed in [1] and recently
reported in [2].

[1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com

[2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com


At Thu, 25 Mar 2021 00:23:52 +0000, "egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com> wrote in 
> > The replication between "NODE-A" and "NODE-B" is synchronous replication,
> > and between "NODE-B" and "NODE-C" is asynchronous.
> > 
> > "NODE-A" <-[synchronous]-> "NODE-B" <-[non-synchronous]-> "NODE-C"
> > 
> > When the primary server "NODE-A" crashed due to full WAL storage and
> > "NODE-B" promoted, the downstream standby server "NODE-C" aborted with
> > following messages.
> > 
> > 2021-03-11 11:26:28.470 JST [85228] LOG:  invalid contrecord length 26 at
> > 0/5FFFFF0
> > 2021-03-11 11:26:28.470 JST [85232] FATAL:  terminating walreceiver process
> > due to administrator command
> > 2021-03-11 11:26:28.470 JST [85228] PANIC:  could not open file
> > "pg_wal/000000020000000000000005": No such file or directory
> > 2021-03-11 11:26:28.492 JST [85260] LOG:  started streaming WAL from primary
> > at 0/5000000 on timeline 2
> > 2021-03-11 11:26:29.260 JST [85227] LOG:  startup process (PID 85228) was
> > terminated by signal 6: Aborted
> 
> I would like to clarify the conditions under which this "abort" occurred to explain to the customer.
> 
> Based on the result of pg_waldump, I think that the conditions are as follows.
> 
> 1) A partially written (across the following segment files) WAL record is recorded at the end of the WAL segment file, and
> 2) The WAL segment file of 1) is the last WAL segment file that standby server received, and
> 3) The standby server promoted.
> 
> I think that the above conditions will be met only when the primary server crashed due to full WAL storage.
> 
> Is my idea correct?

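The diagnosis looks correct to me, but the cause of the crash is
irrelevant.  Any crash of the primary that leaves a record continuing
into the next segment only partially written can lead to this.  A full
disk just makes the crash land precisely on that vulnerable point.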
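
To relate the positions in the quoted log to segment files, here is a
rough sketch in Python. It is only an illustration, not code from the
PostgreSQL tree: it assumes the default 16MB WAL segment size, and the
helper names (parse_lsn, wal_file_name, bytes_left_in_segment) are
made up for this mail. It shows that 0/5FFFFF0 on timeline 2 maps to
the file named in the PANIC message, and that the position is only 16
bytes before the end of that segment.

```python
# Assumption: default wal_segment_size of 16MB; adjust if the cluster differs.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'X/Y' (both parts hex) into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline: int, lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    # Number of segments covered by one 4GB "xlogid" (the middle 8 hex digits).
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def bytes_left_in_segment(lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Bytes between the given LSN and the end of its segment."""
    return seg_size - (parse_lsn(lsn) % seg_size)


if __name__ == "__main__":
    print(wal_file_name(2, "0/5FFFFF0"))       # 000000020000000000000005
    print(bytes_left_in_segment("0/5FFFFF0"))  # 16
```

With so little room left in the segment, whatever is written at that
position has to continue into the next segment file, which matches
condition 1) above.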

-- 
Kyotaro Horiguchi
NTT Open Source Software Center