Re: [Bug Fix]standby may crash when switching-over in certain special cases - Mailing list pgsql-hackers

From px shi
Subject Re: [Bug Fix]standby may crash when switching-over in certain special cases
Date
Msg-id CAAccyYKXRVSmfC-YYdPbgsZfPiK_Tk4RLggxWs8UETxfKD7kRA@mail.gmail.com
In response to Re: [Bug Fix]standby may crash when switching-over in certain special cases  (Yugo NAGATA <nagata@sraoss.co.jp>)
Responses Re: [Bug Fix]standby may crash when switching-over in certain special cases
List pgsql-hackers
Thanks for responding.
 
> It is odd that the standby server crashes when
> replication fails because the standby would keep retrying to get the
> next record even in such a case.

As I mentioned earlier, when replication fails, the standby retries establishing streaming replication. At that point, walrcv->flushedUpto does not necessarily reflect the data actually flushed to disk. However, the startup process mistakenly treats walrcv->flushedUpto as the latest flushed LSN and attempts to open the corresponding WAL file. That file doesn't exist, so the open fails and the startup process PANICs.
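
To make the failure mode concrete, here is a minimal sketch in Python. It is a simplified model, not PostgreSQL's actual code. It assumes the default 16 MB WAL segment size and the usual segment file naming (timeline, then the segment number split into two 8-digit hex fields). The LSN values, the timeline ID, the set of files on disk, and the helper names (parse_lsn, wal_file_name, segment_is_readable) are all hypothetical and chosen only for illustration.

```python
# Simplified model: map an LSN to the WAL segment file that would hold it,
# then compare a stale flushedUpto with what is really on disk.
# Assumption: default wal_segment_size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOG_ID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text):
    """Convert an LSN written as 'X/Y' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn):
    """Return the name of the WAL segment file that contains this LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOG_ID,
        segno % SEGMENTS_PER_XLOG_ID,
    )


def segment_is_readable(timeline, believed_flushed_lsn, segments_on_disk):
    """Model of the check: does the file implied by the LSN actually exist?"""
    return wal_file_name(timeline, believed_flushed_lsn) in segments_on_disk


# Hypothetical state on the standby (values are illustrative only).
actually_flushed = parse_lsn("0/3000028")    # what really reached disk
stale_flushed_upto = parse_lsn("0/5000060")  # leftover shared-memory value
segments_on_disk = {wal_file_name(1, actually_flushed)}

print(wal_file_name(1, actually_flushed))
print(segment_is_readable(1, actually_flushed, segments_on_disk))
print(segment_is_readable(1, stale_flushed_upto, segments_on_disk))
```

In this model the stale value points into a segment that was never written locally, so the lookup fails. That corresponds to the file open failure described above.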

Regards,
Pixian Shi

Yugo NAGATA <nagata@sraoss.co.jp> wrote on Mon, Sep 30, 2024 at 13:47:
On Wed, 21 Aug 2024 09:11:03 +0800
px shi <spxlyy123@gmail.com> wrote:

> Yugo Nagata <nagata@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at 00:49:
>
> >
> >
> > > Is s1 a cascading standby of s2? If instead s1 and s2 are each standbys of
> > > the primary server, it is not surprising that s2 has progressed farther
> > > than s1 when the primary fails. I believe that this is the case where you
> > > should use pg_rewind. Even if flushedUpto is reset as proposed in your
> > > patch, s2 might already have applied a WAL record that s1 has not
> > > processed yet, and there would be no guarantee that subsequent applies
> > > succeed.
> >
> >
>  Thank you for your response. In my scenario, s1 and s2 are both standbys of
> the primary server; s1 is a synchronous standby and s2 is an
> asynchronous standby. You mentioned that if s2's replay progress is ahead
> of s1, pg_rewind should be used. However, what I'm trying to address is an
> issue where s2 crashes during replay after s1 has been promoted to primary,
> even though s2's progress hasn't surpassed s1.

I understood your point. It is odd that the standby server crashes when
replication fails because the standby would keep retrying to get the
next record even in such a case.

Regards,
Yugo Nagata

>
> Regards,
> Pixian Shi


--
Yugo NAGATA <nagata@sraoss.co.jp>
