Thread: Re: [Bug Fix] standby may crash when switching over in certain special cases
On Mon, 19 Aug 2024 16:43:09 +0800
px shi <spxlyy123@gmail.com> wrote:

> Hi, hackers,
>
> I've recently encountered an issue where a standby crashes when
> reconnecting to a new primary after a switchover under certain conditions.
> Here's the procedure of the crash scenario:
>
> 1) We have three instances: one primary and two standbys (s1 and s2, both
> using streaming replication).
>
> 2) The primary crashed when the standbys' `flushed_lsn` was at the
> beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
> following:
>
> ```
> FATAL: could not connect to the primary server...
> LOG: waiting for WAL to become available at 0/220000ED
> ```
>
> 3) s1 was promoted to the new primary; s1 logged the following:
>
> ```
> LOG: received promote request
> LOG: redo done at 0/21FFFEE8
> LOG: selected new timeline ID: 2
> ```
>
> 4) s2's `primary_conninfo` was updated to point to s1; s2 logged the
> following:
>
> ```
> LOG: received SIGHUP, reloading configuration files
> LOG: parameter "primary_conninfo" changed to ...
> ```
>
> 5) s2 began replication with s1 and attempted to fetch `0/22000000` on
> timeline 2; s2 logged the following:
>
> ```
> LOG: fetching timeline history file for timeline 2 from primary server
> FATAL: could not start WAL streaming: ERROR: requested starting point
> 0/22000000 on timeline 1 is not in this server's history
> DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
> LOG: started streaming WAL from primary at 0/22000000 on timeline 2
> ```
>
> 6) A WAL record mismatch caused the walreceiver process to terminate; s2
> logged the following:
>
> ```
> LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
> FATAL: terminating walreceiver process due to administrator command
> ```
>
> 7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
> startup process tried to open the WAL file before it had been created,
> leading to a crash:
>
> ```
> PANIC: could not open file "pg_wal/000000020000000000000021": No such
> file or directory
> LOG: startup process was terminated by signal 6: Aborted
> ```
>
> In this scenario, s2 attempts replication twice. First, it starts from
> `0/22000000` on timeline 2, setting `walrcv->flushedUpto` to `0/22000000`.
> But when a mismatch occurs, the walreceiver process is terminated. On the
> second attempt, replication starts from `0/21000000` on timeline 2. The
> startup process expects the WAL file to exist because WalRcvFlushRecPtr()
> returns `0/22000000`, but the file is not found, causing s2's startup
> process to crash.
>
> I think recptr and walrcv->flushedUpto should be checked in
> RequestXLogStreaming, so I have created a patch for it.

Is s1 a cascading standby of s2? If instead s1 and s2 are each standbys of
the primary server, it is not surprising that s2 has progressed farther
than s1 when the primary fails. I believe this is a case where you should
use pg_rewind. Even if flushedUpto is reset as proposed in your patch, s2
might already have applied a WAL record that s1 has not processed yet, and
there would be no guarantee that subsequent applies succeed.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
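[For readers following along in the source: the check proposed in the report
above would live in RequestXLogStreaming() in
src/backend/replication/walreceiverfuncs.c, which the startup process calls to
hand the walreceiver its streaming start position. The following is only a
minimal sketch of the idea, assuming the WalRcvData fields of recent
PostgreSQL releases; it is an illustration of the proposal, not the submitted
patch.]

```
/*
 * Illustration only (not the submitted patch): when the startup process
 * requests streaming, make sure the walreceiver's advertised flush
 * position cannot point beyond the requested start, so the startup
 * process does not later try to open a WAL segment that was never
 * written locally.
 */
SpinLockAcquire(&walrcv->mutex);
if (walrcv->flushedUpto > recptr)
    walrcv->flushedUpto = recptr;
walrcv->receiveStart = recptr;
walrcv->receiveStartTLI = tli;
SpinLockRelease(&walrcv->mutex);
```

As the rest of the thread shows, the tricky part is deciding when the
previously advertised flush position can still be trusted.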
On Wed, 21 Aug 2024 09:11:03 +0800
px shi <spxlyy123@gmail.com> wrote:
> Yugo Nagata <nagata@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at 00:49:
>
> >
> >
> > > Is s1 a cascading standby of s2? If instead s1 and s2 are each
> > > standbys of the primary server, it is not surprising that s2 has
> > > progressed farther than s1 when the primary fails. I believe this is
> > > a case where you should use pg_rewind. Even if flushedUpto is reset
> > > as proposed in your patch, s2 might already have applied a WAL record
> > > that s1 has not processed yet, and there would be no guarantee that
> > > subsequent applies succeed.
> >
> >
> Thank you for your response. In my scenario, s1 and s2 are both standbys
> of the primary server, with s1 a synchronous standby and s2 an
> asynchronous standby. You mentioned that if s2's replay progress is ahead
> of s1, pg_rewind should be used. However, what I'm trying to address is
> an issue where s2 crashes during replay after s1 has been promoted to
> primary, even though s2's progress hasn't surpassed s1.
I understand your point. It is odd that the standby server crashes when
replication fails, because the standby would keep retrying to get the
next record even in such a case.
Regards,
Yugo Nagata
>
> Regards,
> Pixian Shi
--
Yugo NAGATA <nagata@sraoss.co.jp>
Hi,
I inserted the following code in walsender.c:2509 (v15.8) to reproduce the issue.
{
    WalSnd     *walsnd = MyWalSnd;

    SpinLockAcquire(&walsnd->mutex);

    /*
     * Only act when this walsender has sent and flushed WAL exactly up to
     * a segment boundary (and is past the first segment).
     */
    if (walsnd->flush % wal_segment_size == 0 &&
        walsnd->sentPtr == walsnd->flush &&
        walsnd->flush > wal_segment_size)
    {
        for (int i = 0; i < max_wal_senders; i++)
        {
            WalSnd     *walsnder = &WalSndCtl->walsnds[i];

            /* Skip this walsender itself. */
            if (walsnder->pid == walsnd->pid)
                continue;

            SpinLockAcquire(&walsnder->mutex);

            /* Skip unused slots. */
            if (walsnder->pid == 0)
            {
                SpinLockRelease(&walsnder->mutex);
                continue;
            }

            /*
             * If another walsender has also flushed exactly to a segment
             * boundary, simulate a primary crash at that point.
             */
            if (walsnder->flush % wal_segment_size == 0 &&
                walsnder->flush > wal_segment_size)
                elog(PANIC, "simulate primary crashed, flushedlsn %X/%X",
                     LSN_FORMAT_ARGS(walsnder->flush));

            SpinLockRelease(&walsnder->mutex);
        }
    }
    SpinLockRelease(&walsnd->mutex);
}
Regards,
Pixian Shi
> Although I've not tested it because I don't have a good way to reproduce
> the problem,

I used GDB to reproduce the issue by killing the primary node with kill -9
when the standby's flushed LSN was at the beginning of a WAL segment.

> However, this causes the standby to request the primary to stream data
> from the requested location again, even if the standby already has valid
> data. It could be wasteful, for example, when applying WAL records is
> delayed due to a conflict with recovery or recovery_min_apply_delay.

I had overlooked this problem, thanks for pointing that out.

I think that comparing walrcv->receiveStart with recptr would be a better
approach: if walrcv->receiveStart is greater than recptr, then reset
walrcv->flushedUpto. This way, we can prevent fetching WAL data from the
primary when the standby already has valid data. A new patch has been
submitted.
Regards,
Pixian Shi
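[A hypothetical sketch of the comparison described in the message above,
again inside RequestXLogStreaming(); field names follow WalRcvData, and this
is an illustration of the idea rather than the actual submitted patch.]

```
/*
 * Sketch of the refined idea: rewind flushedUpto only when the new
 * request starts before the previous streaming start position, i.e.
 * when the previously advertised flush position can no longer be
 * trusted.  Otherwise leave flushedUpto alone, so WAL the standby has
 * already received and flushed is not streamed from the primary again.
 */
SpinLockAcquire(&walrcv->mutex);
if (walrcv->receiveStart > recptr)
    walrcv->flushedUpto = recptr;
walrcv->receiveStart = recptr;
walrcv->receiveStartTLI = tli;
SpinLockRelease(&walrcv->mutex);
```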
Yugo NAGATA <nagata@sraoss.co.jp> wrote on Wed, Oct 9, 2024 at 14:46:

On Mon, 30 Sep 2024 15:14:54 +0800
px shi <spxlyy123@gmail.com> wrote:
> Thanks for responding.
>
>
> > It is odd that the standby server crashes when replication fails,
> > because the standby would keep retrying to get the next record even in
> > such a case.
>
>
> As I mentioned earlier, when replication fails, the standby retries to
> establish streaming replication. At this point, the value of
> walrcv->flushedUpto is not necessarily what has actually been flushed to
> disk. However, the startup process mistakenly believes that the latest
> flushed LSN is walrcv->flushedUpto and attempts to open the corresponding
> WAL file, which doesn't exist, leading to a file open failure and causing
> the startup process to PANIC.
Although I've not tested it because I don't have a good way to reproduce the
problem, the patch seems to fix the issue by always rewinding
walrcv->flushedUpto to the requested WAL location when streaming is requested
to start. However, this causes the standby to request the primary to stream
data from the requested location again, even if the standby already has valid
data. It could be wasteful, for example, when applying WAL records is delayed
due to a conflict with recovery or recovery_min_apply_delay.

It might be better if we could notice that the requested record is not in the
primary's timeline before requesting the streaming, but I don't have a good
solution for now.
Regards,
Yugo Nagata
> Regards,
> Pixian Shi
>
> Yugo NAGATA <nagata@sraoss.co.jp> wrote on Mon, Sep 30, 2024 at 13:47:
>
> > On Wed, 21 Aug 2024 09:11:03 +0800
> > px shi <spxlyy123@gmail.com> wrote:
> >
> > > Yugo Nagata <nagata@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at 00:49:
> > >
> > > >
> > > >
> > > > > Is s1 a cascading standby of s2? If instead s1 and s2 are each
> > > > > standbys of the primary server, it is not surprising that s2 has
> > > > > progressed farther than s1 when the primary fails. I believe this
> > > > > is a case where you should use pg_rewind. Even if flushedUpto is
> > > > > reset as proposed in your patch, s2 might already have applied a
> > > > > WAL record that s1 has not processed yet, and there would be no
> > > > > guarantee that subsequent applies succeed.
> > > >
> > > >
> > > Thank you for your response. In my scenario, s1 and s2 are both
> > > standbys of the primary server, with s1 a synchronous standby and s2
> > > an asynchronous standby. You mentioned that if s2's replay progress is
> > > ahead of s1, pg_rewind should be used. However, what I'm trying to
> > > address is an issue where s2 crashes during replay after s1 has been
> > > promoted to primary, even though s2's progress hasn't surpassed s1.
> >
> > I understand your point. It is odd that the standby server crashes when
> > replication fails, because the standby would keep retrying to get the
> > next record even in such a case.
> >
> > Regards,
> > Yugo Nagata
> >
> > >
> > > Regards,
> > > Pixian Shi
> >
> >
> > --
> > Yugo NAGATA <nagata@sraoss.co.jp>
> >
--
Yugo NAGATA <nagata@sraoss.co.jp>