Re: [Bug Fix]standby may crash when switching-over in certain special cases - Mailing list pgsql-hackers

From px shi
Subject Re: [Bug Fix]standby may crash when switching-over in certain special cases
Date
Msg-id CAAccyYLbDnZgaXNeU3dz7nO0qtJRAT9p=ud4UEukpgiEqwT1wA@mail.gmail.com
In response to Re: [Bug Fix]standby may crash when switching-over in certain special cases  (Yugo Nagata <nagata@sraoss.co.jp>)
List pgsql-hackers

Hi, 


I inserted the following code at walsender.c:2509 (v15.8) to reproduce the issue:

{
        WalSnd *walsnd = MyWalSnd;

        SpinLockAcquire(&walsnd->mutex);
        if (walsnd->flush % wal_segment_size == 0 &&
            walsnd->sentPtr == walsnd->flush &&
            walsnd->flush > wal_segment_size)
        {
                for (int i = 0; i < max_wal_senders; i++)
                {
                        WalSnd *walsnder = &WalSndCtl->walsnds[i];

                        if (walsnder->pid == walsnd->pid)
                                continue;
                        SpinLockAcquire(&walsnder->mutex);
                        if (walsnder->pid == 0)
                        {
                                SpinLockRelease(&walsnder->mutex);
                                continue;
                        }
                        if (walsnder->flush % wal_segment_size == 0 &&
                            walsnder->flush > wal_segment_size)
                                elog(PANIC, "simulate primary crashed, flushedlsn %X/%X",
                                     LSN_FORMAT_ARGS(walsnder->flush));
                        SpinLockRelease(&walsnder->mutex);
                }
        }
        SpinLockRelease(&walsnd->mutex);
}


Regards,

Pixian Shi


px shi <spxlyy123@gmail.com> wrote on Thu, Oct 10, 2024 at 15:01:
> Although I've not tested it because I don't have a good way to reproduce the problem,

I used GDB to reproduce the issue: I killed the primary node with kill -9 while the standby's flushedlsn was at the beginning of a WAL segment.



> However, this causes the standby to request streaming from the requested location again, even if it already has valid data. It could be wasteful, for example, when applying a WAL record is delayed due to a conflict with recovery or recovery_min_apply_delay.

I had overlooked this problem; thanks for pointing that out.

I think comparing walrcv->receiveStart with recptr would be a better approach: if walrcv->receiveStart is greater than recptr, reset walrcv->flushedUpto. This way, we prevent fetching WAL data from the primary when the standby already has valid data. A new patch has been submitted.


Regards,

Pixian Shi

 

Yugo NAGATA <nagata@sraoss.co.jp> wrote on Wed, Oct 9, 2024 at 14:46:
On Mon, 30 Sep 2024 15:14:54 +0800
px shi <spxlyy123@gmail.com> wrote:

> Thanks for responding.
>
>
> > It is odd that the standby server crashes when
> > replication fails because the standby would keep retrying to get the
> > next record even in such a case.
>
> As I mentioned earlier, when replication fails, it retries to establish
> streaming replication. At this point, the value of walrcv->flushedUpto is
> not necessarily the data actually flushed to disk. However, the startup
> process mistakenly believes that the latest flushed LSN is
> walrcv->flushedUpto and attempts to open the corresponding WAL file,
> which doesn't exist, leading to a file open failure and causing the startup
> process to PANIC.

Although I've not tested it because I don't have a good way to reproduce the problem,
the patch seems to fix the issue by always rewinding walrcv->flushedUpto to the requested
WAL location when streaming is requested to start. However, this causes the standby to
request streaming from that location again, even if it already has valid data. It could be
wasteful, for example, when applying a WAL record is delayed due to a conflict with
recovery or recovery_min_apply_delay.

It might be better if it could notice that the requested record is not in the primary's
timeline before requesting the streaming, but I don't have a good solution for now.

Regards,
Yugo Nagata

> Regards,
> Pixian Shi
>
> Yugo NAGATA <nagata@sraoss.co.jp> wrote on Mon, Sep 30, 2024 at 13:47:
>
> > On Wed, 21 Aug 2024 09:11:03 +0800
> > px shi <spxlyy123@gmail.com> wrote:
> >
> > > Yugo Nagata <nagata@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at 00:49:
> > >
> > > >
> > > >
> > > > > Is s1 a cascading standby of s2? If s1 and s2 are instead both
> > > > > standbys of the primary server, it is not surprising that s2 has
> > > > > progressed farther than s1 when the primary fails. I believe this
> > > > > is the case where you should use pg_rewind. Even if flushedUpto is
> > > > > reset as proposed in your patch, s2 might already have applied a
> > > > > WAL record that s1 has not processed yet, and there would be no
> > > > > guarantee that subsequent applies succeed.
> > > >
> > > >
> > > Thank you for your response. In my scenario, s1 and s2 are both
> > > standbys of the primary server, with s1 a synchronous standby and s2
> > > an asynchronous standby. You mentioned that if s2's replay progress is
> > > ahead of s1's, pg_rewind should be used. However, what I'm trying to
> > > address is an issue where s2 crashes during replay after s1 has been
> > > promoted to primary, even though s2's progress hasn't surpassed s1's.
> >
> > I understood your point. It is odd that the standby server crashes when
> > replication fails, because the standby would keep retrying to get the
> > next record even in such a case.
> >
> > Regards,
> > Yugo Nagata
> >
> > >
> > > Regards,
> > > Pixian Shi
> >
> >
> > --
> > Yugo NAGATA <nagata@sraoss.co.jp>
> >


--
Yugo NAGATA <nagata@sraoss.co.jp>
