Re: [HACKERS] make async slave to wait for lsn to be replayed - Mailing list pgsql-hackers

From: Alexander Korotkov
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Msg-id: CAPpHfdvG04w43XbHihkasBKv4X40itcSu_O6M0wWbckC6jCLaA@mail.gmail.com
In response to: Re: [HACKERS] make async slave to wait for lsn to be replayed (Alexander Korotkov <aekorotkov@gmail.com>)
List: pgsql-hackers

On Sat, Aug 10, 2024 at 6:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:
> > On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:
> > > The 0001 patch is intended to improve this situation.  Actually, it's
> > > not right to just put RecoveryInProgress() after
> > > GetXLogReplayRecPtr(), because more wal could be replayed between
> > > these calls.  Instead we need to recheck GetXLogReplayRecPtr() after
> > > getting negative result of RecoveryInProgress() because WAL replay
> > > position couldn't get updated after.
> > > 0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
> > > 0003 patch comprises tests for pg_wal_replay_wait() errors.
> >
> > Before adding more tests, could it be possible to stabilize what's in
> > the tree?  drongo has reported one failure with the recovery test
> > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54
>
> I'm currently running a 043_wal_replay_wait test in a loop of drongo.
> No failures during more than 10 hours.  As I pointed in [1] it seems
> that test stuck somewhere on launching BackgroundPsql.  Given that
> drongo have some strange failures from time to time (for instance [2]
> or [3]), I doubt there is something specifically wrong in
> 043_wal_replay_wait test that caused the subject failure.

With the help of Andrew Dunstan, I ran 043_wal_replay_wait.pl in a loop
for two days, then the whole test suite, also for two days.  I haven't
seen any failures.  I don't see the point in running more experiments,
because Andrew needs to bring drongo back online as a buildfarm
member.  It may be that something exceptional happened on drongo
(such as an inability to launch a new process).  For now, I
think the reasonable strategy is to wait and see whether something
similar happens again on the buildfarm.
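
For readers following the 0001 discussion quoted above, the check order it
describes can be sketched roughly as below.  This is only an illustration
under stated assumptions, not the patch itself: sim_get_replay_lsn() and
sim_recovery_in_progress() are hypothetical stand-ins for
GetXLogReplayRecPtr() and RecoveryInProgress(), and the simulated state,
the enum, and check_target_lsn() are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* simplified stand-in for the server's LSN type */

typedef enum
{
    WAIT_LSN_REACHED,           /* target LSN has been replayed */
    WAIT_LSN_NOT_IN_RECOVERY,   /* recovery ended before the target was replayed */
    WAIT_LSN_MUST_WAIT          /* still in recovery, target not replayed yet */
} WaitLsnCheck;

/* Simulated server state (hypothetical, for illustration only). */
static XLogRecPtr sim_replay_lsn;
static XLogRecPtr sim_final_lsn;
static bool sim_promote_on_check;

/* Set the starting replay position, the final one, and whether to simulate promotion. */
void
sim_reset(XLogRecPtr start, XLogRecPtr final, bool promote)
{
    assert(final >= start);
    sim_replay_lsn = start;
    sim_final_lsn = final;
    sim_promote_on_check = promote;
}

/* Hypothetical stand-in for GetXLogReplayRecPtr(). */
XLogRecPtr
sim_get_replay_lsn(void)
{
    return sim_replay_lsn;
}

/*
 * Hypothetical stand-in for RecoveryInProgress().  When promotion is
 * simulated, replay advances to its final position before this returns
 * false, modelling WAL replayed between the two calls.
 */
bool
sim_recovery_in_progress(void)
{
    if (sim_promote_on_check)
    {
        sim_replay_lsn = sim_final_lsn;
        return false;
    }
    return true;
}

WaitLsnCheck
check_target_lsn(XLogRecPtr target)
{
    if (target <= sim_get_replay_lsn())
        return WAIT_LSN_REACHED;

    if (!sim_recovery_in_progress())
    {
        /*
         * Recovery has ended, so the replay position can no longer move.
         * Recheck it once before reporting an error.
         */
        if (target <= sim_get_replay_lsn())
            return WAIT_LSN_REACHED;
        return WAIT_LSN_NOT_IN_RECOVERY;
    }

    return WAIT_LSN_MUST_WAIT;
}
```

In this simulation, a target that gets replayed in the window just before
promotion is reported as reached rather than as an error, which is the
case a single RecoveryInProgress() check placed after the first
GetXLogReplayRecPtr() call would miss.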

------
Regards,
Alexander Korotkov
Supabase


