Re: [HACKERS] make async slave to wait for lsn to be replayed - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] make async slave to wait for lsn to be replayed
Date
Msg-id CAPpHfdvGRssjqwX1+idm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] make async slave to wait for lsn to be replayed  (Kartyshov Ivan <i.kartyshov@postgrespro.ru>)
Responses Re: [HACKERS] make async slave to wait for lsn to be replayed
List pgsql-hackers
Hi, Ivan!

On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:
>
> Hi, Alexander, Here, I made some improvements according to your
> discussion with Heikki.
>
> On 2024-04-11 18:09, Alexander Korotkov wrote:
> > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote:
> >> In a nutshell, it's possible for the loop in WaitForLSN to exit
> >> without
> >> cleaning up the process from the heap. I was able to hit that by
> >> adding
> >> a delay after the addLSNWaiter() call:
> >>
> >> > TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
> >> > postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
> >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
> >> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
> >> > postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
> >> > postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
> >> > postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]
> >>
> >> I think there's a similar race condition if the timeout is reached at
> >> the same time that the startup process wakes up the process.
> >
> > Thank you for catching this.  I think WaitForLSN() just needs to call
> > deleteLSNWaiter() unconditionally after exit from the loop.
>
> Fix and add injection point test on this race condition.
>
> > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote:
> >> The docs could use some-copy-editing, but just to point out one issue:
> >>
> >> > There are also procedures to control the progress of recovery.
> >>
> >> That's copy-pasted from an earlier sentence at the table that lists
> >> functions like pg_promote(), pg_wal_replay_pause(), and
> >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
> >> the
> >> progress of recovery like those functions do, it only causes the
> >> calling
> >> backend to wait.
>
> Fix documentation and add extra tests on multi-standby replication
> and cascade replication.

Thank you for the revised patch.

I see couple of items which are not addressed in this revision.
 * As Heikki pointed, that it's currently not possible in one round
trip to call call pg_wal_replay_wait() and do other job.  The attached
patch addresses this.  It milds the requirement.  Now, it's not
necessary to be in atomic context.  It's only necessary to have no
active snapshot.  This new requirement works for me so far.  I
appreciate a feedback on this.
 * As Alvaro pointed, multiple waiters case isn't covered by the test
suite.  That leads to no coverage of some code paths.  The attached
patch addresses this by adding a test case with multiple waiters.

The rest looks good to me.

Links.
1. https://www.postgresql.org/message-id/d1303584-b763-446c-9409-f4516118219f%40iki.fi
2.https://www.postgresql.org/message-id/202404051815.eri4u5q6oj26%40alvherre.pgsql

------
Regards,
Alexander Korotkov
Supabase

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: 回复: An implementation of multi-key sort
Next
From: Kohei KaiGai
Date:
Subject: assertion failure at cost_memoize_rescan()