RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Synchronizing slots from primary to standby
Date
Msg-id OS0PR01MB571667D165D51E2805FFD85094052@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Thursday, April 11, 2024 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, April 4, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > BTW, while thinking on this one, I
> > > noticed that in the function LogicalConfirmReceivedLocation(), we
> > > first update the disk copy, see comment [1] and then in-memory
> > > whereas the same is not true in
> > > update_local_synced_slot() for the case when snapshot exists. Now,
> > > do we have the same risk here in case of standby? Because I think we
> > > will use these xmins while sending the feedback message (in
> XLogWalRcvSendHSFeedback()).
> > >
> > > * We have to write the changed xmin to disk *before* we change
> > > * the in-memory value, otherwise after a crash we wouldn't know
> > > * that some catalog tuples might have been removed already.
> >
> > Yes, I think we have the risk on the standby, I can reproduce the case
> > that if the server crashes after updating the in-memory value and
> > before saving them to disk, the synced slot could be invalidated after
> > restarting from crash, because the necessary rows have been removed on
> > the primary. The steps can be found in [1].
> >
> > I think we'd better fix the order in update_local_synced_slot() as
> > well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix
> > another issue in this thread. Since they are touching the same function, so
> attach them together for review.
> >
> 
> Few comments:
> ===============
> 1.
> +
> + /* Sanity check */
> + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
> + ereport(LOG, errmsg("synchronized confirmed_flush for slot \"%s\"
> + differs from
> remote slot",
> +    remote_slot->name),
> 
> Is there a reason to use elevel as LOG instead of ERROR? I think it should be
> elog(ERROR, .. as this is an unexpected case.

Agreed.

> 
> 2.
> - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
>   elog(ERROR,
>   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> 
> Can we be more specific in this message? How about splitting it into
> error_message as "cannot synchronize local slot \"%s\"" and then errdetail as
> "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's
> LSN(%X/%X)"?

Your version looks better. Since the above two messages all have errdetail, I
used the style of ereport(ERROR, errmsg_internal(), errdetail_internal()... in
the patch which is equal to the elog(ERROR but has an additional detail message.

Here is V5 patch set.

Best Regards,
Hou zj

Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: type in basebackup_incremental.c ?
Next
From: Kirill Reshke
Date:
Subject: Re: Allow non-superuser to cancel superuser tasks.