Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAD21AoCoX+jhy_i3v+T2s78NG_0HH1oXOUiTOWhDdxVPBtDHKA@mail.gmail.com
In response to Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
RE: Synchronizing slots from primary to standby
List pgsql-hackers
On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > +static bool
> > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
> > {
> > ...
> > +    /* Slot ready for sync, so sync it. */
> > +    else
> > +    {
> > +        /*
> > +         * Sanity check: With hot_standby_feedback enabled and
> > +         * invalidations handled appropriately as above, this should never
> > +         * happen.
> > +         */
> > +        if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > +            elog(ERROR,
> > +                 "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > +                 " to remote slot's LSN(%X/%X) as synchronization"
> > +                 " would move it backwards", remote_slot->name,
> > +                 LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > +                 LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > ...
> > }
> >
> > I was thinking about the above code in the patch and as far as I can
> > think this can only occur if the same name slot is re-created with
> > prior restart_lsn after the existing slot is dropped. Normally, the
> > newly created slot (with the same name) will have higher restart_lsn
> > but one can mimic it by copying some older slot by using
> > pg_copy_logical_replication_slot().
> >
> > I don't think as mentioned in comments even if hot_standby_feedback is
> > temporarily set to off, the above shouldn't happen. It can only lead
> > to invalidated slots on standby.
> >
> > To close the above race, I could think of the following ways:
> > 1. Drop and re-create the slot.
> > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > ahead of local_slot's LSN then we can update it; but as mentioned in
> > your previous comment, we need to update all other fields as well. If
> > we follow this then we probably need to have a check for catalog_xmin
> > as well.
> >
>
> The second point as mentioned is slightly misleading, so let me try to
> rephrase it once again: Emit LOG/WARNING in this case and once
> remote_slot's LSN moves ahead of local_slot's LSN then we can update
> it; additionally, we need to update all other fields like two_phase as
> well. If we follow this then we probably need to have a check for
> catalog_xmin as well along remote_slot's restart_lsn.
>
> > Now, related to this the other case which needs some handling is what
> > if the remote_slot's restart_lsn is greater than local_slot's
> > restart_lsn but it is a re-created slot with the same name. In that
> > case, I think the other properties like 'two_phase', 'plugin' could be
> > different. So, is simply copying those sufficient or do we need to do
> > something else as well?
> >
>
> Bertrand, Dilip, Sawada-San, and others, please share your opinion on
> this problem as I think it is important to handle this race condition.

Is there any good use case for copying a failover slot in the first
place? If it's not a normal use case and we can probably live without
it, why not always disable failover during the copy? FYI we always
disable two_phase on copied slots. It seems to me that copying a
failover slot could lead to problems, as long as we synchronize slots
based on their names. IIUC without the copy, this case should never
happen.
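
To make the idea concrete, here is a rough standalone sketch. It is not
the patch code and does not use the real ReplicationSlot structures;
DemoSlot, DemoLSN, and the demo_* functions are hypothetical names made
up for illustration. It assumes the standby only considers slots that
have failover enabled, and it models a copy that always clears failover
in the same way two_phase is cleared:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a slot (not PostgreSQL's struct). */
typedef uint64_t DemoLSN;

typedef struct DemoSlot
{
    char        name[64];
    DemoLSN     restart_lsn;
    bool        two_phase;
    bool        failover;
} DemoSlot;

/*
 * Copy a slot under a new name. The copy keeps the source's restart_lsn,
 * but two_phase and failover are always reset to false.
 */
static DemoSlot
demo_copy_slot(const DemoSlot *src, const char *dst_name)
{
    DemoSlot    dst;

    assert(src != NULL && dst_name != NULL);

    dst = *src;
    snprintf(dst.name, sizeof(dst.name), "%s", dst_name);
    dst.two_phase = false;
    dst.failover = false;

    return dst;
}

/* Assumption: only failover-enabled slots are candidates for sync. */
static bool
demo_is_sync_candidate(const DemoSlot *remote)
{
    return remote->failover;
}

/* Same comparison as the sanity check quoted above. */
static bool
demo_sync_would_move_backwards(const DemoSlot *remote, const DemoSlot *local)
{
    return remote->restart_lsn < local->restart_lsn;
}
```

With this, a slot re-created by copying an older one could still have a
restart_lsn behind the standby's local slot of the same name, but since
the copy is no longer a failover slot, the sync code would skip it before
ever reaching the "would move it backwards" check.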

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


