Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Synchronizing slots from primary to standby
Date:
Msg-id: CAJpy0uB9SfgFnOekSnysfJTToyWQMGXcwu5803HFREEW_FDT0A@mail.gmail.com
In response to: Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Synchronizing slots from primary to standby
List: pgsql-hackers
On Mon, Dec 18, 2023 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Sorry, I missed attaching the patch. PFA v48.
> >
>
> Few comments on v48_0002
> ========================

Thanks for reviewing. These are addressed in v50. Please find my
comments inline for some of these.

> 1.
> +static void
> +slotsync_reread_config(WalReceiverConn *wrconn)
> {
> ...
> + pfree(old_primary_conninfo);
> + pfree(old_primary_slotname);
> +
> + if (restart)
> + {
> + char *msg = "slot sync worker will restart because of a parameter change";
> +
> + /*
> + * The exit code 1 will make postmaster restart the slot sync worker.
> + */
> + slotsync_worker_exit(msg, 1 /* proc_exit code */ );
> + }
> ...
>
> I don't see the need to explicitly pfree in case we are already
> exiting the process because anyway the memory will be released. We can
> avoid using the 'restart' variable for this.

I have moved the pfree calls to the end, into the path where we do not
exit the worker, and removed the 'restart' variable.

> Also, probably, directly
> exiting here makes sense and at another place where this function is
> used. I see that in maybe_reread_subscription(), we exit with a 0 code
> and still apply worker restarts, so why use a different exit code
> here?
>

A logical rep worker is started by the logical rep launcher, which has
its own logic for restarting it. OTOH, the slot sync worker is started
by the postmaster, and the postmaster restarts one of its bgworkers only
if the worker had an abnormal exit and a restart_time was given when the
worker was registered. Thus we need the non-zero exit code here. I have
removed the newly added function, though.
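To illustrate the mechanism being relied on here (only a rough sketch of
the generic bgworker behaviour, not code from the patch; the entry-point
name and the restart interval below are made up):

/*
 * Sketch only: a worker registered with a non-zero bgw_restart_time is
 * re-launched by the postmaster after an abnormal exit, so a restart
 * has to be requested with a non-zero exit code.
 */
BackgroundWorker worker;

memset(&worker, 0, sizeof(worker));
worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
	BGWORKER_BACKEND_DATABASE_CONNECTION;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = 10;	/* seconds; hypothetical value */
snprintf(worker.bgw_library_name, sizeof(worker.bgw_library_name),
		 "postgres");
snprintf(worker.bgw_function_name, sizeof(worker.bgw_function_name),
		 "SlotSyncWorkerMain");	/* hypothetical entry point */
snprintf(worker.bgw_name, sizeof(worker.bgw_name), "slot sync worker");
snprintf(worker.bgw_type, sizeof(worker.bgw_type), "slot sync worker");
RegisterBackgroundWorker(&worker);

/* Later, inside the worker, when a parameter change requires a restart: */
ereport(LOG,
		(errmsg("slot sync worker will restart because of a parameter change")));
proc_exit(1);	/* 1 => restart after bgw_restart_time; 0 would unregister the worker */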
> 2.
> +static void
> +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)
> {
> ...
> + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + /* No need to check further, return that we are cascading standby */
> + if (remote_in_recovery)
> + {
> + *am_cascading_standby = true;
> + CommitTransactionCommand();
> + return;
> ...
> }
>
> Don't we need to clear the result and tuple in case of early return?

Yes, it was needed. Modified.

>
> 3. It would be a good idea to mention about requirements like a
> physical slot on primary, hot_standby_feedback, etc. in the commit
> message.
>
> 4.
> +static bool
> +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> + bool *wait_attempts_exceeded)
> {
> ...
> + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
> + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was not found on the primary server")));
> ...
> + /*
> + * It is possible to get null values for LSN and Xmin if slot is
> + * invalidated on the primary server, so handle accordingly.
> + */
> + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull));
> + if (new_invalidated || isnull)
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was invalidated on the primary server")));
> ...
> }
>
> a. The errdetail message should end with a full stop. Please check all
> other errdetail messages in the patch to follow the same guideline.
> b. I think saying slot creation aborted is not completely correct
> because we would have created the slot especially when it is in 'i'
> state. Can we change it to something like: "aborting initial sync for
> slot \"%s\""?
> c. Also, if the remote_slot is invalidated, ideally, we can even drop
> the local slot but it seems that the patch will drop the same before
> the next-sync cycle with any other slot that needs to be dropped. If
> so, can we add the comment to indicate the same?

I have added comments. Basically, the slot is dropped in the caller only
if it is still in the 'RS_EPHEMERAL' state; if it is already persisted,
it is kept as is but marked as invalidated in the caller, and its sync is
skipped from the next cycle onwards.

>
> 5.
> +static void
> +local_slot_update(RemoteSlot *remote_slot)
> +{
> + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
> +
> + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
> + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
> + remote_slot->catalog_xmin);
> + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
> + remote_slot->restart_lsn);
> +
> + SpinLockAcquire(&MyReplicationSlot->mutex);
> + MyReplicationSlot->data.invalidated = remote_slot->invalidated;
> + SpinLockRelease(&MyReplicationSlot->mutex);
> ...
> ...
>
> If required, the invalidated flag is updated in the caller as well, so
> why do we need to update it here as well?
>

It was needed by the path where the slot does not exist yet and we need
to create a new one. I have now moved the invalidation check to the
caller; we do not create the slot at all if the remote_slot is already
found invalidated at the beginning. And if it gets invalidated in the
middle of the wait logic, it will be dropped by ReplicationSlotRelease()
(a rough sketch of this caller-side handling is at the end of this mail).

> --
> With Regards,
> Amit Kapila.
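PS: To make the last point a bit more concrete, below is a rough sketch
of the caller-side handling for an invalidated remote slot. This is
illustrative only and not the actual patch code; the structure and the
surrounding variable names are assumptions.

if (remote_slot->invalidated != RS_INVAL_NONE)
{
	if (MyReplicationSlot->data.persistency == RS_PERSISTENT)
	{
		/*
		 * Already-persisted slot: keep it, but remember the invalidation
		 * so that its sync is skipped from the next cycle onwards.
		 */
		SpinLockAcquire(&MyReplicationSlot->mutex);
		MyReplicationSlot->data.invalidated = remote_slot->invalidated;
		SpinLockRelease(&MyReplicationSlot->mutex);
		ReplicationSlotMarkDirty();
		ReplicationSlotSave();
	}

	/*
	 * An RS_EPHEMERAL slot (creation still in progress) needs no explicit
	 * drop here: ReplicationSlotRelease() removes ephemeral slots
	 * automatically.
	 */
	ReplicationSlotRelease();
	return;
}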