Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Synchronizing slots from primary to standby
Date:
Msg-id: CAJpy0uB9SfgFnOekSnysfJTToyWQMGXcwu5803HFREEW_FDT0A@mail.gmail.com
In response to: Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Synchronizing slots from primary to standby
List: pgsql-hackers
On Mon, Dec 18, 2023 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Sorry, I missed attaching the patch. PFA v48.
> >
>
> Few comments on v48_0002
> ========================

Thanks for reviewing. These are addressed in v50. Please find my
comments inline for some of these.

> 1.
> +static void
> +slotsync_reread_config(WalReceiverConn *wrconn)
> {
> ...
> + pfree(old_primary_conninfo);
> + pfree(old_primary_slotname);
> +
> + if (restart)
> + {
> + char *msg = "slot sync worker will restart because of a parameter change";
> +
> + /*
> + * The exit code 1 will make postmaster restart the slot sync worker.
> + */
> + slotsync_worker_exit(msg, 1 /* proc_exit code */ );
> + }
> ...
>
> I don't see the need to explicitly pfree in case we are already
> exiting the process because anyway the memory will be released. We can
> avoid using the 'restart' variable for this.

I have moved the pfree calls to the end, into the path where we do not
exit the worker, and removed the 'restart' variable.

> Also, probably, directly
> exiting here makes sense and at another place where this function is
> used. I see that in maybe_reread_subscription(), we exit with a 0 code
> and still apply worker restarts, so why use a different exit code
> here?
>

A logical rep worker is started by the logical rep launcher, which has
its own logic for restarting it. OTOH, the slot sync worker is started
by the postmaster, and the postmaster restarts one of its bgworkers only
if the worker had an abnormal exit and a restart_time was given when the
worker was registered. Thus we need the non-zero exit code here. I have
removed the newly added function, though.
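To illustrate the mechanism being relied on here (only a rough sketch of
the generic bgworker behaviour, not code from the patch; the entry-point
name and the restart interval below are made up):

/*
 * Sketch only: a worker registered with a non-zero bgw_restart_time is
 * re-launched by the postmaster after an abnormal exit, so a restart
 * has to be requested with a non-zero exit code.
 */
BackgroundWorker worker;

memset(&worker, 0, sizeof(worker));
worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
	BGWORKER_BACKEND_DATABASE_CONNECTION;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = 10;	/* seconds; hypothetical value */
snprintf(worker.bgw_library_name, sizeof(worker.bgw_library_name),
		 "postgres");
snprintf(worker.bgw_function_name, sizeof(worker.bgw_function_name),
		 "SlotSyncWorkerMain");	/* hypothetical entry point */
snprintf(worker.bgw_name, sizeof(worker.bgw_name), "slot sync worker");
snprintf(worker.bgw_type, sizeof(worker.bgw_type), "slot sync worker");
RegisterBackgroundWorker(&worker);

/* Later, inside the worker, when a parameter change requires a restart: */
ereport(LOG,
		(errmsg("slot sync worker will restart because of a parameter change")));
proc_exit(1);	/* 1 => restart after bgw_restart_time; 0 would unregister the worker */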
> 2.
> +static void
> +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)
> {
> ...
> + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + /* No need to check further, return that we are cascading standby */
> + if (remote_in_recovery)
> + {
> + *am_cascading_standby = true;
> + CommitTransactionCommand();
> + return;
> ...
> }
>
> Don't we need to clear the result and tuple in case of early return?

Yes, it was needed. Modified.

>
> 3. It would be a good idea to mention about requirements like a
> physical slot on primary, hot_standby_feedback, etc. in the commit
> message.
>
> 4.
> +static bool
> +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> + bool *wait_attempts_exceeded)
> {
> ...
> + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
> + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was not found on the primary server")));
> ...
> + /*
> + * It is possible to get null values for LSN and Xmin if slot is
> + * invalidated on the primary server, so handle accordingly.
> + */
> + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull));
> + if (new_invalidated || isnull)
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was invalidated on the primary server")));
> ...
> }
>
> a. The errdetail message should end with a full stop. Please check all
> other errdetail messages in the patch to follow the same guideline.
> b. I think saying slot creation aborted is not completely correct
> because we would have created the slot especially when it is in 'i'
> state. Can we change it to something like: "aborting initial sync for
> slot \"%s\""?
> c. Also, if the remote_slot is invalidated, ideally, we can even drop
> the local slot but it seems that the patch will drop the same before
> the next-sync cycle with any other slot that needs to be dropped. If
> so, can we add the comment to indicate the same?

I have added comments. Basically, the slot is dropped in the caller only
if it is still in the 'RS_EPHEMERAL' state; if it is already persisted,
it is kept as is but marked as invalidated in the caller, and its sync is
skipped from the next cycle onwards.

>
> 5.
> +static void
> +local_slot_update(RemoteSlot *remote_slot)
> +{
> + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
> +
> + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
> + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
> + remote_slot->catalog_xmin);
> + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
> + remote_slot->restart_lsn);
> +
> + SpinLockAcquire(&MyReplicationSlot->mutex);
> + MyReplicationSlot->data.invalidated = remote_slot->invalidated;
> + SpinLockRelease(&MyReplicationSlot->mutex);
> ...
> ...
>
> If required, the invalidated flag is updated in the caller as well, so
> why do we need to update it here as well?
>

It was needed by the path where the slot does not exist yet and we need
to create a new one. I have now moved the invalidation check to the
caller; we do not create the slot at all if the remote_slot is already
found invalidated at the beginning. And if it gets invalidated in the
middle of the wait logic, it will be dropped by ReplicationSlotRelease()
(a rough sketch of this caller-side handling is at the end of this mail).

> --
> With Regards,
> Amit Kapila.
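PS: To make the last point a bit more concrete, below is a rough sketch
of the caller-side handling for an invalidated remote slot. This is
illustrative only and not the actual patch code; the structure and the
surrounding variable names are assumptions.

if (remote_slot->invalidated != RS_INVAL_NONE)
{
	if (MyReplicationSlot->data.persistency == RS_PERSISTENT)
	{
		/*
		 * Already-persisted slot: keep it, but remember the invalidation
		 * so that its sync is skipped from the next cycle onwards.
		 */
		SpinLockAcquire(&MyReplicationSlot->mutex);
		MyReplicationSlot->data.invalidated = remote_slot->invalidated;
		SpinLockRelease(&MyReplicationSlot->mutex);
		ReplicationSlotMarkDirty();
		ReplicationSlotSave();
	}

	/*
	 * An RS_EPHEMERAL slot (creation still in progress) needs no explicit
	 * drop here: ReplicationSlotRelease() removes ephemeral slots
	 * automatically.
	 */
	ReplicationSlotRelease();
	return;
}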