Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Synchronizing slots from primary to standby
Msg-id: CAA4eK1+kLSPrkUGL=hCgyyHNK4oNhnXKvN-dFLjcXMJGEayOUQ@mail.gmail.com
In response to: Re: Synchronizing slots from primary to standby ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
List: pgsql-hackers
On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >
> > I feel the WaitForWALToBecomeAvailable may not be the best place to
> > shutdown slotsync worker and drop slots. There could be other reasons
> > (other than promotion) as mentioned in comments in case XLOG_FROM_STREAM
> > to reach the code there. I thought if the intention is to stop slotsync
> > workers on promotion, maybe FinishWalRecovery() is a better place to do
> > it as it's indicating the end of recovery and XLogShutdownWalRcv is also
> > called in it.
>
> I can see that slotsync_drop_initiated_slots() has been moved in
> FinishWalRecovery() in v35. That looks ok.
>

I was thinking: what if we just avoid creating such slots (the ones that
require the init state) in the first place? That can be time-consuming in
some cases, but it would reduce the complexity, and we can always improve
such cases later if we really encounter them in the real world. I am not
very sure the added complexity is worth addressing this particular case, so
I would like to know your and others' opinions.

More review for v35-0002*
============================
1.
+ ereport(WARNING,
+         errmsg("skipping slots synchronization as primary_slot_name "
+                "is not set."));

There is no need for a full stop at the end of WARNING messages, and, as
previously mentioned, let's not split message lines in such cases. There
are other messages in the patch with similar problems; please fix those as
well.

2.
+slotsync_checks()
{
...
...
+ /* The hot_standby_feedback must be ON for slot-sync to work */
+ if (!hot_standby_feedback)
+ {
+     ereport(WARNING,
+             errmsg("skipping slots synchronization as hot_standby_feedback "
+                    "is off."));

This message has the same problem as mentioned in the previous comment.
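To illustrate comments 1 and 2, the corrected calls could look something
like the following (a sketch only; the final message wording is of course
up to the patch author):

```c
/* No trailing full stop, and the translatable string kept on one line */
ereport(WARNING,
		errmsg("skipping slots synchronization as primary_slot_name is not set"));

ereport(WARNING,
		errmsg("skipping slots synchronization as hot_standby_feedback is off"));
```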
Additionally, I think either atop slotsync_checks() or along with the GUC
check we should add comments explaining why we expect these values to be
set for slot sync to work.

3.
+ /* The worker is running already */
+ if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use
+     && SlotSyncWorker->hdr.proc)

The spacing for both &&'s has problems. You need a space after the first
&&, and the second && should be at the end of the prior line.

4.
+ LauncherRereadConfig(&recheck_slotsync);
+
 }

An empty line after LauncherRereadConfig() is not required.

5.
+static void
+LauncherRereadConfig(bool *ss_recheck)
+{
+    char *conninfo = pstrdup(PrimaryConnInfo);
+    char *slotname = pstrdup(PrimarySlotName);
+    bool syncslot = enable_syncslot;
+    bool feedback = hot_standby_feedback;

Can we change the variable name 'feedback' to 'standbyfeedback' to make it
slightly more descriptive?

6.
The logic to recheck the slot-sync related parameters in LauncherMain() is
not very clear. IIUC, if any parameter has changed after the config reload,
we seem to check only the validity of the changed parameter but do not
restart the slot-sync worker. Is that correct? If so, what if dbname is
changed? Don't we need to restart the slot-sync worker and re-initialize
the connection? Similarly, a slotname change also needs some thought. Also,
if all the parameters are valid, we seem to re-launch the slot-sync worker
without first stopping it, which doesn't seem correct. Am I missing
something in this logic?

7.
@@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn,
   errmsg("replication slot \"%s\" was not created in this database",
          NameStr(slot->data.name))));

+ in_recovery = RecoveryInProgress();
+
+ /*
+  * Do not allow consumption of a "synchronized" slot until the standby
+  * gets promoted. Also do not allow consumption of slots with sync_state
+  * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be
+  * used.
+  */
+ if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) ||
+     slot->data.sync_state == SYNCSLOT_STATE_INITIATED)
+     ereport(ERROR,
+             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+              errmsg("cannot use replication slot \"%s\" for logical decoding",
+                     NameStr(slot->data.name)),
+              in_recovery ?
+              errdetail("This slot is being synced from the primary server.") :
+              errdetail("This slot was not synced completely from the primary server."),
+              errhint("Specify another replication slot.")));
+

If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the
time of promotion, don't we need just an Assert or elog(ERROR, ...) for the
non-recovery case, since such cases won't be reachable? If so, I think we
can separate out that case here.

8.
wait_for_primary_slot_catchup()
{
...
+ /* Check if this standby is promoted while we are waiting */
+ if (!RecoveryInProgress())
+ {
+     /*
+      * The remote slot didn't pass the locally reserved position at
+      * the time of local promotion, so it's not safe to use.
+      */
+     ereport(WARNING,
+             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+              errmsg("slot-sync wait for slot %s interrupted by promotion, "
+                     "slot creation aborted", remote_slot->name)));
+     pfree(cmd.data);
+     return false;
+ }
...
}

Shouldn't this be an Assert, because a slot-sync worker shouldn't exist for
non-standby servers?

9.
wait_for_primary_slot_catchup()
{
...
+ slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+ if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot))
+ {
+     ereport(WARNING,
+             (errmsg("slot \"%s\" disappeared from the primary server,"
+                     " slot creation aborted", remote_slot->name)));
+     pfree(cmd.data);
+     walrcv_clear_result(res);
+     return false;

If the slot on the primary disappears, shouldn't this part of the code
somehow ensure the slot on the standby is removed as well? If that is taken
care of at some other point, then at least we should write a comment here
stating how.
I think this comment also applies to a few other checks following this one.

10.
+ /*
+  * It is possible to get null values for lsns and xmin if slot is
+  * invalidated on the primary server, so handle accordingly.
+  */
+ new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull));

We can say LSN and Xmin in the above comment to make it easier to
read/understand.

11.
 /*
+ * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin
+ * are expected to be valid/non-null, so assert if found null.
+ */

No need to explicitly mention the assert; that is clear from the code. We
can slightly change this comment to: "Once we have a valid restart_lsn,
confirmed_lsn and catalog_xmin are expected to be valid/non-null."

12.
+ if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn ||
+     TransactionIdPrecedes(remote_slot->catalog_xmin,
+                           MyReplicationSlot->data.catalog_xmin))
+ {
+     if (!wait_for_primary_slot_catchup(wrconn, remote_slot))
+     {
+         /*
+          * The remote slot didn't catch up to locally reserved position.
+          * But still persist it and attempt the wait and sync in next
+          * sync-cycle.
+          */
+         if (MyReplicationSlot->data.persistency != RS_PERSISTENT)
+         {
+             ReplicationSlotPersist();
+             *slot_updated = true;
+         }

I think the reason to persist in this case is that next time the local
restart_lsn can be ahead of the current location, and it can take more time
to create such a slot. We can probably mention that in the comments.

--
With Regards,
Amit Kapila.