RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | OS0PR01MB5716114A49E6DB9020766D6E94BBA@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
|
List | pgsql-hackers |
On Saturday, November 18, 2023 6:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > I feel the WaitForWALToBecomeAvailable may not be the best place to > > > shutdown slotsync worker and drop slots. There could be other > > > reasons(other than > > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to > > > reach the code there. I thought if the intention is to stop slotsync > > > workers on promotion, maybe FinishWalRecovery() is a better place to > > > do it as it's indicating the end of recovery and XLogShutdownWalRcv is also > called in it. > > > > I can see that slotsync_drop_initiated_slots() has been moved in > > FinishWalRecovery() in v35. That looks ok. > > > > > > More Review for v35-0002* Thanks for the comments. > ============================ > 1. > + ereport(WARNING, > + errmsg("skipping slots synchronization as primary_slot_name " > + "is not set.")); > > There is no need to use a full stop at the end for WARNING messages and as > previously mentioned, let's not split message lines in such cases. There are > other messages in the patch with similar problems, please fix those as well. Adjusted. > > 2. > +slotsync_checks() > { > ... > ... > + /* The hot_standby_feedback must be ON for slot-sync to work */ if > + (!hot_standby_feedback) { ereport(WARNING, errmsg("skipping slots > + synchronization as hot_standby_feedback " > + "is off.")); > > This message has the same problem as mentioned in the previous comment. > Additionally, I think either atop slotsync_checks or along with GUC check we > should write comments as to why we expect these values to be set for slot sync > to work. Added comments for these cases. > > 3. > + /* The worker is running already */ > + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use && > + SlotSyncWorker->hdr.proc) > > The spacing for both the &&'s has problems. You need a space after the first > && and the second && should be in the prior line. Adjusted. > > 4. > + LauncherRereadConfig(&recheck_slotsync); > + > } > > An empty line after LauncherRereadConfig() is not required. > > 5. > +static void > +LauncherRereadConfig(bool *ss_recheck) > +{ > + char *conninfo = pstrdup(PrimaryConnInfo); > + char *slotname = pstrdup(PrimarySlotName); > + bool syncslot = enable_syncslot; > + bool feedback = hot_standby_feedback; > > Can we change the variable name 'feedback' to 'standbyfeedback' to make it > slightly more descriptive? Changed. > > 6. The logic to recheck the slot_sync related parameters in > LauncherMain() is not very clear. IIUC, if after reload config any parameter is > changed, we just seem to be checking the validity of the changed parameter > but not restarting the slot sync worker, is that correct? If so, what if dbname is > changed, don't we need to restart the slot-sync worker and re-initialize the > connection; similarly slotname change also needs some thoughts. Also, if all the > parameters are valid we seem to be re-launching the slot-sync worker without > first stopping it which doesn't seem correct, am I missing something in this > logic? I think the slot sync worker will be stopped in LauncherRereadConfig() if GUC changed and new slot sync worker will be started in next loop in LauncherMain(). > 7. > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn, > errmsg("replication slot \"%s\" was not created in this database", > NameStr(slot->data.name)))); > > + in_recovery = RecoveryInProgress(); > + > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. Also do not allow consumption of slots with > + sync_state > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be > + * used. > + */ > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), in_recovery ? > + errdetail("This slot is being synced from the primary server.") : > + errdetail("This slot was not synced completely from the primary > + server."), errhint("Specify another replication slot."))); > + > > If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the > time of promotion, don't we need to just have an assert or elog(ERROR, .. for > non-recovery cases as such cases won't be reachable? If so, I think we can > separate out that case here. Adjusted the codes as suggested. > > 8. > wait_for_primary_slot_catchup() > { > ... > + /* Check if this standby is promoted while we are waiting */ if > + (!RecoveryInProgress()) { > + /* > + * The remote slot didn't pass the locally reserved position at > + * the time of local promotion, so it's not safe to use. > + */ > + ereport( > + WARNING, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg( > + "slot-sync wait for slot %s interrupted by promotion, " > + "slot creation aborted", remote_slot->name))); pfree(cmd.data); return > + false; } > ... > } > > Shouldn't this be an Assert because a slot-sync worker shouldn't exist for > non-standby servers? Changed to Assert. > > 9. > wait_for_primary_slot_catchup() > { > ... > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) { > + ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary > + server," > + " slot creation aborted", remote_slot->name))); pfree(cmd.data); > + walrcv_clear_result(res); return false; > > If the slot on primary disappears, shouldn't this part of the code somehow > ensure to remove the slot on standby as well? If it is taken at some other point > in time then at least we should write a comment here to state how it is taken > care of. I think this comment also applies to a few other checks following this > check. I adjusted the code here to not persist the slots if the slot disappeared or invalidated on primary, so that the local slot will get dropped when releasing. > > 10. > + /* > + * It is possible to get null values for lsns and xmin if slot is > + * invalidated on the primary server, so handle accordingly. > + */ > + new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull)); > > We can say LSN and Xmin in the above comment to make it easier to > read/understand. Changed. > > 11. > /* > + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin > + * are expected to be valid/non-null, so assert if found null. > + */ > > No need to explicitly say about assert, it is clear from the code. We can slightly > change this comment to: "Once we got valid restart_lsn, then confirmed_lsn > and catalog_xmin are expected to be valid/non-null." Changed. > > 12. > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + MyReplicationSlot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) { > + /* > + * The remote slot didn't catch up to locally reserved position. > + * But still persist it and attempt the wait and sync in next > + * sync-cycle. > + */ > + if (MyReplicationSlot->data.persistency != RS_PERSISTENT) { > + ReplicationSlotPersist(); *slot_updated = true; } > > I think the reason to persist in this case is because next time local restart_lsn can > be ahead than the current location and it can take more time to create such a > slot. We can probably mention the same in the comments. Updated the comments. Here is the V37 patch set which addressed comments above and [1]. [1] https://www.postgresql.org/message-id/CAA4eK1%2BP9R3GO2rwGBg2EOh%3DuYjWUSEOHD8yvs4Je8WYa2RHag%40mail.gmail.com Best Regards, Hou zj
Attachment
pgsql-hackers by date: