Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uDnufNGVxZ1ZGkZJj0GE-WVYjEyiwAteFNKp_xRxv3M+Q@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (Peter Smith <smithpb2250@gmail.com>) |
List | pgsql-hackers |
On Tue, Dec 19, 2023 at 4:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v48-0002 > Thanks for reviewing. Most of these are addressed in v50. Please find my comments for the rest. > ====== > doc/src/sgml/config.sgml > > 1. > + If slot synchronization is enabled then it is also necessary to > + specify <literal>dbname</literal> in the > + <varname>primary_conninfo</varname> string. This will only > be used for > + slot synchronization. It is ignored for streaming. > > I felt the "If slot synchronization is enabled" part should also > include an xref to the enable_slotsync GUC, otherwise there is no > information here about how to enable it. > > SUGGESTION > If slot synchronization is enabled (see XXX) .... > > ====== > doc/src/sgml/logicaldecoding.sgml > > 2. > + <para> > + The ability to resume logical replication after failover depends upon the > + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>sync_state</structfield> > + value for the synchronized slots on the standby at the time of failover. > + Only slots that have attained "ready" sync_state ('r') on the standby > + before failover can be used for logical replication after failover. Slots > + that have not yet reached 'r' state (they are still 'i') will be dropped, > + therefore logical replication for those slots cannot be resumed. For > + example, if the synchronized slot could not become sync-ready on standby > + due to a disabled subscription, then the subscription cannot be resumed > + after failover even when it is enabled. > + If the primary is idle, then the synchronized slots on the standby may > + take a noticeable time to reach the ready ('r') sync_state. This can > + be sped up by calling the > + <function>pg_log_standby_snapshot</function> function on the primary. > + </para> > > 2a. > /sync-ready on standby/sync-ready on the standby/ > > ~ > > 2b. > Should "If the primary is idle" be in a new paragraph? > > ====== > doc/src/sgml/system-views.sgml > > 3. > + <para> > + The hot standby can have any of these sync_state values for the slots but > + on a hot standby, the slots with state 'r' and 'i' can neither be used > + for logical decoding nor dropped by the user. > + The sync_state has no meaning on the primary server; the primary > + sync_state value is default 'n' for all slots but may (if leftover > + from a promoted standby) also be 'r'. > + </para></entry> > > I still feel we are exposing too much useless information about the > primary server values. > > Isn't it sufficient to just say "The sync_state values have no meaning > on a primary server.", and not bother to mention what those > meaningless values might be -- e.g. if they are meaningless then who > cares what they are or how they got there? > I am retaining the original info till we find a better place for it as suggested by Amit in [1] > ====== > src/backend/replication/logical/slotsync.c > > 4. synchronize_one_slot > > + /* Slot ready for sync, so sync it. */ > + if (sync_state == SYNCSLOT_STATE_READY) > + { > + /* > + * Sanity check: With hot_standby_feedback enabled and > + * invalidations handled appropriately as above, this should never > + * happen. > + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > + elog(ERROR, > + "not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > + > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || > + remote_slot->restart_lsn != slot->data.restart_lsn || > + remote_slot->catalog_xmin != slot->data.catalog_xmin) > + { > + /* Update LSN of slot to remote slot's current position */ > + local_slot_update(remote_slot); > + ReplicationSlotSave(); > + slot_updated = true; > + } > + } > + /* Slot not ready yet, let's attempt to make it sync-ready now. */ > + else if (sync_state == SYNCSLOT_STATE_INITIATED) > + { > + /* > + * Wait for the primary server to catch-up. Refer to the comment > + * atop the file for details on this wait. > + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + slot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) > + { > + ReplicationSlotRelease(); > + return false; > + } > + } > + > + /* > + * Wait for primary is over, update the lsns and mark the slot as > + * READY for further syncs. > + */ > + local_slot_update(remote_slot); > + SpinLockAcquire(&slot->mutex); > + slot->data.sync_state = SYNCSLOT_STATE_READY; > + SpinLockRelease(&slot->mutex); > + > + /* Save the changes */ > + ReplicationSlotMarkDirty(); > + ReplicationSlotSave(); > + slot_updated = true; > + > + ereport(LOG, > + errmsg("newly locally created slot \"%s\" is sync-ready now", > + remote_slot->name)); > + } > > 4a. > It would be more natural in the code if you do the > SYNCSLOT_STATE_INITIATED logic before the SYNCSLOT_STATE_READY because > that is the order those states come in. > > ~ > > 4b. > I'm not sure if it is worth it, but I was thinking that some duplicate > code can be avoided by doing if/if instead of if/else > > if (sync_state == SYNCSLOT_STATE_INITIATED) > { > .. > } > if (sync_state == SYNCSLOT_STATE_READY) > { > } > > By arranging it this way maybe the SYNCSLOT_STATE_INITIATED code block > doesn't need to do anything except update the sync_state = > SYNCSLOT_STATE_READY; Then it can just fall through to the > SYNCSLOT_STATE_READY logic to do all the > local_slot_update(remote_slot); etc in just one place. > We want to mark the slot as sync-ready once initial sync is over (i.e. confirmed_lsn != NULL). But if we try to optimize as above, we will end up marking it as sync-read before initial-sync itself in local_slot_update() which does not sound like a good idea. > ~~~ > > 5. check_primary_info > > + * Checks the primary server info. > + * > + * Using the specified primary server connection, check whether we > are cascading > + * standby. It also validates primary_slot_name for non-cascading-standbys. > + */ > +static void > +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > 5a. > /we are cascading/we are a cascading/ > > 5b. > /non-cascading-standbys./non-cascading standbys./ > > ~~~ > > 6. > + CommitTransactionCommand(); > + > + *am_cascading_standby = false; > > Maybe it's simpler just to set this default false up-front, replacing > the current assert. > > BEFORE: > + Assert(am_cascading_standby != NULL); > > AFTER: > *am_cascading_standby = false; /* maybe overwrite later */ > Sure, moved default false up-front. But do we need to replace assert? I think assert is needed to make sure we are not accessing null-pointer later. > ~~~ > > 7. > +/* > + * Exit the slot sync worker with given exit-code. > + */ > +static void > +slotsync_worker_exit(const char *msg, int code) > +{ > + ereport(LOG, errmsg("%s", msg)); > + proc_exit(code); > +} > > This could be written differently (don't pass the exit code, instead > pass a bool) like: > > static void > slotsync_worker_exit(const char *msg, bool restart_worker) > > By doing it this way, you can keep the special exit code values (0,1) > within this function where you can comment all about them instead of > having scattered comments about exit codes in the callers. > > SUGGESTION > ereport(LOG, errmsg("%s", msg)); > /* <some big comment here about how the code causes the worker to > restart or not> */ > proc_exit(restart_worker ? 1 : 0); > I have removed slotsync_worker_exit() function as suggested by Amit in [1]. Thus few of the suggestions (7,8,10) are no longer valid relevant. > ~~~ > > 8. slotsync_reread_config > > + if (restart) > + { > + char *msg = "slot sync worker will restart because of a parameter change"; > + > + /* > + * The exit code 1 will make postmaster restart the slot sync worker. > + */ > + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); > + } > > Shouldn't that message be written as _(), so that it will get translated? > > SUGGESTION > slotsync_worker_exit(_("slot sync worker will restart because of a > parameter change"), true /* restart worker */ ); > > ~~~ > > 9. ProcessSlotSyncInterrupts > > + CHECK_FOR_INTERRUPTS(); > + > + if (ShutdownRequestPending) > + { > + char *msg = "replication slot sync worker is shutting down on > receiving SIGINT"; > + > + walrcv_disconnect(wrconn); > + > + /* > + * The exit code 0 means slot sync worker will not be restarted by > + * postmaster. > + */ > + slotsync_worker_exit(msg, 0 /* proc_exit code */ ); > + } > > Shouldn't that message be written as _(), so that it will be translated? > > SUGGESTION > slotsync_worker_exit(_("replication slot sync worker is shutting down > on receiving SIGINT"), false /* don't restart worker */ ); > > ~~~ > > 10. > +/* > + * Cleanup function for logical replication launcher. > + * > + * Called on logical replication launcher exit. > + */ > +static void > +slotsync_worker_onexit(int code, Datum arg) > +{ > + SpinLockAcquire(&SlotSyncWorker->mutex); > + SlotSyncWorker->pid = InvalidPid; > + SpinLockRelease(&SlotSyncWorker->mutex); > +} > > IMO it would make sense for this function to be defined adjacent to > the slotsync_worker_exit() function. > > ~~~ > > 11. ReplSlotSyncWorkerMain > > + /* > + * Using the specified primary server connection, check whether we are > + * cascading standby and validates primary_slot_name for > + * non-cascading-standbys. > + */ > + check_primary_info(wrconn, &am_cascading_standby); > ... > + /* Recheck if it is still a cascading standby */ > + if (am_cascading_standby) > + check_primary_info(wrconn, &am_cascading_standby); > > Those 2 above calls could be combined if you want. By defaulting the > am_cascading_standby = true when declared, then you could put this > code at the top of the loop instead of having the same code in 2 > places: > > + if (am_cascading_standby) > + check_primary_info(wrconn, &am_cascading_standby); I am not very sure about this change. Yes, as you stated logic-wise it could be combined. But the current flow looks more neat while reading the code. Initializing 'am_cascading_standby' as TRUE could be slightly confusing for the reader. > ====== > src/include/commands/subscriptioncmds.h > > 12. > #include "parser/parse_node.h" > +#include "replication/walreceiver.h" > > There is #include, but no other code change. Is this needed? > > ====== > src/include/replication/slot.h > > 13. > + /* > + * Synchronization state for a logical slot. > + * > + * The standby can have any value among the possible values of 'i','r' and > + * 'n'. For primary, the default is 'n' for all slots but may also be 'r' > + * if leftover from a promoted standby. > + */ > + char sync_state; > + > > All that is OK now, but I keep circling back to my original thought > that since this state has no meaning for the primary server then > > a) why do we even care what potential values it might have there, and > b) isn't it better to call this field 'standby_sync_state' to > emphasize it only has meaning for the standby? > > e.g. > SUGGESTION > /* > * Synchronization state for a logical slot. > * > * The standby can have any value among the possible values of 'i','r' and > * 'n'. For the primary, this field value has no meaning. > */ > char standby_sync_state; > 'sync_state' still looks a better choice to me (discussed with others too offline). If we get more objections to this name, I can consider changing this. [1]: https://www.postgresql.org/message-id/CAA4eK1Kh2cj5vjknAxibpp8Dn%2BjjVwT%2BF7oMPT1P861s_ZrDXQ%40mail.gmail.com thanks Shveta
pgsql-hackers by date: