Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAHut+PtyoRf3adoLoTrbL6momzkhXAFKz656Vv9YRu4cp=6Yig@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
|
List | pgsql-hackers |
Here are some review comments for v78-0001 ====== GENERAL 1. Should the "Chapter 30 Logical Replication" at least have another section that mentions the feature of slot synchronization so the information about it is easier to find? It doesn't need to say much -- just give a reference to the other sections where it is explained already. ====== Commit Message 2. A new 'synced' flag is introduced for replication slots, indicating whether the slot has been synchronized from the primary server. On a standby, synced slots cannot be dropped or consumed, and any attempt to perform logical decoding on them will result in an error. ~ It doesn't say *where* is this new 'synced' flag. ~~~ 3. The logical replication slots on the primary can be synchronized to the hot standby by enabling the failover option during slot creation and calling pg_sync_replication_slots() function on the standby. For the synchronization to work, it is mandatory to have a physical replication slot between the primary and the standby, hot_standby_feedback must be enabled on the standby and a valid dbname must be specified in primary_conninfo. ~ 3a. "by enabling the failover option during slot creation" -- Should you elaborate more about that part by mentioning the failover parameter of the create slot API, or the "failover" option of the CREATE SUBSCRIPTION? ~ 3b. I find it easy to read if the GUC parameters are quoted, but YMMV. /hot_standby_feedback/'hot_standby_feedback'/ /primary_conninfo/'primary_conninfo'/ ~~~ 4. If a logical slot on the primary is valid but is invalidated on the standby, then that slot is dropped and can be recreated on the standby in next pg_sync_replication_slots() call provided the slot still exists on the primary server. It is okay to recreate such slots as long as these are not consumable on the standby (which is the case currently). This situation may occur due to the following reasons: - The max_slot_wal_keep_size on the standby is insufficient to retain WAL records from the restart_lsn of the slot. - primary_slot_name is temporarily reset to null and the physical slot is removed. - The primary changes wal_level to a level lower than logical. ~ 4a. /and can be recreated/but will be recreated/ ~ 4b. (As before, I would quote the GUCs for easier readability) /max_slot_wal_keep_size/'max_slot_wal_keep_size'/ /primary_slot_name/'primary_slot_name'/ /'wal_level'/wal_level/ ====== doc/src/sgml/config.sgml 5. + <para> + To synchronize replication slots (see + <xref linkend="logicaldecoding-replication-slots-synchronization"/>), + it is also necessary to specify a valid <literal>dbname</literal> + in the <varname>primary_conninfo</varname> string. This will only be + used for slot synchronization. It is ignored for streaming. </para> Somehow, I thought the below wording is slightly better (and it also matches the linked section title). YMMV. /To synchronize replication slots/For replication slot synchronization/ ====== src/sgml/func.sgml 6. + <row> + <entry id="pg-sync-replication-slots" role="func_table_entry"><para role="func_signature"> + <indexterm> + <primary>pg_sync_replication_slots</primary> Currently, this is in section "9.27.6 Replication Management Functions", but I wondered if it should also have some mention in the "9.27.4. Recovery Control Functions" section. ====== doc/src/sgml/logicaldecoding.sgml 7. + <title>Replication Slot Synchronization</title> + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the <literal>failover</literal> option during slot + creation and calling <function>pg_sync_replication_slots</function> + on the standby. For the synchronization + to work, it is mandatory to have a physical replication slot between the + primary and the standby, and + <link linkend="guc-hot-standby-feedback"><varname>hot_standby_feedback</varname></link> + must be enabled on the standby. It is also necessary to specify a valid + <literal>dbname</literal> in the + <link linkend="guc-primary-conninfo"><varname>primary_conninfo</varname></link>. + </para> 7a. Should you elaborate more about the "enabling the failover option during slot creation" part by mentioning the failover parameter of the create slot API, or the "failover" option of the CREATE SUBSCRIPTION? ~ 7b. I think it will be better to include a link to the pg_sync_replication_slots function. ~~~ 8. + <para> + To resume logical replication after failover from the synced logical + slots, the subscription's 'conninfo' must be altered to point to the + new primary server. This is done using + <link linkend="sql-altersubscription-params-connection"><command>ALTER SUBSCRIPTION ... CONNECTION</command></link>. + It is recommended that subscriptions are first disabled before promoting + the standby and are enabled back after altering the connection string. + </para> /and are enabled back/and are re-enabled/ ====== src/backend/replication/logical/slotsync.c 9. + * This file contains the code for slot synchronization on a physical standby + * to fetch logical failover slots information from the primary server, create + * the slots on the standby and synchronize them periodically. IIUC there is no "periodically" logic in this patch 0001 anymore because that is now in a later patch, so this part of the comment maybe needs adjustment. ~~~ 10. + * While creating the slot on physical standby, if the local restart_lsn and/or + * local catalog_xmin is ahead of those on the remote then we cannot create the + * local slot in sync with the primary server because that would mean moving + * the local slot backwards and the standby might not have WALs retained for + * old LSN. In this case, the slot will be marked as RS_TEMPORARY. Once the + * primary server catches up, the slot will be marked as RS_PERSISTENT (which + * means sync-ready) and we can perform the sync periodically. 10a. The wording "While creating the slot [...] then we cannot create the local slot" sounds strange. Maybe it can be reworded like SUGGESTION If the physical standby restart_lsn and/or local catalog_xmin is ahead of those on the remote then we cannot create the local standby slot in sync with the primary server because... ~ 10b. /and we can perform the sync periodically./after which we can call pg_sync_replication_slots() periodically to perform syncs./ ~~~ 11. + * The slots that were synchronized will be dropped if they are currently not + * needed to be synchronized. SUGGESTION Any standby synchronized slots will be dropped if they no longer need to be synchronized. See comment atop drop_obsolete_slots() for more details. ~~~ 12. +static bool +local_slot_update(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 13. +/* + * Drop obsolete slots + * + * Drop the slots that no longer need to be synced i.e. these either do not + * exist on the primary or are no longer enabled for failover. + * + * Additionally, it drops slots that are valid on the primary but got + * invalidated on the standby. This situation may occur due to the following + * reasons: + * - The max_slot_wal_keep_size on the standby is insufficient to retain WAL + * records from the restart_lsn of the slot. + * - primary_slot_name is temporarily reset to null and the physical slot is + * removed. + * - The primary changes wal_level to a level lower than logical. + * + * The assumption is that these dropped slots will get recreated in next + * sync-cycle and it is okay to drop and recreate such slots as long as these + * are not consumable on the standby (which is the case currently). + */ 13a. /Additionally, it drops slots/Additionally, drop any slots/ ~ 13b. /max_slot_wal_keep_size/'max_slot_wal_keep_size'/ /primary_slot_name/'primary_slot_name'/ /wal_level/'wal_level'/ ~ 13c. /The assumption is/The assumptions are/ ~~~ 14. +static bool +update_and_persist_slot(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 15. +static bool +synchronize_one_slot(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 16. + if (remote_slot->confirmed_lsn > latestFlushPtr) + { + /* + * Can get here only if GUC 'standby_slot_names' on the primary server + * was not configured correctly. + */ + ereport(ERROR, + errmsg("skipping slot synchronization as the received slot sync" + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + remote_slot->name, + LSN_FORMAT_ARGS(latestFlushPtr))); + + return false; + } Unreachable return false after ERROR? ~~~ 17. +/* + * Using the specified primary server connection, validates primary_slot_name. + */ The comment seems expressed in a backward way. SUGGESTION Validate the 'primary_slot_name' using the specified primary server connection. ~~~ 18. +static void +validate_primary_slot(WalReceiverConn *wrconn, int slot_invalid_elevel) I think here it is the "configuration" that is wrong, not the "slot". So I suggest removing that word slot from the parameter. /slot_invalid_elevel/invalid_elevel/ ~~~ 19. +/* + * Returns true if all necessary GUCs for slot synchronization are set + * appropriately, otherwise returns false. + */ +static bool +validate_slotsync_params(int elevel) 19a. /Returns true/Return true/ /returns false/return false/ ~ 19b. IMO for consistency better to use the same param name as the previous function /elevel/invalid_elevel/ ~~~ 20. +Datum +pg_sync_replication_slots(PG_FUNCTION_ARGS) +{ + WalReceiverConn *wrconn = NULL; + char *err; + StringInfoData app_name; The wrconn assignment at declaration seems unnecessary since it will be immediately overwritten on the first usage. ~~~ 21. + if (cluster_name[0]) + appendStringInfo(&app_name, "%s_%s", cluster_name, "slotsync"); + else + appendStringInfo(&app_name, "%s", "slotsync"); I wondered why this was coded using format string substitutions instead of like below: if (cluster_name[0]) appendStringInfo(&app_name, "%s_slotsync", cluster_name); else appendStringInfoString(&app_name, "slotsync"); OR if (cluster_name[0]) appendStringInfo(&app_name, "%s_", cluster_name); appendStringInfoString(&app_name, "slotsync"); ~~~ 22. + /* + * Establish the connection to the primary server for slots + * synchronization. + */ + wrconn = walrcv_connect(PrimaryConnInfo, false, false, false, + app_name.data, &err); Unnecessarily verbose? SUGGESTION Connect to the primary server. ~~~ 23. + syncing_slots = true; + + PG_TRY(); + { + /* + * Using the specified primary server connection, validates the slot + * in primary_slot_name. + */ + validate_primary_slot(wrconn, ERROR); + + (void) synchronize_slots(wrconn); + } + PG_FINALLY(); + { + syncing_slots = false; + walrcv_disconnect(wrconn); + } + PG_END_TRY(); 23a. IMO the "syncing_slots = true;" can be deferred until immediately before call to synchronize_slots(); ~ 23b. I felt the comment seems backwards, so can be worded as suggested elsewhere in this post. SUGGESTION Validate the 'primary_slot_name' using the specified primary server connection. OTOH, if you can change the function name to validate_primary_slot_name() then no comment is needed because then it becomes self-explanatory. ====== src/backend/replication/slot.c 24. + /* + * Do not allow users to create the slots with failover enabled on the + * standby as we do not support sync to the cascading standby. + * + * Slots with failover enabled can still be created when doing slot + * synchronization, as it needs to maintain this value in sync with the + * remote slots. + */ + if (failover && RecoveryInProgress() && !IsSyncingReplicationSlots()) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a replication slot" + " created on the standby")); I felt it started to become confusing using "synchronization" and "sync" in the same sentence. SUGGESTION However, slots with failover enabled can be created during slot synchronization because we need to retain the same values as the remote slot. ====== .../t/040_standby_failover_slots_sync.pl 25. + +$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); Since this is where we use the function added by this patch, it deserves to have a comment. SUGGESTION # Synchronize the primary server slots to the standby. ====== src/tools/pgindent/typedefs.list 26. It looks like 'RemoteSlot' should be included in the typedefs.list file. Probably this is the explanation for the space problems I reported earlier in this post. ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: