Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From shveta malik
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAJpy0uCRykgTBwu9tTTdxxcM-r9FAerAO4yJgWZc7MFb-06gSw@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
On Tue, Oct 10, 2023 at 12:52 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Oct 9, 2023 at 9:34 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Here are some review comments for v20-0002.
> > >
> >
> > Thanks Peter for the feedback. Comments from 31 till end are addressed
> > in v22. First 30 comments will be addressed in the next version.
> >
>
> Thanks for addressing my previous comments.
>
> I checked those and went through other changes in v22-0002 to give a
> few more review comments below.
>

Thank You for your feedback. I have addressed these in v23.

> I understand there are some design changes coming soon regarding the
> use of GUCs so maybe a few of these comments will become redundant.
>
> ======
> doc/src/sgml/config.sgml
>
> 1.
>            A password needs to be provided too, if the sender demands password
>            authentication.  It can be provided in the
>            <varname>primary_conninfo</varname> string, or in a separate
> -          <filename>~/.pgpass</filename> file on the standby server (use
> -          <literal>replication</literal> as the database name).
> -          Do not specify a database name in the
> -          <varname>primary_conninfo</varname> string.
> +          <filename>~/.pgpass</filename> file on the standby server.
> +         </para>
> +         <para>
> +          Specify a database name in <varname>primary_conninfo</varname> string
> +          to allow synchronization of slots from the primary to standby. This
> +          dbname will only be used for slots synchronization purpose and will
> +          be irrelevant for streaming.
>           </para>
>
> 1a.
> "Specify a database name in...". Shouldn't that say "Specify dbname in..."?
>
> ~
>
> 1b.
> BEFORE
> This dbname will only be used for slots synchronization purpose and
> will be irrelevant for streaming.
>
> SUGGESTION
> This will only be used for slot synchronization. It is ignored for streaming.
>
> ======
> doc/src/sgml/system-views.sgml
>
> 2. pg_replication_slots
>
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>synced_slot</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       True if this logical slot is created on physical standby as part of
> +       slot-synchronization from primary server. Always false for
> physical slots.
> +      </para></entry>
> +     </row>
>
> /on physical standby/on the physical standby/
>
> /from primary server/from the primary server/
>
> ======
> src/backend/replication/logical/launcher.c
>
> 3. LaunchSlotSyncWorkers
>
> + /*
> + * If we failed to launch this slotsync worker, return and try
> + * launching rest of the workers in next sync cycle. But change
> + * launcher's wait time to minimum of wal_retrieve_retry_interval and
> + * default wait time to try next sync-cycle sooner.
> + */
>
> 3a.
> Use consistent terms -- choose "sync cycle" or "sync-cycle"
>
> ~
>
> 3b.
> Is it correct to just say "rest of the workers"; won't it also try to
> relaunch this same failed worker again?
>
> ~~~
>
> 4. LauncherMain
>
> + /*
> + * Stop the slot-sync workers if any of the related GUCs changed.
> + * These will be relaunched using the new values during next
> + * sync-cycle. Also revalidate the new configurations and
> + * reconnect.
> + */
> + if (SlotSyncConfigsChanged())
> + {
> + slotsync_workers_stop();
> +
> + if (wrconn)
> + walrcv_disconnect(wrconn);
> +
> + if (RecoveryInProgress())
> + wrconn = slotsync_remote_connect();
> + }
>
> Was it overkill to disconnect/reconnect every time any of those GUCs
> changed? Or is it enough to do that only if the
> PrimaryConnInfoPreReload was changed?
>
> ======
> src/backend/replication/logical/logical.c
>
> 5. CreateDecodingContext
>
> + /*
> + * Do not allow consumption of a "synchronized" slot until the standby
> + * gets promoted.
> + */
> + if (RecoveryInProgress() && slot->data.synced)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot use replication slot \"%s\" for logical decoding",
> + NameStr(slot->data.name)),
> + errdetail("This slot is being synced from primary."),
> + errhint("Specify another replication slot.")));
> +
>
> /from primary/from the primary/
>
> ======
> src/backend/replication/logical/slotsync.c
>
> 6. use_slot_in_query
>
> + /*
> + * Return TRUE if either slot is not yet created on standby or if it
> + * belongs to one of the dbs passed in dbids.
> + */
> + if (!slot_found || relevant_db)
> + return true;
> +
> + return false;
>
> Same as single line:
>
> return (!slot_found || relevant_db);
>
> ~~~
>
> 7. synchronize_one_slot
>
> + /*
> + * If the local restart_lsn and/or local catalog_xmin is ahead of
> + * those on the remote then we cannot create the local slot in sync
> + * with primary because that would mean moving local slot backwards
> + * and we might not have WALs retained for old lsns. In this case we
> + * will wait for primary's restart_lsn and catalog_xmin to catch up
> + * with the local one before attempting the sync.
> + */
>
> /moving local slot/moving the local slot/
>
> /with primary/with the primary/
>
> /wait for primary's/wait for the primary's/
>
> ~~~
>
> 8. ProcessSlotSyncInterrupts
>
> + if (ConfigReloadPending)
> + {
> + ConfigReloadPending = false;
> +
> + /* Save the PrimaryConnInfo before reloading */
> + *conninfo_prev = pstrdup(PrimaryConnInfo);
>
> If the configuration keeps changing then there might be a slow leak
> here because I didn't notice anywhere where this strdup'ed string is
> getting freed. Is that something worth worrying about?
>
> ======
> src/backend/replication/slot.c
>
> 9. ReplicationSlotDrop
>
> + /*
> + * Do not allow users to drop the slots which are currently being synced
> + * from the primary to standby.
> + */
> + if (user_cmd && RecoveryInProgress() && MyReplicationSlot->data.synced)
> + {
> + ReplicationSlotRelease();
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot drop replication slot"),
> + errdetail("This slot is being synced from primary.")));
> + }
> +
>
> 9a.
> /to standby/to the standby/
>
> ~
>
> 9b.
> Shouldn't the errmsg name the slot? Otherwise, the message might not
> be so useful.
>
> ~
>
> 9c.
> /synced from primary/synced from the primary/
>
> ======
> src/backend/replication/walsender.c
>
>
> 10. ListSlotDatabaseOIDs
>
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> + for (slotno = 0; slotno < max_replication_slots; slotno++)
> + {
> + ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[slotno];
>
> This is all new code so you can use C99 for loop variable declaration here.
>
> ~~~
>
> 11.
> + /* If synchronize_slot_names is '*', then skip physical slots */
> + if (SlotIsPhysical(slot))
> + continue;
> +
>
>
> Some mental gymnastics are needed to understand how this code means "
> synchronize_slot_names is '*'".
>
> IMO it would be easier to understand if the previous "if
> (numslot_names)" was rewritten as if/else.
>
> ======
> .../utils/activity/wait_event_names.txt
>
> 12.
>  RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL
> to arrive, during streaming recovery."
> +REPL_SLOTSYNC_MAIN "Waiting in main loop of worker for synchronizing
> slots to a standby from primary."
> +REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for primary to catch-up in
> worker for synchronizing slots to a standby from primary."
>  SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
>
> 12a.
> Maybe those descriptions can be simplified a bit?
>
> SUGGESTION
> REPL_SLOTSYNC_MAIN "Waiting in the main loop of slot-sync worker."
> REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for the primary to catch up, in
> slot-sync worker."
>
> ~
>
> 12b.
> typo?
>
> /REPL_SLOTSYNC_PRIMARY_CATCHP/REPL_SLOTSYNC_PRIMARY_CATCHUP/
>
> ======
> src/include/replication/walreceiver.h
>
> 13. WalRcvRepSlotDbData
>
> +/*
> + * Slot's DBid related data
> + */
> +typedef struct WalRcvRepSlotDbData
> +{
> + Oid database; /* Slot's DBid received from remote */
> +} WalRcvRepSlotDbData;
>
> Just calling this new field 'database' seems odd. Searching PG src I
> found typical fields/variables like this one are called 'databaseid',
> or 'dboid', or 'dbid', or 'db_id' etc.
>
> ======
> Kind Regards,
> Peter Smith.
> Fujitsu Australia



pgsql-hackers by date:

Previous
From: shveta malik
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: jian he
Date:
Subject: Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges