Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From shveta malik
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAJpy0uArBc_aTkBn28tKfHbhrzxQeHCCyzZbt_VZumKEWKbHZQ@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Responses RE: Synchronizing slots from primary to standby
List pgsql-hackers
On Thu, Feb 8, 2024 at 4:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few comments on 0001
> ===================

Thanks Amit. Addressed these in v81.

> 1.
> + * the slots on the standby and synchronize them. This is done on every call
> + * to SQL function pg_sync_replication_slots.
> >
>
> I think the second sentence can be slightly changed to: "This is done
> by a call to SQL function pg_sync_replication_slots." or "One can call
> SQL function pg_sync_replication_slots to invoke this functionality."

Done.

> 2.
> +update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> {
> ...
> + SpinLockAcquire(&slot->mutex);
> + slot->data.plugin = plugin_name;
> + slot->data.database = remote_dbid;
> + slot->data.two_phase = remote_slot->two_phase;
> + slot->data.failover = remote_slot->failover;
> + slot->data.restart_lsn = remote_slot->restart_lsn;
> + slot->data.confirmed_flush = remote_slot->confirmed_lsn;
> + slot->data.catalog_xmin = remote_slot->catalog_xmin;
> + slot->effective_catalog_xmin = remote_slot->catalog_xmin;
> + SpinLockRelease(&slot->mutex);
> +
> + if (remote_slot->catalog_xmin != slot->data.catalog_xmin)
> + ReplicationSlotsComputeRequiredXmin(false);
> +
> + if (remote_slot->restart_lsn != slot->data.restart_lsn)
> + ReplicationSlotsComputeRequiredLSN();
> ...
> }
>
> How is it possible that after assigning the values from remote_slot
> they can differ from local slot values?

It was a mistake while comment fixing in previous versions. Corrected
it now. Thanks for catching.

> 3.
> + /*
> + * Find the oldest existing WAL segment file.
> + *
> + * Normally, we can determine it by using the last removed segment
> + * number. However, if no WAL segment files have been removed by a
> + * checkpoint since startup, we need to search for the oldest segment
> + * file currently existing in XLOGDIR.
> + */
> + oldest_segno = XLogGetLastRemovedSegno() + 1;
> +
> + if (oldest_segno == 1)
> + oldest_segno = XLogGetOldestSegno(0);
>
> I feel this way isn't there a risk that XLogGetOldestSegno() will get
> us the seg number from some previous timeline which won't make sense
> to compare segno in reserve_wal_for_local_slot. Shouldn't you need to
> fetch the current timeline and send as a parameter to this function as
> that is the timeline on which standby is communicating with primary.

Yes, modified it.

> 4.
> + if (remote_slot->confirmed_lsn > latestFlushPtr)
> + ereport(ERROR,
> + errmsg("skipping slot synchronization as the received slot sync"
>
> I think the internal errors should be reported with elog as you have
> done at other palces in the patch.

Done.

> 5.
> +synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> {
> ...
> + /*
> + * Copy the invalidation cause from remote only if local slot is not
> + * invalidated locally, we don't want to overwrite existing one.
> + */
> + if (slot->data.invalidated == RS_INVAL_NONE)
> + {
> + SpinLockAcquire(&slot->mutex);
> + slot->data.invalidated = remote_slot->invalidated;
> + SpinLockRelease(&slot->mutex);
> +
> + /* Make sure the invalidated state persists across server restart */
> + ReplicationSlotMarkDirty();
> + ReplicationSlotSave();
> + slot_updated = true;
> + }
> ...
> }
>
> Do we need to copy the 'invalidated' from remote to local if both are
> same? I think this will happen for each slot each time because
> normally slots won't be invalidated ones, so there is needless writes.

It is not needed everytime. Optimized it. Now we copy only if
local_slot's 'invalidated' value is RS_INVAL_NONE while remote-slot's
value != RS_INVAL_NONE.

> 6.
> + * Returns TRUE if any of the slots gets updated in this sync-cycle.
> + */
> +static bool
> +synchronize_slots(WalReceiverConn *wrconn)
> ...
> ...
>
> +void
> +SyncReplicationSlots(WalReceiverConn *wrconn)
> +{
> + PG_TRY();
> + {
> + validate_primary_slot_name(wrconn);
> +
> + (void) synchronize_slots(wrconn);
>
> For the purpose of 0001, synchronize_slots() doesn't seems to use
> return value. So, I suggest to change it accordingly and move the
> return value in the required patch.

Modified it. Also changed return values of all related internal
functions which were returning slot_updated.

> 7.
> + /*
> + * The primary_slot_name is not set yet or WALs not received yet.
> + * Synchronization is not possible if the walreceiver is not started.
> + */
> + latestWalEnd = GetWalRcvLatestWalEnd();
> + SpinLockAcquire(&WalRcv->mutex);
> + if ((WalRcv->slotname[0] == '\0') ||
> + XLogRecPtrIsInvalid(latestWalEnd))
> + {
> + SpinLockRelease(&WalRcv->mutex);
> + return false;
>
> For the purpose of 0001, we should give WARNING here.

I will fix it in the next version. Sorry, I somehow missed it this time.

thanks
Shveta



pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
Next
From: shveta malik
Date:
Subject: Re: Synchronizing slots from primary to standby