Re: promotion related handling in pg_sync_replication_slots() - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: promotion related handling in pg_sync_replication_slots()
Date
Msg-id CALj2ACW4Tpdh4nsn7RaTB8C2NufVnNZtGYqPNwcbdaH4Ce=Yqw@mail.gmail.com
Whole thread Raw
In response to Re: promotion related handling in pg_sync_replication_slots()  (shveta malik <shveta.malik@gmail.com>)
Responses Re: promotion related handling in pg_sync_replication_slots()
List pgsql-hackers
On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> Please find v2. Changes are:

Thanks for the patch. Here are some comments.

1. Can we have a clear saying in the shmem variable who's syncing at
the moment? Is it a slot sync worker or a backend via SQL function?
Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"

typedef enum SlotSyncSource
{
    SLOT_SYNC_NONE,
    SLOT_SYNC_WORKER,
    SLOT_SYNC_BACKEND,
} SlotSyncSource;

Then, the check in ShutDownSlotSync can be:

+       /*
+        * Return if neither the slot sync worker is running nor the function
+        * pg_sync_replication_slots() is executing.
+        */
+       if ((SlotSyncCtx->pid == InvalidPid) &&
SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
        {

2.
SyncReplicationSlots(WalReceiverConn *wrconn)
 {
+    /*
+     * Startup process signaled the slot sync to stop, so if meanwhile user
+     * has invoked slot sync SQL function, simply return.
+     */
+    SpinLockAcquire(&SlotSyncCtx->mutex);
+    if (SlotSyncCtx->stopSignaled)
+    {
+        ereport(LOG,
+                errmsg("skipping slot synchronization as slot sync
shutdown is signaled during promotion"));
+

Unless I'm missing something, I think this can't detect if the backend
via SQL function is already half-way through syncing in
synchronize_one_slot. So, better move this check to (or also have it
there) slot sync loop that calls synchronize_one_slot. To avoid
spinlock acquisitions, we can perhaps do this check in when we acquire
the spinlock for synced flag.

    /* Search for the named slot */
    if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
    {
        bool        synced;

        SpinLockAcquire(&slot->mutex);
        synced = slot->data.synced;
        << get SlotSyncCtx->stopSignaled here >>
        SpinLockRelease(&slot->mutex);

        << do the slot sync skip check here if (stopSignaled) >>

3. Can we have a test or steps at least to check the consequences
manually one can get if slot syncing via SQL function is happening
during the promotion?

IIUC, we need to ensure there is no backend acquiring it and
performing sync while the slot sync worker is shutting down/standby
promotion is occuring. Otherwise, some of the slots can get resynced
and some are not while we are shutting down the slot sync worker as
part of the standby promotion which might leave the slots in an
inconsistent state.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Next
From: jian he
Date:
Subject: Re: remaining sql/json patches