Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Ashutosh Sharma
Subject Re: How can end users know the cause of LR slot sync delays?
Date
Msg-id CAE9k0Pmh86ctxaOQ0QZkt0gmg+pJbu34w-maG=NoJXfbR80hoA@mail.gmail.com
In response to Re: How can end users know the cause of LR slot sync delays?  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: How can end users know the cause of LR slot sync delays?
List pgsql-hackers
Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.
>
> Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?
>
> I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.
>

Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?


I have similar thoughts, but for clarity, I’d like to outline some of the key steps I plan to take:

Step 1: Define an enum covering every reason for which slot persistence can be skipped.

/*
 * Reasons why a replication slot sync was skipped.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                 /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),     /* Local slot ahead of remote, risk of data loss */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)    /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

Step 2: Introduce a new column in pg_replication_slots to store the skip reason.

/* Inside the pg_replication_slots view */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

Step 3: Add a function that converts the enum to a human-readable string that can be stored in pg_replication_slots.

/*
 * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
 *
 * Returns a palloc'd string; caller is responsible for freeing it.
 */
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
    StringInfoData buf;
    initStringInfo(&buf);

    if (reason == RS_SYNC_SKIP_NONE)
    {
        appendStringInfoString(&buf, "none");
        return buf.data;
    }

    if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
        appendStringInfoString(&buf, "remote_behind|");
    if (reason & RS_SYNC_SKIP_DATA_LOSS)
        appendStringInfoString(&buf, "data_loss|");
    if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
        appendStringInfoString(&buf, "no_snapshot|");

    /* Remove trailing '|', keeping buf.len consistent with the contents */
    if (buf.len > 0 && buf.data[buf.len - 1] == '|')
        buf.data[--buf.len] = '\0';

    return buf.data;
}
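
For anyone who wants to try the bitmask-to-string logic outside the server, here is a minimal standalone sketch. It assumes plain C with snprintf in place of StringInfo and palloc. The helper name format_sync_skip_reason, the lookup table skip_reason_names, and the caller-supplied buffer are hypothetical and not part of the proposal; only the enum values and the output strings are taken from the steps above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same bit flags as in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/* Hypothetical lookup table: one entry per flag, in output order. */
static const struct
{
    unsigned int flag;
    const char *name;
} skip_reason_names[] = {
    {RS_SYNC_SKIP_REMOTE_BEHIND, "remote_behind"},
    {RS_SYNC_SKIP_DATA_LOSS, "data_loss"},
    {RS_SYNC_SKIP_NO_SNAPSHOT, "no_snapshot"}
};

/*
 * Hypothetical standalone helper: write a '|'-separated list of the set
 * flags into buf (always NUL-terminated) and return the string length.
 * If buf is too small, the output is truncated.
 */
size_t
format_sync_skip_reason(unsigned int reason, char *buf, size_t buflen)
{
    size_t used = 0;
    size_t i;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    if (reason == RS_SYNC_SKIP_NONE)
    {
        snprintf(buf, buflen, "none");
        return strlen(buf);
    }

    for (i = 0; i < sizeof(skip_reason_names) / sizeof(skip_reason_names[0]); i++)
    {
        int n;

        if ((reason & skip_reason_names[i].flag) == 0)
            continue;

        /* Prepend the separator only between names, so nothing needs trimming. */
        n = snprintf(buf + used, buflen - used, "%s%s",
                     used > 0 ? "|" : "", skip_reason_names[i].name);
        if (n < 0 || (size_t) n >= buflen - used)
            return strlen(buf);     /* error or truncated output */
        used += (size_t) n;
    }

    return used;
}
```

Writing the separator before each name after the first avoids the trailing-'|' cleanup, and a table-driven loop means a new reason only needs one new table entry. Whether that is preferable to the StringInfo version is a matter of taste.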

--

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages are generated, primarily inside update_local_synced_slot or update_and_persist_local_synced_slot. This value can later be persisted in the pg_replication_slots catalog.
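
To illustrate the idea in Step 4, here is a rough standalone mock of recording the reason at the point where a sync attempt is abandoned. Everything in it is an assumption made for illustration: MockSlot, MockLSN, and mock_sync_slot are hypothetical, the two conditions are simplified stand-ins for the real checks, and none of this is the actual code in update_local_synced_slot or update_and_persist_local_synced_slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN; the server uses its own type. */
typedef uint64_t MockLSN;

/* Same bit flags as in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/* Hypothetical, heavily reduced model of a locally synced slot. */
typedef struct MockSlot
{
    MockLSN      restart_lsn;       /* local reserved position */
    unsigned int sync_skip_reason;  /* bitmask from the last sync attempt */
} MockSlot;

/*
 * Attempt one sync cycle. Returns true if the local slot was advanced.
 * The reason is overwritten on every attempt, so a successful sync
 * clears a previously recorded skip reason.
 */
bool
mock_sync_slot(MockSlot *local, MockLSN remote_restart_lsn,
               bool found_consistent_snapshot)
{
    unsigned int reason = RS_SYNC_SKIP_NONE;

    assert(local != NULL);

    /* Assumed condition: the remote slot is behind the local position. */
    if (remote_restart_lsn < local->restart_lsn)
        reason |= RS_SYNC_SKIP_REMOTE_BEHIND;

    /* Assumed condition: no consistent snapshot could be built. */
    if (!found_consistent_snapshot)
        reason |= RS_SYNC_SKIP_NO_SNAPSHOT;

    /* Record the outcome where the LOG message would be emitted. */
    local->sync_skip_reason = reason;

    if (reason != RS_SYNC_SKIP_NONE)
        return false;

    local->restart_lsn = remote_restart_lsn;
    return true;
}
```

The point of the sketch is only that the reason is set at the same place a skip is decided and is reset on success, so that a stale reason does not linger after the slot catches up.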

--

Please let me know if you have any objections. I’ll share the WIP patch in a few days.

--
With Regards,
Ashutosh Sharma.
