Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
From | Ashutosh Sharma |
---|---|
Subject | Re: How can end users know the cause of LR slot sync delays? |
Date | |
Msg-id | CAE9k0Pmh86ctxaOQ0QZkt0gmg+pJbu34w-maG=NoJXfbR80hoA@mail.gmail.com Whole thread Raw |
In response to | Re: How can end users know the cause of LR slot sync delays? (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: How can end users know the cause of LR slot sync delays?
|
List | pgsql-hackers |
Hi Amit,
On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> We have seen cases where slot synchronization gets delayed, for example when the slot is behind the failover standby or vice versa, and the slot sync worker has to wait for one to catch up with the other. During this waiting period, users querying pg_replication_slots can only see whether the slot has been synchronized or not. If it has already synchronized, that’s fine, but if synchronization is taking longer, users would naturally want to understand the reason for the delay.
>
> Is there a way for end users to know the cause of slot synchronization delays, so they can take appropriate actions to speed it up?
>
> I understand that server logs are emitted in such cases, but logs are not something end users would want to check regularly. Moreover, since logging is configuration-based, relevant messages may sometimes be skipped or suppressed.
>
Currently, the way to see the reason for sync skip is LOGs but I think
it is better to add a new column like sync_skip_reason in
pg_replication_slots. This can show the reasons like
standby_LSN_ahead_remote_LSN. I think ideally users can compare
standby's slot LSN/XMIN with remote_slot being synced. Do you have any
better ideas?
Step 1: Define an enum for all the possible reasons why slot persistence was skipped.
/*
 * Reasons why a replication slot sync was skipped.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                  /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),  /* Remote slot is behind local reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),      /* Local slot ahead of remote, risk of data loss */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)     /* Standby could not build a consistent snapshot */
} ReplicationSlotSyncSkipReason;
--
Step 2: Introduce a new column in pg_replication_slots to store the skip reason.
/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;
--
Step 3: Add a function that converts the enum to a human-readable string that can be stored in pg_replication_slots.
/*
 * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
 *
 * Returns a palloc'd string; caller is responsible for freeing it.
 */
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
    StringInfoData buf;

    initStringInfo(&buf);

    if (reason == RS_SYNC_SKIP_NONE)
    {
        appendStringInfoString(&buf, "none");
        return buf.data;
    }

    if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
        appendStringInfoString(&buf, "remote_behind|");
    if (reason & RS_SYNC_SKIP_DATA_LOSS)
        appendStringInfoString(&buf, "data_loss|");
    if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
        appendStringInfoString(&buf, "no_snapshot|");

    /* Remove trailing '|' and keep buf.len consistent with the data */
    if (buf.len > 0 && buf.data[buf.len - 1] == '|')
        buf.data[--buf.len] = '\0';

    return buf.data;
}
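To make the intended output format easy to check outside the server, here is a minimal standalone sketch of the same bitmask-to-string conversion. It is only an illustration: skip_reason_to_buf and the skip_reason_labels table are hypothetical names, and the sketch writes into a caller-supplied buffer with plain C string functions instead of using StringInfo and palloc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same flag values as the enum proposed in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/* Hypothetical lookup table: one label per flag bit. */
static const struct
{
    int         flag;
    const char *label;
} skip_reason_labels[] = {
    {RS_SYNC_SKIP_REMOTE_BEHIND, "remote_behind"},
    {RS_SYNC_SKIP_DATA_LOSS, "data_loss"},
    {RS_SYNC_SKIP_NO_SNAPSHOT, "no_snapshot"},
};

/*
 * Hypothetical standalone helper: write the '|'-separated labels for
 * "reason" into buf (of size buflen) and return buf.  Output that does
 * not fit is truncated at a label boundary.
 */
static char *
skip_reason_to_buf(int reason, char *buf, size_t buflen)
{
    size_t      nlabels = sizeof(skip_reason_labels) / sizeof(skip_reason_labels[0]);
    size_t      used = 0;
    size_t      i;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    if (reason == RS_SYNC_SKIP_NONE)
    {
        if (buflen > strlen("none"))
            strcpy(buf, "none");
        return buf;
    }

    for (i = 0; i < nlabels; i++)
    {
        size_t      need;

        if ((reason & skip_reason_labels[i].flag) == 0)
            continue;

        /* A separator is only written between labels, so none trails. */
        need = strlen(skip_reason_labels[i].label) + (used > 0 ? 1 : 0);
        if (used + need >= buflen)
            break;              /* no room for this label plus the NUL */
        if (used > 0)
            buf[used++] = '|';
        strcpy(buf + used, skip_reason_labels[i].label);
        used += strlen(skip_reason_labels[i].label);
    }
    return buf;
}
```

With a 64-byte buffer, passing RS_SYNC_SKIP_REMOTE_BEHIND | RS_SYNC_SKIP_NO_SNAPSHOT yields "remote_behind|no_snapshot". Writing the separator before each label after the first avoids having to strip a trailing '|' afterwards.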
--
Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages are generated, primarily inside update_local_synced_slot or update_and_persist_local_synced_slot. This value can later be persisted in the pg_replication_slots catalog.
--
Please let me know if you have any objections. I’ll share the WIP patch in a few days.
--
With Regards,
Ashutosh Sharma.