Separate catalog_xmin from xmin in walsender hot standby feedback - Mailing list pgsql-hackers

From Rui Zhao
Subject Separate catalog_xmin from xmin in walsender hot standby feedback
Msg-id 3544CABD-2D09-4D7F-AFB7-2C75A09C4E22@gmail.com
List pgsql-hackers
Hi hackers,

I'd like to propose a fix for a long-standing issue: the catalog_xmin
reported in hot standby feedback needlessly holds back vacuuming of
user data tables on the primary when no physical replication slot is
used.

== Problem ==

When a standby sends hot standby feedback to a primary without a
physical replication slot, ProcessStandbyHSFeedbackMessage() takes
min(feedbackCatalogXmin, feedbackXmin) and stores it into
MyProc->xmin:

    if (TransactionIdIsNormal(feedbackCatalogXmin)
        && TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
        MyProc->xmin = feedbackCatalogXmin;
    else
        MyProc->xmin = feedbackXmin;

Since ComputeXidHorizons() treats proc->xmin uniformly for both data
and catalog horizons, the catalog_xmin ends up holding back
data_oldest_nonremovable, preventing vacuum from cleaning dead tuples
in regular user tables.
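
To make the effect concrete, here is a minimal standalone sketch of the
behavior described above. It is not PostgreSQL source: the toy_* helpers
and the two-input horizon are hypothetical simplifications, and the XID
helpers are only modeled on the real ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/* Simplified stand-in for the PostgreSQL macro of the same name. */
static bool
TransactionIdIsNormal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

/* Wraparound-aware "id1 is older than id2", modeled on transam.c. */
static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}

/* Hypothetical helper: the value the current no-slot path stores in MyProc->xmin. */
static TransactionId
toy_merged_feedback_xmin(TransactionId feedbackXmin,
                         TransactionId feedbackCatalogXmin)
{
    if (TransactionIdIsNormal(feedbackCatalogXmin)
        && TransactionIdPrecedes(feedbackCatalogXmin, feedbackXmin))
        return feedbackCatalogXmin;
    return feedbackXmin;
}

/*
 * Hypothetical toy horizon: the older of the primary's own oldest xmin and
 * the walsender's proc xmin.  The real ComputeXidHorizons() scans every
 * PGPROC entry; this only models the single walsender of interest.
 */
static TransactionId
toy_data_oldest_nonremovable(TransactionId local_oldest_xmin,
                             TransactionId walsender_xmin)
{
    assert(TransactionIdIsNormal(local_oldest_xmin));
    if (TransactionIdIsNormal(walsender_xmin)
        && TransactionIdPrecedes(walsender_xmin, local_oldest_xmin))
        return walsender_xmin;
    return local_oldest_xmin;
}
```

With feedbackXmin = 1000 and feedbackCatalogXmin = 500, the merged value
is 500, so the toy data horizon for a primary whose own oldest xmin is
1200 becomes 500, even though the standby's queries only need 1000.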

The existing code even acknowledges this limitation:

    "We can only track the catalog xmin separately when using a slot,
     so we store the least of the two provided when not using a slot."

== Why this matters ==

One might argue "just use a replication slot."  However, many
production HA deployments intentionally avoid physical replication
slots because of their lifecycle management complexity:

- When a primary fails, physical slots on the old primary are lost
  and cannot be automatically migrated to the promoted standby.
- Other standbys that were using slots on the old primary must
  re-establish their slots on the new primary, potentially requiring
  a fresh base backup.
- Dangling slots from disconnected standbys can cause unbounded WAL
  accumulation until manually dropped.

These deployments use wal_keep_size or WAL archiving for WAL
retention, combined with hot_standby_feedback for visibility horizon
management.  This is a legitimate production configuration -- for
example, some HA frameworks (Patroni with certain configurations,
custom HA scripts) operate this way.

The issue becomes severe when the standby also hosts a logical
replication slot (e.g., for change data capture or logical replication
to a downstream).  The logical slot's catalog_xmin can be very old
(retained for logical decoding catalog access), and this old value
gets propagated to the primary's walsender via hot standby feedback,
blocking vacuum on ALL user data tables on the primary.  This leads
to table bloat that is difficult to diagnose since the DBA may not
realize the connection between a standby's logical slot and the
primary's vacuum behavior.

== Fix ==

The patch adds a catalog_xmin field to PGPROC (4 bytes), so the
walsender can track catalog_xmin separately from xmin even without a
replication slot.  This mirrors how replication slots already separate
slot->data.xmin from slot->data.catalog_xmin.

In ComputeXidHorizons(), the new proc_catalog_xmin is accumulated
from PGPROC entries and applied only to catalog_oldest_nonremovable
and shared_oldest_nonremovable -- exactly how slot_catalog_xmin is
already handled.  It does NOT affect data_oldest_nonremovable.
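
The separation can be sketched as follows. This is a simplified model
under stated assumptions, not the patch itself: ToyProc, ToyHorizons,
toy_xid_older() and toy_compute_horizons() are hypothetical names, and
the real ComputeXidHorizons() handles many more cases (slots, per-database
horizons, temp tables).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/* Hypothetical, trimmed-down PGPROC: only the two fields discussed here. */
typedef struct ToyProc
{
    TransactionId xmin;         /* data visibility horizon */
    TransactionId catalog_xmin; /* proposed new field, catalog-only horizon */
} ToyProc;

typedef struct ToyHorizons
{
    TransactionId data_oldest_nonremovable;
    TransactionId catalog_oldest_nonremovable;
    TransactionId shared_oldest_nonremovable;
} ToyHorizons;

static bool
TransactionIdIsNormal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}

/* Older of two XIDs, ignoring values that are not normal XIDs. */
static TransactionId
toy_xid_older(TransactionId a, TransactionId b)
{
    if (!TransactionIdIsNormal(a))
        return b;
    if (!TransactionIdIsNormal(b))
        return a;
    return TransactionIdPrecedes(a, b) ? a : b;
}

static ToyHorizons
toy_compute_horizons(const ToyProc *procs, int nprocs, TransactionId next_xid)
{
    ToyHorizons   h;
    TransactionId oldest_xmin = next_xid;
    TransactionId proc_catalog_xmin = InvalidTransactionId;
    int           i;

    assert(nprocs >= 0);
    for (i = 0; i < nprocs; i++)
    {
        /* xmin feeds every horizon; catalog_xmin is accumulated on the side */
        oldest_xmin = toy_xid_older(oldest_xmin, procs[i].xmin);
        proc_catalog_xmin = toy_xid_older(proc_catalog_xmin,
                                          procs[i].catalog_xmin);
    }

    /* proc_catalog_xmin is applied to catalog and shared horizons only */
    h.data_oldest_nonremovable = oldest_xmin;
    h.catalog_oldest_nonremovable = toy_xid_older(oldest_xmin, proc_catalog_xmin);
    h.shared_oldest_nonremovable = toy_xid_older(oldest_xmin, proc_catalog_xmin);
    return h;
}
```

For a walsender entry {xmin = 1000, catalog_xmin = 500} and a next XID of
1200, this model yields a data horizon of 1000 while the catalog and
shared horizons stay at 500.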

GetReplicationHorizons() is updated to include proc_catalog_xmin in
the catalog_xmin sent upstream, ensuring correct behavior in
cascading standby configurations.
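
For the cascading case, a rough sketch of the intended merge is shown
here. The function name, its parameter list, and toy_xid_older() are
hypothetical; the real GetReplicationHorizons() takes only the two output
pointers and obtains its inputs from ComputeXidHorizons().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

static bool
TransactionIdIsNormal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

static bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}

/* Older of two XIDs, ignoring values that are not normal XIDs. */
static TransactionId
toy_xid_older(TransactionId a, TransactionId b)
{
    if (!TransactionIdIsNormal(a))
        return b;
    if (!TransactionIdIsNormal(b))
        return a;
    return TransactionIdPrecedes(a, b) ? a : b;
}

/*
 * Hypothetical model of the values a cascading standby reports upstream:
 * xmin is passed through unchanged, while catalog_xmin is the older of the
 * slot-derived and the PGPROC-derived catalog horizons.
 */
static void
toy_get_replication_horizons(TransactionId raw_xmin,
                             TransactionId slot_catalog_xmin,
                             TransactionId proc_catalog_xmin,
                             TransactionId *xmin,
                             TransactionId *catalog_xmin)
{
    assert(xmin != NULL && catalog_xmin != NULL);
    *xmin = raw_xmin;
    *catalog_xmin = toy_xid_older(slot_catalog_xmin, proc_catalog_xmin);
}
```

In this model, a mid-tier standby with no local slots but a slot-less
downstream reporting catalog_xmin = 500 would still forward 500 upstream
rather than dropping it.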

Changes summary:
- proc.h: add catalog_xmin to PGPROC
- proc.c: initialize catalog_xmin in InitProcess/InitAuxiliaryProcess
- procarray.c: accumulate and apply proc_catalog_xmin in
  ComputeXidHorizons(); include in GetReplicationHorizons()
- walsender.c: set MyProc->xmin and MyProc->catalog_xmin separately
  in the no-slot path of ProcessStandbyHSFeedbackMessage()

== Alternatives considered ==

1. Generalize the ephemeral slot concept (as suggested by the existing
   XXX comment): this would automatically create a temporary slot for
   slot-less walsenders.  More invasive, requires slot allocation
   (max_replication_slots), and adds slot lifecycle management.

2. Simply ignore catalog_xmin in the no-slot path: simpler but loses
   catalog protection for the standby's logical decoding.

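The proposed approach is the least invasive of the three: it needs no
slot allocation, keeps catalog protection for the standby's logical
decoding, and is consistent with how slots already handle the
separation.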

== Testing ==

A new TAP test (053_hs_feedback_catalog_xmin.pl) verifies:
1. With hot_standby_feedback=on and no physical replication slot, when
   the standby has a logical slot with an old catalog_xmin, VACUUM on
   the primary can still clean dead tuples in user data tables.
2. The standby's logical slot catalog_xmin remains properly set,
   confirming catalog protection is preserved.

Patch attached.

Regards,
Rui Zhao




