Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CALj2ACWw9ohERck7Vm0oRvbGngPyp4ux2TKDwj4H-X7jN840JA@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Sat, Jan 27, 2024 at 1:18 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > Replication slots in postgres will prevent removal of required
> > resources when there is no connection using them (inactive). This
> > consumes storage because neither required WAL nor required rows from
> > the user tables/system catalogs can be removed by VACUUM as long as
> > they are required by a replication slot. In extreme cases this could
> > cause the transaction ID wraparound.
> >
> > Currently postgres has the ability to invalidate inactive replication
> > slots based on the amount of WAL (set via max_slot_wal_keep_size GUC)
> > that will be needed for the slots in case they become active. However,
> > the wraparound issue isn't effectively covered by
> > max_slot_wal_keep_size - one can't tell postgres to invalidate a
> > replication slot if it is blocking VACUUM. Also, it is often tricky to
> > choose a default value for max_slot_wal_keep_size, because the amount
> > of WAL that gets generated and allocated storage for the database can
> > vary.
> >
> > Therefore, it is often easy for developers to do the following:
> > a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5
> > billion, after which the slots get invalidated.
> > b) set a timeout of say 1 or 2 or 3 days, after which the inactive
> > slots get invalidated.
> >
> > To implement (a), postgres needs a new GUC called max_slot_xid_age.
> > The checkpointer then invalidates all the slots whose xmin (the oldest
> > transaction that this slot needs the database to retain) or
> > catalog_xmin (the oldest transaction affecting the system catalogs
> > that this slot needs the database to retain) has reached the age
> > specified by this setting.
> >
> > To implement (b), first postgres needs to track the replication slot
> > metrics like the time at which the slot became inactive (inactive_at
> > timestamptz) and the total number of times the slot became inactive in
> > its lifetime (inactive_count numeric) in ReplicationSlotPersistentData
> > structure. And, then it needs a new timeout GUC called
> > inactive_replication_slot_timeout. Whenever a slot becomes inactive,
> > the current timestamp and inactive count are stored in
> > ReplicationSlotPersistentData structure and persisted to disk. The
> > checkpointer then invalidates all the slots that are lying inactive
> > for about inactive_replication_slot_timeout duration starting from
> > inactive_at.
> >
> > In addition to implementing (b), these two new metrics enable
> > developers to improve their monitoring tools as the metrics are
> > exposed via pg_replication_slots system view. For instance, one can
> > build a monitoring tool that signals when replication slots are lying
> > inactive for a day or so using inactive_at metric, and/or when a
> > replication slot is becoming inactive too frequently using inactive_at
> > metric.
> >
> > I’m attaching the v1 patch set as described below:
> > 0001 - Tracks invalidation_reason in pg_replication_slots. This is
> > needed because slots now have multiple reasons for slot invalidation.
> > 0002 - Tracks inactive replication slot information inactive_at and
> > inactive_timeout.
> > 0003 - Adds inactive_timeout based replication slot invalidation.
> > 0004 - Adds XID based replication slot invalidation.
> >
> > Thoughts?
>
> Needed a rebase due to c393308b. Please find the attached v2 patch set.

Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v3 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Fabrízio de Royes Mello
Date:
Subject: Re: speed up a logical replica setup
Next
From: Alvaro Herrera
Date:
Subject: Re: Emitting JSON to file using COPY TO