Introduce XID age based replication slot invalidation - Mailing list pgsql-hackers

From John H
Subject Introduce XID age based replication slot invalidation
Msg-id CA+-JvFsMHckBMzsu5Ov9HCG3AFbMh056hHy1FiXazBRtZ9pFBg@mail.gmail.com
List pgsql-hackers
Hi folks,

I'd like to restart the discussion about providing an XID-based slot
invalidation mechanism. The previous effort [1] proposed both XID-based
and time-based invalidation, and the inactive-time-based approach was
implemented first. The latest XID-based patch from Bharath Rupireddy
can be found here [2].

When thinking about availability of the database, inactive replication
slots cause two main pain points:
1) WAL accumulation
2) Replication slots with xmin/catalog_xmin can hold back vacuuming,
leading to wraparound

The first issue can be mitigated by 'max_slot_wal_keep_size'. However,
in the second case there is no good mechanism to prioritize write
availability of the database and avoid wraparound. The new GUC
'idle_replication_slot_timeout' partially addresses the concern if
your workloads are similar. However, it's hard to apply the same
setting across a fleet of different applications.

It's easy to imagine a workload with high XID churn in one cluster,
while another runs large batch jobs whose changes get synced out
periodically. There isn't a one-size-fits-all setting for
'idle_replication_slot_timeout' in these two cases.
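
As a rough illustration of why a single timeout does not transfer
between such workloads, the sketch below compares how many XIDs each
would consume while a slot sits idle for the same period. The rates
and the one-day timeout are made-up numbers for illustration, not
measurements, and xids_consumed is a hypothetical helper.

```python
ONE_DAY_SECONDS = 24 * 60 * 60
# Roughly how far an XID horizon can fall behind before wraparound
# protection becomes a concern (half of the 32-bit XID space).
WRAPAROUND_HORIZON = 2**31


def xids_consumed(xids_per_second: int, idle_seconds: int) -> int:
    """XIDs consumed while a slot stays idle for idle_seconds."""
    return xids_per_second * idle_seconds


# Hypothetical workloads sharing the same one-day idle timeout.
high_churn = xids_consumed(10_000, ONE_DAY_SECONDS)  # 864,000,000 XIDs
batch_sync = xids_consumed(10, ONE_DAY_SECONDS)      # 864,000 XIDs

# Fraction of the wraparound horizon used up before the timeout fires.
high_churn_fraction = high_churn / WRAPAROUND_HORIZON  # about 0.40
batch_sync_fraction = batch_sync / WRAPAROUND_HORIZON  # about 0.0004
```

Under these assumed rates, the same timeout lets the high-churn
cluster burn roughly 40% of the horizon while being needlessly
aggressive, in XID terms, for the batch cluster.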

The attached patch addresses this by introducing 'max_slot_xid_age' in
a similar fashion. Replication slots whose xmin or catalog_xmin is
older than the configured age get invalidated, allowing vacuum to
proceed and biasing towards database availability.
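
To make the intended check concrete, here is a minimal sketch of the
age comparison. It is not the patch's code: the function names are
hypothetical, treating 0 as "disabled" is an assumption, and the
arithmetic ignores the reserved special XIDs for simplicity.

```python
from typing import Optional

XID_SPACE = 2**32  # transaction IDs are 32-bit counters that wrap around


def xid_age(next_xid: int, slot_xid: int) -> int:
    """XIDs consumed since slot_xid, computed modulo 2^32."""
    return (next_xid - slot_xid) % XID_SPACE


def slot_exceeds_xid_age(
    next_xid: int,
    xmin: Optional[int],
    catalog_xmin: Optional[int],
    max_slot_xid_age: int,
) -> bool:
    """Hypothetical check: True if either horizon held by the slot is too old.

    None stands for an unset XID. A max_slot_xid_age of 0 is assumed
    to disable the check.
    """
    if max_slot_xid_age <= 0:
        return False
    return any(
        xid is not None and xid_age(next_xid, xid) > max_slot_xid_age
        for xid in (xmin, catalog_xmin)
    )
```

For example, with next_xid = 1000 and a slot holding xmin = 900, the
slot would be flagged when max_slot_xid_age is 50 but not when it is
100.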

Invalidation happens during CHECKPOINT, similar to
'idle_replication_slot_timeout', and when VACUUM runs.

The patch currently attempts invalidation once per autovacuum worker.
We're wondering if it should instead attempt invalidation on a
per-relation basis within the vacuum call itself. That would account
for scenarios where the cost_delay or naptime between autovacuum runs
is high.

Thanks,

John H

[1] https://www.postgresql.org/message-id/flat/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe%2Baw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CALj2ACXe8%2BxSNdMXTMaSRWUwX7v61Ad4iddUwnn%3DdjSwx3GLLg%40mail.gmail.com

-- 
John Hsu - Amazon Web Services
