Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id ZfldthofFLdPibEM@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Hi,

On Tue, Mar 19, 2024 at 10:56:25AM +0530, Amit Kapila wrote:
> On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Agree. While it makes sense to invalidate slots for wal removal in
> > CreateCheckPoint() (because this is the place where wal is removed), I 'm not
> > sure this is the right place for the 2 new cases.
> >
> > Let's focus on the timeout one as proposed above (as probably the simplest one):
> > as this one is purely related to time and activity what about to invalidate them
> > when?:
> >
> > - their usage resume
> > - in pg_get_replication_slots()
> >
> > The idea is to invalidate the slot when one resumes activity on it or wants to
> > get information about it (and among other things wants to know if the slot is
> > valid or not).
> >
> 
> Trying to invalidate at those two places makes sense to me but we
> still need to cover the cases where it takes very long to resume the
> slot activity and the dangling slot cases where the activity is never
> resumed.

I understand it's better to have the slot reflecting its real status internally
but it is a real issue if that's not the case until the activity on it is resumed?
(just asking, not saying we should not)

> How about apart from the above two places, trying to
> invalidate in CheckPointReplicationSlots() where we are traversing all
> the slots?

I think that's a good place but there is still a window of time (that could also
be "large" depending of the activity and the checkpoint frequency) during which
the slot is not known as invalid internally. But yeah, at leat we know that we'll
mark it as invalid at some point...

BTW:

        if (am_walsender)
        {
+               if (slot->data.persistency == RS_PERSISTENT)
+               {
+                       SpinLockAcquire(&slot->mutex);
+                       slot->data.last_inactive_at = GetCurrentTimestamp();
+                       slot->data.inactive_count++;
+                       SpinLockRelease(&slot->mutex);

I'm also feeling the same concern as Shveta mentioned in [1]: that a "normal"
backend using pg_logical_slot_get_changes() or friends would not set the
last_inactive_at.

[1]: https://www.postgresql.org/message-id/CAJpy0uD64X%3D2ENmbHaRiWTKeQawr-rbGoy_GdhQQLVXzUSKTMg%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Next
From: Peter Eisentraut
Date:
Subject: Re: Inconsistent printf placeholders