Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From shveta malik
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CAJpy0uC8Dg-0JS3NRUwVUemgz5Ar2v3_EQQFXyAigWSEQ8U47Q@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
On Sat, Aug 31, 2024 at 1:45 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
>
> Please find the attached v44 patch with the above changes. I will
> include the 0002 xid_age based invalidation patch later.
>

Thanks for the patch Bharath. My review and testing is WIP, but please
find few comments and queries:

1)
I see that ReplicationSlotAlter() will error out if the slot is
invalidated due to timeout. I have not tested it myself, but do you
know if  slot-alter errors out for other invalidation causes as well?
Just wanted to confirm that the behaviour is consistent for all
invalidation causes.

2)
When a slot is invalidated, and we try to use that slot, it gives this msg:

ERROR:  can no longer get changes from replication slot "mysubnew1_2"
DETAIL:  The slot became invalid because it was inactive since
2024-09-03 14:23:34.094067+05:30, which is more than 600 seconds ago.
HINT:  You might need to increase "replication_slot_inactive_timeout.".

Isn't HINT misleading? Even if we increase it now, the slot can not be
reused again.


3)
When the slot is invalidated, the' inactive_since' still keeps on
changing when there is a subscriber trying to start replication
continuously. I think ReplicationSlotAcquire() keeps on failing and
thus Release keeps on setting it again and again. Shouldn't we stop
setting/chnaging  'inactive_since' once the slot is invalidated
already, otherwise it will be misleading.

postgres=# select failover,synced,inactive_since,invalidation_reason
from pg_replication_slots;

 failover | synced |          inactive_since          | invalidation_reason
----------+--------+----------------------------------+---------------------
 t        | f      | 2024-09-03 14:23:.. | inactive_timeout

after sometime:
 failover | synced |          inactive_since          | invalidation_reason
----------+--------+----------------------------------+---------------------
 t        | f      | 2024-09-03 14:26:..| inactive_timeout


4)
src/sgml/config.sgml:

4a)
+ A value of zero (which is default) disables the timeout mechanism.

Better will be:
A value of zero (which is default) disables the inactive timeout
invalidation mechanism .
or
A value of zero (which is default) disables the slot invalidation due
to the inactive timeout mechanism.

i.e. rephrase to indicate that invalidation is disabled.

4b)
'synced' and inactive_since should point to pg_replication_slots:

example:
<link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>synced</structfield>

5)
src/sgml/system-views.sgml:
+ ..the slot has been inactive for longer than the duration specified
by replication_slot_inactive_timeout parameter.

Better to have:
..the slot has been inactive for a time longer than the duration
specified by the replication_slot_inactive_timeout parameter.

thanks
Shveta



pgsql-hackers by date:

Previous
From: Jehan-Guillaume de Rorthais
Date:
Subject: Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
Next
From: Peter Eisentraut
Date:
Subject: Re: altering a column's collation leaves an invalid foreign key