Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id CALj2ACWqLneo2kqP5DEUwDXyHoM5zPn2vE+27oB+Cse9Rvzo_A@mail.gmail.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
Hi,

Thanks for reviewing.

On Tue, Sep 3, 2024 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> 1)
> I see that ReplicationSlotAlter() will error out if the slot is
> invalidated due to timeout. I have not tested it myself, but do you
> know if  slot-alter errors out for other invalidation causes as well?
> Just wanted to confirm that the behaviour is consistent for all
> invalidation causes.

Will respond to Amit's comment soon.

> 2)
> When a slot is invalidated, and we try to use that slot, it gives this msg:
>
> ERROR:  can no longer get changes from replication slot "mysubnew1_2"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-03 14:23:34.094067+05:30, which is more than 600 seconds ago.
> HINT:  You might need to increase "replication_slot_inactive_timeout.".
>
> Isn't HINT misleading? Even if we increase it now, the slot can not be
> reused again.
>
> Below is one side effect if inactive_since keeps on changing:
>
> postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
> pg_current_wal_lsn());
> ERROR:  can no longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 10:03:56.68053+05:30, which is more than 10 seconds ago.
> HINT:  You might need to increase "replication_slot_inactive_timeout.".
>
> postgres=# select now();
>                now
> ---------------------------------
>  2024-09-04 10:04:00.26564+05:30
>
> 'DETAIL' gives wrong information, we are not past 10-seconds. This is
> because inactive_since got updated even in ERROR scenario.
>
> ERROR:  can no longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 10:06:38.980939+05:30, which is more than 129600 seconds
> ago.
> postgres=# select now();
>                now
> ----------------------------------
>  2024-09-04 10:07:35.201894+05:30
>
> I feel we should change this message itself.

Removed the hint and corrected the detail message as following:

errmsg("can no longer get changes from replication slot \"%s\"",
NameStr(s->data.name)),
errdetail("This slot has been invalidated because it was inactive for
longer than the amount of time specified by \"%s\".",
"replication_slot_inactive_timeout.")));

> 3)
> When the slot is invalidated, the' inactive_since' still keeps on
> changing when there is a subscriber trying to start replication
> continuously. I think ReplicationSlotAcquire() keeps on failing and
> thus Release keeps on setting it again and again. Shouldn't we stop
> setting/chnaging  'inactive_since' once the slot is invalidated
> already, otherwise it will be misleading.
>
> postgres=# select failover,synced,inactive_since,invalidation_reason
> from pg_replication_slots;
>
>  failover | synced |          inactive_since          | invalidation_reason
> ----------+--------+----------------------------------+---------------------
>  t        | f      | 2024-09-03 14:23:.. | inactive_timeout
>
> after sometime:
>  failover | synced |          inactive_since          | invalidation_reason
> ----------+--------+----------------------------------+---------------------
>  t        | f      | 2024-09-03 14:26:..| inactive_timeout

Changed it to not update inactive_since for slots invalidated due to
inactive timeout.

> 4)
> src/sgml/config.sgml:
>
> 4a)
> + A value of zero (which is default) disables the timeout mechanism.
>
> Better will be:
> A value of zero (which is default) disables the inactive timeout
> invalidation mechanism .

Changed.

> 4b)
> 'synced' and inactive_since should point to pg_replication_slots:
>
> example:
> <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>synced</structfield>

Modified.

> 5)
> src/sgml/system-views.sgml:
> + ..the slot has been inactive for longer than the duration specified
> by replication_slot_inactive_timeout parameter.
>
> Better to have:
> ..the slot has been inactive for a time longer than the duration
> specified by the replication_slot_inactive_timeout parameter.

Changed it to the following to be consistent with the config.sgml.

          <literal>inactive_timeout</literal> means that the slot has been
          inactive for longer than the amount of time specified by the
          <xref linkend="guc-replication-slot-inactive-timeout"/> parameter.

Please find the v45 patch posted upthread at
https://www.postgresql.org/message-id/CALj2ACWXQT3_HY40ceqKf1DadjLQP6b1r%3D0sZRh-xhAOd-b0pA%40mail.gmail.com
for the changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Oliver Ford
Date:
Subject: Re: [PoC] Add CANONICAL option to xmlserialize
Next
From: Stepan Neretin
Date:
Subject: Re: Sort functions with specialized comparators