RE: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id TYAPR01MB56927564EEE26E5433198405F5292@TYAPR01MB5692.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Dear Nisha,

> 
> Attached v51 patch-set addressing all comments in [1] and [2].
>

Thanks for working on the feature! I've stated to review the patch.
Here are my comments - sorry if there are something which have already been discussed.
The thread is too long to follow correctly.

Comments for 0001
=============

01. binary_upgrade_logical_slot_has_caught_up

ISTM that error_if_invalid is set to true when the slot can be moved forward, otherwise
it is set to false. Regarding the binary_upgrade_logical_slot_has_caught_up, however,
only valid slots will be passed to the funciton (see pg_upgrade/info.c) so I feel
it is OK to set to true. Thought?

02. ReplicationSlotAcquire

According to other functions, we are adding to a note to the translator when
parameters represent some common nouns, GUC names. I feel we should add a comment
for RS_INVAL_WAL_LEVEL part based on it.


Comments for 0002
=============

03. check_replication_slot_inactive_timeout

Can we overwrite replication_slot_inactive_timeout to zero when pg_uprade (and also
pg_createsubscriber?) starts a server process? Several parameters have already been
specified via -c option at that time. This can avoid an error while the upgrading.
Note that this part is still needed even if you accept the comment. Users can
manually boot with upgrade mode.

04. ReplicationSlotAcquire

Same comment as 02.

05. ReportSlotInvalidation

Same comment as 02.

06. found bug

While testing the patch, I found that slots can be invalidated too early when when
the GUC is quite large. I think because an overflow is caused in InvalidatePossiblyObsoleteSlot().

- Reproducer

I set the replication_slot_inactive_timeout to INT_MAX and executed below commands,
and found that the slot is invalidated.

```
postgres=# SHOW replication_slot_inactive_timeout;
 replication_slot_inactive_timeout 
-----------------------------------
 2147483647s
(1 row)
postgres=# SELECT * FROM pg_create_logical_replication_slot('test', 'test_decoding');
 slot_name |    lsn    
-----------+-----------
 test      | 0/18B7F38
(1 row)
postgres=# CHECKPOINT ;
CHECKPOINT
postgres=# SELECT slot_name, inactive_since, invalidation_reason FROM pg_replication_slots ;
 slot_name |        inactive_since         | invalidation_reason 
-----------+-------------------------------+---------------------
 test      | 2024-11-28 07:50:25.927594+00 | inactive_timeout
(1 row)
```

- analysis

In InvalidatePossiblyObsoleteSlot(), replication_slot_inactive_timeout_sec * 1000
is passed to the third argument of TimestampDifferenceExceeds(), which is also the
integer datatype. This causes an overflow and parameter is handled as the small
value.

- solution

I think there are two possible solutions. You can choose one of them:

a. Make the maximum INT_MAX/1000, or
b. Change the unit to millisecond.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: Anthonin Bonnefoy
Date:
Subject: Re: Consider pipeline implicit transaction as a transaction block
Next
From: "Andrey M. Borodin"
Date:
Subject: Re: UUID v7