Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Introduce XID age and inactive timeout based replication slot invalidation
Date
Msg-id ZdXrtXLkjvIJMYvB@ip-10-97-1-34.eu-west-3.compute.internal
In response to Re: Introduce XID age and inactive timeout based replication slot invalidation  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Introduce XID age and inactive timeout based replication slot invalidation
List pgsql-hackers
Hi,

On Wed, Feb 21, 2024 at 10:55:00AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 20, 2024 at 12:05 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > [...] and was able to produce something like:
> > >
> > > postgres=# select slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from pg_replication_slots;
> > >   slot_name  | slot_type | active | active_pid | wal_status | invalidation_reason
> > > -------------+-----------+--------+------------+------------+---------------------
> > >  rep1        | physical  | f      |            | reserved   |
> > >  master_slot | physical  | t      |    1482441 | unreserved | wal_removed
> > > (2 rows)
> > >
> > > does that make sense to have an "active/working" slot "invalidated"?
> >
> > Thanks. Can you please provide the steps to generate this error? Are
> > you setting max_slot_wal_keep_size on primary to generate
> > "wal_removed"?
> 
> I'm able to reproduce [1] the state [2] where the slot got invalidated
> first, then its wal_status became unreserved, but still the slot is
> serving after the standby comes up online after it catches up with the
> primary getting the WAL files from the archive. There's a good reason
> for this state -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/slotfuncs.c;h=d2fa5e669a32f19989b0d987d3c7329851a1272e;hb=ff9e1e764fcce9a34467d614611a34d4d2a91b50#l351.
> This intermittent state can only happen for physical slots, not for
> logical slots because logical subscribers can't get the missing
> changes from the WAL stored in the archive.
> 
> And, the fact looks to be that an invalidated slot can never become
> normal but still can serve a standby if the standby is able to catch
> up by fetching required WAL (this is the WAL the slot couldn't keep
> for the standby) from elsewhere (archive via restore_command).
> 
> As far as the 0001 patch is concerned, it reports the
> invalidation_reason as long as slot_contents.data.invalidated !=
> RS_INVAL_NONE. I think this is okay.
> 
> Thoughts?

Yeah, looking at the code, I agree that it looks ok. OTOH, an invalidated slot
that is still serving a standby looks confusing, so maybe we should add a few
words about it in the docs?

Looking at v5-0001:

+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>invalidation_reason</structfield> <type>text</type>
+      </para>
+      <para>

My initial thought was to put a "conflict" value in this new field in case of
conflict (without repeating the conflict reason in it). With the current
proposal, invalidation_reason could report the same value as conflict_reason,
which sounds weird to me.

Would it make sense to you to use "conflict" as the value of
"invalidation_reason" when the slot's "conflict_reason" is not NULL?
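
To make the idea concrete, here is a rough sketch in Python (it is a model of
the reporting rule only, not PostgreSQL code). The function name and its two
arguments are made up for illustration, and it assumes conflict_reason is NULL
(None here) for slots that are not in conflict, such as physical slots:

```python
def reported_invalidation_reason(conflict_reason, invalidation_cause):
    """Model of the value invalidation_reason would show under this idea.

    conflict_reason:    the slot's conflict_reason, or None if NULL (assumed
                        to be NULL for slots that are not in conflict).
    invalidation_cause: text form of the slot's invalidation cause, or None
                        if the slot has not been invalidated.
    Both parameters are hypothetical names used only for this sketch.
    """
    if conflict_reason is not None:
        # The details stay in conflict_reason; do not repeat them here.
        return "conflict"
    # No conflict: report the invalidation cause as-is (None if not invalidated).
    return invalidation_cause


# Physical slot invalidated by WAL removal: no conflict, cause reported directly.
print(reported_invalidation_reason(None, "wal_removed"))
# Slot with a conflict: the new field only says "conflict".
print(reported_invalidation_reason("rows_removed", "rows_removed"))
```

With that rule the two columns would never show the same text: the new field
would only say that a conflict exists, and conflict_reason would say which one.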

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


