RE: Newly created replication slot may be invalidated by checkpoint - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Newly created replication slot may be invalidated by checkpoint
Date
Msg-id TY4PR01MB169071C25B2288ECD746AA65094A3A@TY4PR01MB16907.jpnprd01.prod.outlook.com
In response to RE: Newly created replication slot may be invalidated by checkpoint  ("Vitaly Davydov" <v.davydov@postgrespro.ru>)
List pgsql-hackers
On Tuesday, December 9, 2025 12:40 AM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> 
> On Monday, December 08, 2025 13:24 MSK, "Zhijie Hou (Fujitsu)"
> <houzj.fnst@fujitsu.com> wrote:
> 
> > On Monday, December 8, 2025 5:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > Sawada-san/Vitaly, do you have any opinion on patch or the
> > > > > direction to fix? The idea is to get this fixed for HEAD and 18,
> > > > > then continue discussion for other back-branches and the remaining
> > > > > patches.
> 
> Hi Amit, Zhijie Hou
> 
> Thank you for preparing and committing 0001 patch. I'm ok with it. I did some
> auto testing of the patch and haven't found any problems. As I realized,
> another two patches (0002, 0003) are still in review.

Thanks for testing!

> 
> In my previous email I wrote about copy_replication_slot, where restart_lsn is
> assigned without any locks, but I'm not sure that email was successfully
> delivered. Masahiko Sawada mentioned it in one of the latest emails as
> well. I also read the answer but not completely understood it at the moment,
> sorry (need some more time to investigate). Anyway, I would prefer to use
> locks in create_physical_replication_slot rather than rely on signals handling
> which may be changed in the future.

If we want to improve that, taking a lock when updating restart_lsn does not work,
because the initial restart_lsn is an old position copied from another slot,
where the WAL could already have been removed. So, unlike the mechanism
mentioned in 006dd4b2, the lock cannot provide the same guarantee. I think we
might need some other solution for it, which can be discussed separately.

> 
> One more thing, when we copy a logical replication slot,
> DecodingContextFindStartpoint reads the WAL from the specified restart_lsn
> which may be removed by a concurrent checkpoint. It can produce an error
> and stop slot copying, I guess. This behaviour may be not desirable.

Given that we don't search for a starting point for copied slots (we
pass find_startpoint=false when copying), I don't think this issue exists.

Best Regards,
Hou zj
