RE: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id TY4PR01MB1690784F8EFFDC149FE67FAF594CDA@TY4PR01MB16907.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Assertion failure in SnapBuildInitialSnapshot()  (Pradeep Kumar <spradeepkumar29@gmail.com>)
List pgsql-hackers
On Wednesday, November 12, 2025 7:27 PM Pradeep Kumar <spradeepkumar29@gmail.com>  wote:
> I've been investigating the assert failure in
> ProcArraySetReplicationSlotXmin() and would like to share my approach and get
> feedback. Instead of inverting the locks and what robert shared before [1].
> Instead of unconditionally updating procArray->replication_slot_xmin in
> ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates
> conditional:
> 1) Only update if the incoming xmin is valid
> 2) Only update if it's older than the currently stored xmin
> 3) Do the same for procArray->replication_slot_catalog_xmin
...
> In above block of code ensures we always track the minimum xmin across all
> active replication slots without losing data. And also no need to worry about
> locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot()
> while we computing safexid by calling GetOldestSafeDecodingTransactionId(false)
> will enters into first case and update the oldestSafeXid =
> procArray->replication_slot_xmin. So it won't return nextXid. And also it solves
> this issue [2].

Thanks for evaluating new approach, but I think this approach could not work
because we expect replication_slot_xmin to be set to an invalid number when the
last slot is dropped, while this approach would disallow that, causing WALs to
be retained. For a detailed explanation, please refer to [1].

While testing the patches across all branches, I noticed that an additional lock
needs to be added in the launcher.c where
ReplicationSlotsComputeRequiredXmin(true) was recently added for conflict
detection slot. I have modified the original patch accordingly.

BTW, I am not adding a test using an injection point because it does not seem
practical to insert an injection point inner
ReplicationSlotsComputeRequiredXmin. The reason is that the injection point
function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in the
patch holds the lwlock, holding holds interrupts.

I am sharing the patches for all branches for reference.

[1]
https://www.postgresql.org/message-id/TY4PR01MB169070EE618FA2908B3D2F2AE94C3A%40TY4PR01MB16907.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Attachment

pgsql-hackers by date:

Previous
From: Shinya Kato
Date:
Subject: Re: Add mode column to pg_stat_progress_vacuum
Next
From: Chao Li
Date:
Subject: Re: Suggestion to add --continue-client-on-abort option to pgbench