Home > mailing lists

Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From	Chao Li
Subject	Re: Assertion failure in SnapBuildInitialSnapshot()
Date	December 30, 2025 06:54:43
Msg-id	7292D035-0AFC-4E79-BF8E-2D979D075FDA@gmail.com Whole thread Raw
In response to	Re: Assertion failure in SnapBuildInitialSnapshot() (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses	Re: Assertion failure in SnapBuildInitialSnapshot()
List	pgsql-hackers

Tree view


> On Dec 30, 2025, at 06:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>>
>> On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>
>>> On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
>>> wrote:
>>>>
>>>> On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada
>>> <sawada.mshk@gmail.com> wrote:
>>>>>
>>>>> On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
>>>>> <houzj.fnst@fujitsu.com> wrote:
>>>>>>
>>>>>> On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
>>>>> <sawada.mshk@gmail.com> wrote:
>>>>>>> Right. But the following scenario seems to happen:
>>>>>>>
>>>>>>> 1. Both processes have a slot with effective_catalog_xmin = 100.
>>>>>>> 2. Process-A updates effective_catalog_xmin to 150, and computes
>>>>>>> the
>>>>> new
>>>>>>> catalog_xmin as 100 because process-B slot still has
>>>>> effective_catalog_xmin =
>>>>>>> 100.
>>>>>>> 3. Process-B updates effective_catalog_xmin to 150, and computes
>>>>>>> the
>>>>> new
>>>>>>> catalog_xmin as 150.
>>>>>>> 4. Process-B updates procArray->replication_slot_catalog_xmin to
>>> 150.
>>>>>>> 5. Process-A updates procArray->replication_slot_catalog_xmin to
>>> 100.
>>>>>>
>>>>>> I think this scenario can occur, but is not harmful. Because the
>>>>>> catalog rows removed prior to xid:150 would no longer be used, as
>>>>>> both slots have
>>>>> advanced
>>>>>> their catalog_xmin and flushed the value to disk. Therefore, even
>>>>>> if replication_slot_catalog_xmin regresses, it should be OK.
>>>>>>
>>>>>> Considering all above, I think allowing concurrent xmin
>>>>>> computation, as the patch does, is acceptable. What do you think ?
>>>>>
>>>>> I agree with your analysis. Another thing I'd like to confirm is
>>>>> that in an extreme case, if the server crashes suddenly after
>>>>> removing catalog tuples older than XID 100 and logical decoding
>>>>> restarts, it ends up missing necessary catalog tuples? I think it's
>>>>> not a problem as long as the subscriber knows the next commit LSN
>>>>> they want but could it be problematic if the user switches to use
>>>>> the logical decoding SQL API? I might be worrying too much, though.
>>>>
>>>> I think this case is not a problem because:
>>>>
>>>> In LogicalConfirmReceivedLocation, the updated restart_lsn and
>>>> catalog_xmin are flushed to disk before the effective_catalog_xmin is
>>>> updated. Thus, once replication_slot_catalog_xmin advances to a
>>>> certain value, even in the event of a crash, users won't encounter any
>>>> removed tuples when consuming from slots after a restart. This is
>>>> because all slots have their updated restart_lsn flushed to disk,
>>>> ensuring that upon restarting, changes are decoded from the updated
>>> position where older catalog tuples are no longer needed.
>>>
>>> Agreed.
>>>
>>>>
>>>> BTW, I assume you meant catalog tuples older than XID 150 are removed,
>>>> since in the previous example, tuples older than XID 100 are already not
>>> useful.
>>>
>>> Right. Thank you for pointing this out.
>>>
>>> I think we can proceed with the idea proposed by Hou-san. Regarding the
>>> patch, while it mostly looks good, it might be worth considering adding more
>>> comments:
>>>
>>> - If the caller passes already_locked=true to
>>> ReplicationSlotsComputeRequiredXmin(), the lock order should also be
>>> considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
>>> - ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
>>> process, resulting in temporarily moving
>>> procArray->replication_slot_catalog_xmin backward, but it's harmless
>>> because a smaller catalog_xmin is conservative: it merely prevents VACUUM
>>> from removing catalog tuples that could otherwise be pruned. It does not lead
>>> to premature deletion of required data.
>>
>> Thanks for the comments. I added some more comments as suggested.
>>
>> Here is the updated patch.
>
> Thank you for updating the patch! The patch looks good to me.
>
> I've made minor changes to the comment and commit message and created
> patches for backbranches. I'm going to push them, barring any
> objections.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>
<master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>


I’ve just looked through the patch for master. The fix itself looks solid to me. I only noticed a few minor comment
nits:

1
```
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
```

acquire -> acquires

2
```
+     * values, so no backend update the initial xmin for newly created slot
```

Update -> updates

3
```
+     * slot machinery about the new limit. Once that's done the both locks
```

“The both locks”, feels like “the” is not needed.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

pgsql-hackers by date:

From: Chao Li
Date: 30 December 2025, 06:24:37
Subject: Re: Implement waiting for wal lsn replay: reloaded

From: Michael Paquier
Date: 30 December 2025, 07:31:30
Subject: Re: Refactor query normalization into core query jumbling

Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

Previous

Next