Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers
| From | Masahiko Sawada |
|---|---|
| Subject | Re: Assertion failure in SnapBuildInitialSnapshot() |
| Date | |
| Msg-id | CAD21AoB3Jhi6HseKFDvcffcwvr3Vgh2ahKjXn7_628zLYby97w@mail.gmail.com Whole thread Raw |
| In response to | RE: Assertion failure in SnapBuildInitialSnapshot() ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
| List | pgsql-hackers |
On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > > > On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > I have been thinking if there a way to avoid holding > > > > > ReplicationSlotControlLock exclusively in > > > > > ReplicationSlotsComputeRequiredXmin() because that could cause lock > > contention when many slots exist and advancements occur frequently. > > > > > > > > > > Given that the bug arises from a race condition between slot > > > > > creation and concurrent slot xmin computation, I think another way > > > > > is that, we acquire the ReplicationSlotControlLock exclusively > > > > > only during slot creation to do the initial update of the slot > > > > > xmin. In ReplicationSlotsComputeRequiredXmin(), we still hold the > > > > > ReplicationSlotControlLock in shared mode until the global slot > > > > > xmin is updated in ProcArraySetReplicationSlotXmin(). This > > > > > approach prevents concurrent computations and updates of new xmin > > > > > horizons by other backends during the initial slot xmin update process, > > while it still permits concurrent calls to > > ReplicationSlotsComputeRequiredXmin(). > > > > > > > > > > > > > Yeah, this seems to work. > > > > > > +1 > > > > Given that the computation of xmin and catalog_xmin among all slots could > > be executed concurrently, could the following scenario happen where > > procArray->replication_slot_xmin and > > procArray->replication_slot_catalog_xmin are retreat to a non-invalid > > XID? > > > > 1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50. > > 2. Process-A updates its owned slot's catalog_xmin to 100, and computes the > > new catalog_xmin as 100 while holding ReplicationSlotControlLock in a shared > > mode in ReplicationSlotsComputeRequiredLSN(). But it doesn't update the > > procArray's catalog_xmin value yet. > > 3. Process-B updates its owned slot's catalog_xmin to 150, and computes the > > new catalog_xmin as 150. > > 4. Process-B updates the procArray->replication_slot_catalog_xmin to 150. > > 5. Process-A updates the procArray->repilcation_slot_catalog_xmin to 100, > > which was 150. > > After further investigation, I think that steps 3 and 4 cannot occur because > Process-B must have already encountered the catalog_xmin maintained by > Process-A, either 50 or 100. Consequently, Process-B will refrain from updating > the catalog_xmin to a more recent value, such as 150. Right. But the following scenario seems to happen: 1. Both processes have a slot with effective_catalog_xmin = 100. 2. Process-A updates effective_catalog_xmin to 150, and computes the new catalog_xmin as 100 because process-B slot still has effective_catalog_xmin = 100. 3. Process-B updates effective_catalog_xmin to 150, and computes the new catalog_xmin as 150. 4. Process-B updates procArray->replication_slot_catalog_xmin to 150. 5. Process-A updates procArray->replication_slot_catalog_xmin to 100. > > > > > It might be worth adding an assertion to ProcArraySetReplicationSlotXmin(), > > checking if the new xmin and catalog_xmin values are either >= the current > > values or an InvalidTransactionId. > > I considered this scenario and identified a potential exception in the > copy_replication_slot(). This function uses a two-phase copy process, the > original restart_lsn is directly copied to the new slot during the first phase. > However, the original slot.restart_lsn might advance between phases. > Consequently, the newly created slot initially uses the outdated restart_lsn, > which could cause the procArray->replication_slot_catalog_xmin to retreat. I > think this behavior isn't harmful, as explained in the comments, because the new > restart_lsn will be updated in the created slot during the second phase. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: