Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers
| From | Pradeep Kumar |
|---|---|
| Subject | Re: Assertion failure in SnapBuildInitialSnapshot() |
| Date | |
| Msg-id | CAJ4xhPmdjViFTGTsLd2VuziSYYVUCfq9v-t-dL-WYT__RpouiQ@mail.gmail.com Whole thread Raw |
| In response to | Re: Assertion failure in SnapBuildInitialSnapshot() (Masahiko Sawada <sawada.mshk@gmail.com>) |
| Responses |
RE: Assertion failure in SnapBuildInitialSnapshot()
|
| List | pgsql-hackers |
I've been investigating the assert failure in ProcArraySetReplicationSlotXmin() and would like to share my approach and get feedback. Instead of inverting the locks and what robert shared before [1].
Instead of unconditionally updating procArray->replication_slot_xmin in ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates conditional:
1) Only update if the incoming xmin is valid
2) Only update if it's older than the currently stored xmin
3) Do the same for procArray->replication_slot_catalog_xmin
void ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin, bool already_locked)
{
Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
if (!already_locked)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
if (TransactionIdIsValid(xmin))
{
if (!TransactionIdIsValid(procArray->replication_slot_xmin) || TransactionIdPrecedes(xmin, procArray->replication_slot_xmin))
procArray->replication_slot_xmin = xmin;
}
if (TransactionIdIsValid(catalog_xmin))
{
if (!TransactionIdIsValid(procArray->replication_slot_catalog_xmin) || TransactionIdPrecedes(catalog_xmin, procArray->replication_slot_catalog_xmin))
procArray->replication_slot_catalog_xmin = catalog_xmin;
}
if (!already_locked)
LWLockRelease(ProcArrayLock);
elog(DEBUG1, "xmin required by slots: data %u, catalog %u", xmin, catalog_xmin);
}
In above block of code ensures we always track the minimum xmin across all active replication slots without losing data. And also no need to worry about locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot() while we computing safexid by calling GetOldestSafeDecodingTransactionId(false) will enters into first case and update the oldestSafeXid = procArray->replication_slot_xmin. So it won't return nextXid. And also it solves this issue [2].
[1] https://www.postgresql.org/message-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
Instead of unconditionally updating procArray->replication_slot_xmin in ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates conditional:
1) Only update if the incoming xmin is valid
2) Only update if it's older than the currently stored xmin
3) Do the same for procArray->replication_slot_catalog_xmin
void ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin, bool already_locked)
{
Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
if (!already_locked)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
if (TransactionIdIsValid(xmin))
{
if (!TransactionIdIsValid(procArray->replication_slot_xmin) || TransactionIdPrecedes(xmin, procArray->replication_slot_xmin))
procArray->replication_slot_xmin = xmin;
}
if (TransactionIdIsValid(catalog_xmin))
{
if (!TransactionIdIsValid(procArray->replication_slot_catalog_xmin) || TransactionIdPrecedes(catalog_xmin, procArray->replication_slot_catalog_xmin))
procArray->replication_slot_catalog_xmin = catalog_xmin;
}
if (!already_locked)
LWLockRelease(ProcArrayLock);
elog(DEBUG1, "xmin required by slots: data %u, catalog %u", xmin, catalog_xmin);
}
In above block of code ensures we always track the minimum xmin across all active replication slots without losing data. And also no need to worry about locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot() while we computing safexid by calling GetOldestSafeDecodingTransactionId(false) will enters into first case and update the oldestSafeXid = procArray->replication_slot_xmin. So it won't return nextXid. And also it solves this issue [2].
[1] https://www.postgresql.org/message-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
On Fri, Nov 7, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Nov 6, 2025 at 8:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 8:30 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, November 7, 2025 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > Good point. This can happen when the last slot is invalidated or dropped.
> > >
> > > After the last slot is invalidated or dropped, both slot_xmin and
> > > slot_catalog_xmin values are set InvalidTransactionId. Then in this
> > > case, these values are ignored when computing the oldest safe decoding
> > > XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
> > > there is a case where slot_xmin and slot_catalog_xmin retreat to a
> > > valid XID?
> >
> > I think when replication_slot_xmin is invalid,
> > GetOldestSafeDecodingTransactionId would return nextXid, which can be greater
> > than the original snap.xmin if some transaction IDs have been assigned.
> >
>
> Won't we have a problem that values of
> procArray->replication_slot_xmin and
> procArray->replication_slot_catalog_xmin won't be set to
> InvalidTransactionId after last slot removal due to a new check unless
> we do special treatment for drop/invalidation of a slot? And that
> would lead to accumulating dead rows even when not required.
I understand Hou-san's point. Agreed. procArray->replication_slot_xmin
and replication_slot_catalog_xmin should not retreat to a valid XID
but could become 0 (invalid). Let's consider the idea of inverting the
locks as Andres proposed[1].
Regards,
[1] https://www.postgresql.org/message-id/20230207194903.ws4acm7ake6ikacn%40awork3.anarazel.de
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: