RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id TY4PR01MB1690721034ADBE0447A3FFA3C943AA@TY4PR01MB16907.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
On Thursday, August 28, 2025 6:07 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > I noticed that Cfbot failed to compile the document due to a typo
> > after renaming the subscription option. Here are the updated V67
> > patches to fix that, only the doc in 0001 is modified.
> >
> 
> Please find a  few comments:

Thanks for the comments.

> 
> patch001 :
> 1)
> Assert is hit while testing patch001 alone:
> TRAP: failed Assert("TransactionIdIsValid(nonremovable_xid)"), File:
> "launcher.c", Line: 1394, PID: 55350.
> 
> Scenario:
> I have 2 subs, both have stopped retention. Now on one of the sub, if I do this:
> 
> --switch off rdt:
> alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=off); alter subscription sub1
> enable;
> 
> --switch back rdt, launcher asserts after this alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=on); alter subscription sub1
> enable;

I fixed it by skipping collecting the xid from workers when the overall
retention has been stopped.

When fixing the reported Assert failure, I noticed another race condition that
can hit this Assert: If the launcher has obtained
pg_subscription.retentionactive and has not yet called
compute_min_nonremovable_xid(), then if the apply worker stops retention and
sets oldest_nonremovable_xid to invalid, the launcher will hit the assertion
again when calling compute_min_nonremovable_xid(). I fixed it by adding a check
in compute function:

    /*
     * Return if the apply worker has stopped retention concurrently.
     *
     * Although this function is invoked only when retentionactive is true, the
     * apply worker might stop retention after the launcher fetches the
     * retentionactive flag.
     */
    if (!TransactionIdIsValid(nonremovable_xid))
        return;

The logic in 0002 is also adjusted to avoid assigning the slot.xmin to worker in this
case.

> 
> 2)
> +          Maximum duration for which this subscription's apply worker
> is allowed
> +          to retain the information useful for conflict detection when
> +          <literal>retain_dead_tuples</literal> is enabled for the
> associated
> +          subscriptions.
> 
> associated subscriptions --> associated subscription (since we are talking
> about 'this subscription's apply worker')

I have reworded this part based on Amit's suggestion.

> 
> 3)
> +          The information useful for conflict detection is no longer
> retained if
> +          all apply workers associated with the subscriptions, where
> +          <literal>retain_dead_tuples</literal> is enabled, confirm that the
> +          retention duration exceeded the
> 
> A trivial thing: 'retention duration has exceeded' sounds better to me.

Changed

> 
> ~~
> 
> patch002:
> (feel free to defer these comments if we are focusing on patch001 right now):
> 
> 4)
> stop_conflict_info_retention() has the correct message and detail in
> patch01 while in patch02, it is switched back to the older one (wrong one).
> Perhaps some merge mistake

Right, It a miss when rebasing the 0002. Fixed.

> 
> 5)
> resume_conflict_info_retention() still refers to the old GUC name:
> max_conflict_retention_duration.

Fixed.

> 
> 6)
> In compute_min_nonremovable_xid(), shall we have
> Assert(TransactionIdIsValid(nonremovable_xid)) before assigning it to
> worker->oldest_nonremovable_xid? Here:
> 
> + nonremovable_xid = MyReplicationSlot->data.xmin;
> +
> + SpinLockAcquire(&worker->relmutex);
> + worker->oldest_nonremovable_xid = nonremovable_xid;
> + SpinLockRelease(&worker->relmutex);

Added.

> 7)
> Now since compute_min_nonremovable_xid() is also taking care of assigning
> valid xid to the worker, shall we update that in comment as well? We can have
> one more line:
> 'Additionally if any of the apply workers has invalid xid, assign slot's xmin to it.'

Added.

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Conflict detection for update_deleted in logical replication
Next
From: Chao Li
Date:
Subject: Re: Mark ItemPointer parameters as const in tuple/table lock functions