Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: Conflict detection for update_deleted in logical replication
Date:
Msg-id: CAD21AoBA16LD01H46BFSdtJLLXPXR9XEXM8EvaOFvQ2Og0tkOw@mail.gmail.com
In response to: Re: Conflict detection for update_deleted in logical replication (shveta malik <shveta.malik@gmail.com>)
List: pgsql-hackers
On Tue, Nov 12, 2024 at 2:19 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, October 18, 2024 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > We thought of a few options to fix this:
> > > >
> > > > 1) Add a Time-Based Subscription Option:
> > > >
> > > > We could add a new time-based option for subscriptions, such as
> > > > retain_dead_tuples = '5s'. In the logical launcher, after obtaining
> > > > the candidate XID, the launcher will wait for the specified time
> > > > before advancing the slot.xmin. This ensures that deleted tuples are
> > > > retained for at least the duration defined by this new option.
> > > >
> > > > This approach is designed to simulate the functionality of the GUC
> > > > (vacuum_committs_age), but with a simpler implementation that does
> > > > not impact vacuum performance. We can maintain both this time-based
> > > > method and the current automatic method. If a user does not specify
> > > > the time-based option, we will continue using the existing approach
> > > > to retain dead tuples until all concurrent transactions from the
> > > > remote node have been applied.
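> > > >
> > > > To illustrate, the launcher-side wait could look roughly like this
> > > > (a sketch only; the variable names are illustrative, not actual
> > > > patch code):
> > > >
> > > > /* remember when the candidate XID was obtained */
> > > > candidate_xid_time = GetCurrentTimestamp();
> > > > ...
> > > > /* advance slot.xmin only after the configured duration has elapsed */
> > > > if (TimestampDifferenceExceeds(candidate_xid_time,
> > > >                                GetCurrentTimestamp(),
> > > >                                retain_dead_tuples_ms))
> > > >     can_advance_xmin = true;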
> > > >
> > > > 2) Modification to the Logical Walsender
> > > >
> > > > On the logical walsender, which runs on a physical standby, we can
> > > > build an additional connection to the physical primary to obtain the
> > > > latest WAL position. This position will then be sent as feedback to
> > > > the logical subscriber.
> > > >
> > > > A potential concern is that this requires the walsender to use the
> > > > walreceiver API, which may seem a bit unnatural. Also, it starts an
> > > > additional walsender process on the primary, as the logical walsender
> > > > on the physical standby will need to communicate with this walsender
> > > > to fetch the WAL position.
> > > >
> > >
> > > This idea is worth considering, but I think it may not be a good
> > > approach if the physical standby is cascading. We need to restrict
> > > update_deleted conflict detection if the standby is cascading, right?
> > >
> > > The other approach is that we send current_timestamp from the
> > > subscriber and somehow check if the physical standby has applied
> > > commit_lsn up to that commit_ts; if so, it can send that WAL position
> > > to the subscriber; otherwise, wait for it to be applied. If we do
> > > this, then we don't need to add a restriction for cascaded physical
> > > standbys. I think the subscriber anyway needs to wait for such an LSN
> > > to be applied on the standby before advancing the xmin, even if we
> > > get it from the primary. This is because the subscriber can only
> > > apply and flush the WAL once it has been applied on the standby. Am I
> > > missing something?
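> > >
> > > For illustration only, the standby-side wait might be sketched like
> > > this (the helper names are hypothetical):
> > >
> > > /* wait until everything committed before the subscriber's timestamp
> > >  * has been applied locally */
> > > while (last_applied_commit_ts() < subscriber_ts)
> > > {
> > >     WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 1000L,
> > >               WAIT_EVENT_RECOVERY_APPLY_DELAY);
> > >     ResetLatch(MyLatch);
> > > }
> > >
> > > /* it is now safe to report this replay position to the subscriber */
> > > reply_wal_pos = GetXLogReplayRecPtr(NULL);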
> > >
> > > This approach has the disadvantage that we are relying on clocks
> > > being synced on both nodes, which we anyway need for conflict
> > > resolution as discussed in the thread [1]. We also need to consider
> > > the commit timestamp and LSN inversion issue discussed in another
> > > thread [2] if we want to pursue this approach, because we may miss an
> > > LSN that has a prior timestamp.
> > >
>
> For the "publisher is also a standby" issue, I have modified the V8 patch to
> report a warning in this case. As I personally feel this is not the main use
> case for conflict detection, we can revisit it later after pushing the main
> patches and receiving some user feedback.
>
> >
> > The problem due to Commit Timestamp and LSN inversion is that the standby
> > may not consider the WAL LSN from an earlier timestamp, which could lead to
> > the removal of required dead rows on the subscriber.
> >
> > The other problem pointed out by Hou-San offlist due to Commit Timestamp
> > and LSN inversion is that we could miss sending the WAL LSN that the
> > subscriber requires to retain dead rows for the update_deleted conflict.
> > For example, consider the following two-node, bidirectional setup:
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1; ts=10.02 AM
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> >
> > Say a subscription is created with retain_dead_tuples = true/false.
> >
> > After executing T2, the apply worker on Node A will check the latest WAL
> > flush location on Node B. By that time, T3 should have finished, so the
> > xmin will be advanced only after applying the WAL that is later than T3.
> > So, the dead tuple will not be removed before applying T3, which means
> > the update_deleted conflict can be detected.
> >
> > As there is a gap between when we acquire the commit_timestamp and the
> > commit LSN, it is possible that T3 has not yet flushed its WAL even
> > though it committed earlier than T2. If this happens, then we won't be
> > able to detect the update_deleted conflict reliably.
> >
> > Now, one simpler idea is to acquire the commit timestamp and reserve the
> > WAL location (LSN) under the same spinlock in
> > ReserveXLogInsertLocation(), but that could be costly as discussed in
> > the thread [1]. The other, more localized solution is to acquire the
> > timestamp after reserving the commit WAL LSN, outside the lock, which
> > would solve this particular problem.
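> >
> > Roughly, that localized solution would order things as follows (a
> > heavily simplified sketch, not the actual ReserveXLogInsertLocation()
> > code):
> >
> > SpinLockAcquire(&Insert->insertpos_lck);
> > /* reserve the LSN for the commit record */
> > ...
> > SpinLockRelease(&Insert->insertpos_lck);
> >
> > /* acquire the commit timestamp only after the commit LSN is reserved */
> > commit_ts = GetCurrentTimestamp();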
>
> Since the discussion of the WAL/LSN inversion issue is ongoing, I also thought
> about another approach that can fix the issue independently. The idea is to
> delay the non-removable xid advancement until all the remote concurrent
> transactions that may have been assigned an earlier timestamp have committed.
>
> The implementation is:
>
> On the walsender, after receiving a request, it can send the oldest xid among
> transactions that are currently committing (oldest_committing_xid) and the
> next transaction ID (nextXid) along with the WAL position.
>
> In response, the apply worker can safely advance the non-removable XID if
> oldest_committing_xid == nextXid, indicating that there is no race condition.
>
> Alternatively, if oldest_committing_xid != nextXid, the apply worker can send
> a second request after some interval. If the subsequently obtained
> oldest_committing_xid equals or surpasses the initial nextXid, it indicates
> that all previously risky transactions have committed, and therefore the
> non-removable transaction ID can be advanced.
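>
> On the apply worker side, the check could then look roughly like this (a
> sketch; the variable names just follow the description above):
>
> if (FullTransactionIdEquals(oldest_committing_xid, next_xid))
> {
>     /* no transaction was committing when the walsender replied */
>     can_advance_nonremovable_xid = true;
> }
> else if (FullTransactionIdFollowsOrEquals(second_oldest_committing_xid,
>                                           initial_next_xid))
> {
>     /* all transactions that were committing at the first request have
>      * finished by the time of the second reply */
>     can_advance_nonremovable_xid = true;
> }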
>
>
> Attached is the V8 patch set. Note that I put the new approach for the above
> race condition in a temporary patch "v8-0001_2-Maintain-xxx.patch.txt",
> because the approach may or may not be accepted depending on the discussion
> in the WAL/LSN inversion thread.

I've started reviewing this patch series. I've reviewed only the 0001
patch for now, but let me share some comments:

---
+        if (*phase == DTR_WAIT_FOR_WALSENDER_WAL_POS)
+        {
+                Assert(xid_advance_attempt_time);

What is this assertion for? If we want to check here that we have sent
a request message to the publisher, I think it's clearer to have
"Assert(xid_advance_attempt_time > 0)". I'm not sure we really need
this assertion, though, since it's never false once we set
xid_advance_attempt_time.

---
+                /*
+                 * Do not return here because the apply worker might have
+                 * already applied all changes up to remote_lsn. Instead,
+                 * proceed to the next phase to check if we can immediately
+                 * advance the transaction ID.
+                 */
+                *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+        }

If we always proceed to the next phase, is this phase really
necessary? IIUC, even if we jump to the DTR_WAIT_FOR_LOCAL_FLUSH phase
right after DTR_REQUEST_WALSENDER_WAL_POS and check whether we have
received the remote WAL position in the DTR_WAIT_FOR_LOCAL_FLUSH
phase, it would work fine.
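
In other words, something like this seems to suffice (a rough sketch;
the phase names and remote_lsn follow the patch, the request helper is
illustrative):

        if (*phase == DTR_REQUEST_WALSENDER_WAL_POS)
        {
                /* send the request and move straight to the flush-wait phase */
                request_publisher_wal_pos();
                *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
        }

        if (*phase == DTR_WAIT_FOR_LOCAL_FLUSH)
        {
                /* do nothing until the remote WAL position has arrived */
                if (XLogRecPtrIsInvalid(*remote_lsn))
                        return;
                ...
        }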

---
+                /*
+                 * Reaching here means the remote WAL position has been
+                 * received, and all transactions up to that position on
+                 * the publisher have been applied and flushed locally. So,
+                 * now we can advance the non-removable transaction ID.
+                 */
+                SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+                MyLogicalRepWorker->oldest_nonremovable_xid = candidate_xid;
+                SpinLockRelease(&MyLogicalRepWorker->relmutex);

How about adding a debug log message showing the new
oldest_nonremovable_xid and the related LSN to make
debugging/investigation easier? For example:

elog(LOG, "confirmed remote flush up to %X/%X: new oldest_nonremovable_xid %u",
     LSN_FORMAT_ARGS(*remote_lsn),
     XidFromFullTransactionId(candidate_xid));

---
+                /*
+                 * Exit early if the user has disabled sending messages to the
+                 * publisher.
+                 */
+                if (wal_receiver_status_interval <= 0)
+                        return;

In send_feedback(), we send a feedback message if the publisher
requests one, even if wal_receiver_status_interval is 0. On the other
hand, the above code means that we never send a WAL position request,
and therefore never update oldest_nonremovable_xid, if
wal_receiver_status_interval is 0. I'm concerned that this could be a
pitfall for users.
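
One direction might be to gate only the periodic requests on the
interval, analogous to the 'force' handling in send_feedback() (just a
sketch; the 'force' flag here is hypothetical):

        /* skip only the periodic requests if status messages are disabled */
        if (wal_receiver_status_interval <= 0 && !force)
                return;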

---
% git show | grep update_delete
    This set of patches aims to support the detection of update_deleted conflicts,
    transactions with earlier timestamps than the DELETE, detecting update_delete
    We assume that the appropriate resolution for update_deleted conflicts, to
    that when detecting the update_deleted conflict, and the remote update has a
+ * to detect update_deleted conflicts.
+ * update_deleted is necessary, as the UPDATEs in remote transactions should be
+        * to allow for the detection of update_delete conflicts when applying

'update_delete' and 'update_deleted' are mixed in the commit message
and the code. I think it's better to use 'update_deleted' consistently.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


