Re: Found issues related with logical replication and 2PC - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Found issues related with logical replication and 2PC
Date
Msg-id CAA4eK1+SRn0zOnJqBixJZSNF4V0cp2xw47LSY8yRGcx2QqLJkQ@mail.gmail.com
Whole thread Raw
In response to Found issues related with logical replication and 2PC  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: Found issues related with logical replication and 2PC
List pgsql-hackers
On Wed, Jul 24, 2024 at 12:25 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> While creating a patch which allows ALTER SUBSCRIPTION SET (two_phase) [1],
> we found some issues related with logical replication and two_phase. I think this
> can happen not only HEAD but PG14+, but for now I shared patches for HEAD.
>
> Issue #1
>
> When handling a PREPARE message, the subscriber mistook the wrong lsn position
> (the end position of the last commit) as the end position of the current prepare.
> This can be fixed by adding a new global variable to record the end position of
> the last prepare. 0001 patch fixes the issue.
>
> Issue #2
>
> When the subscriber enables two-phase commit but doesn't set max_prepared_transaction >0
> and a transaction is prepared on the publisher, the apply worker reports an ERROR
> on the subscriber. After that, the prepared transaction is not replayed, which
> means it's lost forever. Attached script can emulate the situation.
>
> --
> ERROR:  prepared transactions are disabled
> HINT:  Set "max_prepared_transactions" to a nonzero value.
> --
>
> The reason is that we advanced the origin progress when aborting the
> transaction as well (RecordTransactionAbort->replorigin_session_advance). So,
> after setting replorigin_session_origin_lsn, if any ERROR happens when preparing
> the transaction, the transaction aborts which incorrectly advances the origin lsn.
>
> An easiest fix is to reset session replication origin before calling the
> RecordTransactionAbort(). I think this can happen when 1) LogicalRepApplyLoop()
> raises an ERROR or 2) apply worker exits. 0002 patch fixes the issue.
>

Can we start a separate thread to issue 2? I understand that this one
is also related to two_phase but since both are different issues it is
better to discuss in separate threads. This will also help us refer to
the discussion in future if required.

BTW, why did the 0002 patch change the below code:
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -164,7 +164,8 @@ typedef struct ParallelApplyWorkerShared

  /*
  * XactLastCommitEnd or XactLastPrepareEnd from the parallel apply worker.
- * This is required by the leader worker so it can update the lsn_mappings.
+ * This is required by the leader worker so it can update the
+ * lsn_mappings.
  */
  XLogRecPtr last_commit_end;

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Found issues related with logical replication and 2PC
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber