Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Single transaction in the tablesync worker?
Date
Msg-id CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g@mail.gmail.com
In response to Re: Single transaction in the tablesync worker?  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> PSA the v18 patch for the Tablesync Solution1.
>

Few comments:
=============
1.
- *   So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *   CATCHUP -> SYNCDONE -> READY.
+ *   So the state progression is always: INIT -> DATASYNC ->
+ *   (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.

I don't think we need to be specific here that the sync worker sets the
FINISHEDCOPY state.

2.
@@ -98,11 +102,16 @@
 #include "miscadmin.h"
 #include "parser/parse_relation.h"
 #include "pgstat.h"
+#include "postmaster/interrupt.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
+#include "replication/slot.h"

I don't think the above includes are required. They seem to be remnants
of the previous approach.

3.
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
- Assert(IsTransactionState());
+ bool sync_done = false;

  SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
+ current_lsn >= MyLogicalRepWorker->relstate_lsn;
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

- if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
- current_lsn >= MyLogicalRepWorker->relstate_lsn)
+ if (sync_done)
  {
  TimeLineID tli;

+ /*
+ * Change state to SYNCDONE.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);

Why do we need these changes? If you have done this for code
readability, then we can consider it as a separate patch, because I
don't see why these changes are required w.r.t. this patch.

4.
- /*
- * To build a slot name for the sync work, we are limited to NAMEDATALEN -
- * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
- * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
- * NAMEDATALEN on the remote that matters, but this scheme will also work
- * reasonably if that is different.)
- */
- StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
- slotname = psprintf("%.*s_%u_sync_%u",
- NAMEDATALEN - 28,
- MySubscription->slotname,
- MySubscription->oid,
- MyLogicalRepWorker->relid);
+ /* Calculate the name of the tablesync slot. */
+ slotname = ReplicationSlotNameForTablesync(
+    MySubscription->oid,
+    MyLogicalRepWorker->relid);

What is the reason for changing the slot name calculation? If there is
a particular reason, then we can add a comment to indicate why we
can't include the subscription's slotname in this calculation.
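
To illustrate the question, here is a minimal sketch of what such a
helper might look like if it derives the name from the subscription OID
and relation OID alone (the function body and format string are my
assumptions for illustration, not quoted from the patch):

#include "postgres.h"
#include "utils/psprintf.h"

/*
 * Hypothetical shape of the helper: note that the subscription's own
 * slotname no longer participates in the calculation, unlike the old
 * psprintf("%.*s_%u_sync_%u", ...) scheme.
 */
static char *
ReplicationSlotNameForTablesync(Oid suboid, Oid relid)
{
    /* e.g. "pg_16394_sync_16385" for suboid 16394, relid 16385 */
    return psprintf("pg_%u_sync_%u", suboid, relid);
}

If something like this is the intent, a comment there explaining why the
remote NAMEDATALEN truncation concern no longer applies would also help.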

5.
This is WAL
+ * logged for for the purpose of recovery. Locks are to prevent the
+ * replication origin from vanishing while advancing.

/for for/for

6.
+ /* Remove the tablesync's origin tracking if exists. */
+ snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
+ originid = replorigin_by_name(originname, true);
+ if (originid != InvalidRepOriginId)
+ {
+ elog(DEBUG1, "DropSubscription: dropping origin tracking for \"%s\"", originname);

I don't think we need this or the DEBUG1 message in
AlterSubscription_refresh. It is fine to print this information for
background workers like the apply worker, but I am not sure we need it
here. DropSubscription drops the apply worker's origin without using
such a DEBUG message, so I guess we don't need it for tablesync origins
either.
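
In other words, something like this sketch of the suggested
simplification (assuming the patch calls replorigin_drop after the
check; the helper name here is hypothetical):

#include "postgres.h"
#include "replication/origin.h"

/*
 * Sketch only: drop the tablesync origin tracking, if any, without
 * emitting a DEBUG1 message, mirroring how DropSubscription handles
 * the apply worker's origin.
 */
static void
tablesync_origin_drop(Oid subid, Oid relid)
{
    char        originname[NAMEDATALEN];
    RepOriginId originid;

    snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
    originid = replorigin_by_name(originname, true /* missing_ok */);
    if (originid != InvalidRepOriginId)
        replorigin_drop(originid, false /* nowait */);
}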

7. Have you tested with the new patch the scenario where we crash after
FINISHEDCOPY and before SYNCDONE? Is it able to pick up the replication
using the new temporary slot? Here, we need to test the case where we
have received a few commits during the catchup phase and then the
tablesync worker crashes/errors out. Basically, check whether the
replication continues from the same point. I understand that this can
only be tested by adding some logs and we might not be able to write a
test for it.
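
For example (purely illustrative; the variable name origin_startpos and
the placement are assumptions about the patch, not something it
contains), a temporary log line in the tablesync worker's startup path
could confirm the restart point after such a crash:

/* Hypothetical debugging aid, not part of the patch: report where the
 * tablesync worker resumes after a restart between FINISHEDCOPY and
 * SYNCDONE. */
elog(LOG, "tablesync for relation %u resuming from %X/%X",
     MyLogicalRepWorker->relid,
     (uint32) (origin_startpos >> 32), (uint32) origin_startpos);

If the LSN logged after the crash matches the origin's last advance, the
catchup is continuing from the right point.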

-- 
With Regards,
Amit Kapila.


