Re: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Date
Msg-id CAHut+Pt2v3dCvWCE-ng4nxS5qU-Go4T=Q2BBnv73SAAczgrqvg@mail.gmail.com
Whole thread Raw
In response to RE: Slow catchup of 2PC (twophase) transactions on replica in LR  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: Slow catchup of 2PC (twophase) transactions on replica in LR
List pgsql-hackers
Hi, Here are some review comments for patch v6-0001

======
Commit message

1.
This patch allows user to alter two_phase option

/allows user/allows the user/

/to alter two_phase option/to alter the 'two_phase' option/

======
doc/src/sgml/ref/alter_subscription.sgml

2.
<literal>two_phase</literal> can be altered only for disabled subscription.

SUGGEST
The <literal>two_phase</literal> parameter can only be altered when
the subscription is disabled.

======
src/backend/access/transam/twophase.c

3. checkGid
+
+/*
+ * checkGid
+ */
+static bool
+checkGid(char *gid, Oid subid)
+{
+ int ret;
+ Oid subid_written,
+ xid;
+
+ ret = sscanf(gid, "pg_gid_%u_%u", &subid_written, &xid);
+
+ if (ret != 2 || subid != subid_written)
+ return false;
+
+ return true;
+}

3a.
The function comment should give more explanation of what it does. I
think this function is the counterpart of the TwoPhaseTransactionGid()
function of worker.c so the comment can say that too.

~

3b.
Indeed, perhaps the function name should be similar to
TwoPhaseTransactionGid. e.g. call it like
IsTwoPhaseTransactionGidForSubid?

~

3c.
Probably 'xid' should be TransactionId instead of Oid.

~

3d.
Why not have a single return?

SUGGESTION
return (ret == 2 && subid = subid_written);

~

3e.
I am wondering if the existing TwoPhaseTransactionGid function
currently in worker.c should be moved here because IMO these 2
functions belong together and twophase.c seems the right place to put
them.

~~~

4.
+LookupGXactBySubid(Oid subid)
+{
+ bool found = false;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+ for (int i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs. */
+ if (gxact->valid && checkGid(gxact->gid, subid))
+ {
+ found = true;
+ break;
+ }
+ }
+ LWLockRelease(TwoPhaseStateLock);
+ return found;
+}

AFAIK the gxact also has the 'xid' available, so won't it be better to
pass BOTH the 'xid' and 'subid' to the checkGid so you can do a full
comparison instead of comparing only the subid part of the gid?

======
src/backend/commands/subscriptioncmds.c

5. AlterSubscription

+ /* XXX */
+ if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
+ {

The "XXX" comment looks like it is meant to say something more...

~~~

6.
+ /*
+ * two_phase can be only changed for disabled
+ * subscriptions
+ */
+ if (form->subenabled)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot set %s for enabled subscription",
+ "two_phase")));

6a.
Should this have a more comprehensive comment giving the reason like
the 'failover' option has?

~~~

6b.
Maybe this should include a "translator" comment to say don't
translate the option name.

~~~

7.
+ /* Check whether the number of prepared transactions */
+ if (!opts.twophase &&
+ form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
+ LookupGXactBySubid(subid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot disable two_phase when uncommitted prepared
transactions present")));
+

7a.
The first comment seems to be an incomplete sentence. I think it
should say something a bit like:
two_phase cannot be disabled if there are any uncommitted prepared
transactions present.

~

7b.
Also, if ereport occurs what is the user supposed to do about it?
Shouldn't the ereport include some errhint with appropriate advice?

~~~

8.
+ /*
+ * The changed failover option of the slot can't be rolled
+ * back.
+ */
+ PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
(two_phase)");
+
+ /* Change system catalog acoordingly */
+ values[Anum_pg_subscription_subtwophasestate - 1] =
+ CharGetDatum(opts.twophase ?
+ LOGICALREP_TWOPHASE_STATE_PENDING :
+ LOGICALREP_TWOPHASE_STATE_DISABLED);
+ replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+ }

Typo I think: /failover option/two_phase option/

======
.../libpqwalreceiver/libpqwalreceiver.c

9.
 static void
 libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname,
- bool failover)
+ bool two_phase, bool failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

======
src/backend/replication/logical/launcher.c

10.
+/*
+ * Stop all the subscription workers.
+ */
+void
+logicalrep_workers_stop(Oid subid)
+{
+ List    *subworkers;
+ ListCell   *lc;
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+ subworkers = logicalrep_workers_find(subid, false);
+ LWLockRelease(LogicalRepWorkerLock);
+ foreach(lc, subworkers)
+ {
+ LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
+
+ logicalrep_worker_stop(w->subid, w->relid);
+ }
+ list_free(subworkers);
+}

I was confused by the logicalrep_workers_find(subid, false). IIUC the
'false' means everything (instead of 'only_running') but then I don't
know why we want to "stop" anything that is NOT running. OTOH I see
that this code was extracted from where it was previously inlined in
subscriptioncmds.c, so maybe the 'false' is necessary for another
reason? At least maybe some explanatory comment is needed for why you
are passing this flag as false?

======
src/backend/replication/logical/worker.c

11.
- /* two-phase should not be altered */
+ /* two-phase should not be altered while the worker exists */
  Assert(newsub->twophasestate == MySubscription->twophasestate);
/should not/cannot/

======
src/backend/replication/slot.c

12.
 void
-ReplicationSlotAlter(const char *name, bool failover)
+ReplicationSlotAlter(const char *name, bool two_phase, bool failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

~~~

13.
+ if (MyReplicationSlot->data.two_phase != two_phase)
+ {
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ MyReplicationSlot->data.two_phase = two_phase;
+ SpinLockRelease(&MyReplicationSlot->mutex);
+
+ update_slot = true;
+ }
+
+
  if (MyReplicationSlot->data.failover != failover)
  {
  SpinLockAcquire(&MyReplicationSlot->mutex);
  MyReplicationSlot->data.failover = failover;
  SpinLockRelease(&MyReplicationSlot->mutex);

+ update_slot = true;
+ }

13a.
Doesn't it make more sense for the whole check/set to be "atomic",
i.e. put the mutex also around the check?

SUGGEST
SpinLockAcquire(&MyReplicationSlot->mutex);
if (MyReplicationSlot->data.two_phase != two_phase)
{
  MyReplicationSlot->data.two_phase = two_phase;
  update_slot = true;
}
SpinLockRelease(&MyReplicationSlot->mutex);

~

Also, (if you agree with the above) why not include both checks
(two_phase and failover) within the same mutex instead of
acquiring/releasing it twice:

SUGGEST
SpinLockAcquire(&MyReplicationSlot->mutex);
if (MyReplicationSlot->data.two_phase != two_phase)
{
  MyReplicationSlot->data.two_phase = two_phase;
  update_slot = true;
}
if (MyReplicationSlot->data.failover != failover)
{
  MyReplicationSlot->data.failover = failover;
  update_slot = true;
}
SpinLockAcquire(&MyReplicationSlot->mutex);

~~~

13b.
There are double blank lines after the first if-block.

======
src/backend/replication/walsender.c

14.
 static void
-ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover)
+ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd,
+   bool *two_phase, bool *failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

======
src/include/replication/walreceiver.h

15.
 typedef void (*walrcv_alter_slot_fn) (WalReceiverConn *conn,
    const char *slotname,
+   bool two_phase,
    bool failover);

Somehow, I feel it is more normal to add the new code (the 'two_phase'
parameter) at the END, instead of into the middle of the existing
parameters. It also keeps it alphabetical which makes it consistent
with other places like the tab-completion code.

This comment about swapping the order (putting new stuff last) will
propagate changes to lots of other related places. I refer to this
comment in a few other places in this post but there are probably more
the same that I missed.

======
src/test/regress/sql/subscription.sql

16.
I know you do this already in the TAP test, but doesn't the test case
to demonstrate that 'two-phase' option can be altered when the
subscription is disabled actually belong here in the regression
instead?

======
src/test/subscription/t/021_twophase.pl

17.
+# Disable the subscription and alter it to two_phase = false,
+# verify that the altered subscription reflects the two_phase option.

/verify/then verify/

~~~

18.
+# Now do a prepare on publisher and make sure that it is not replicated.
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_publisher->safe_psql(
+       'postgres', qq{
+    BEGIN;
+    INSERT INTO tab_copy VALUES (100);
+    PREPARE TRANSACTION 'newgid';
+ });
+

18a.
/on publisher/on the publisher/

18b.
What is that "DROP SUBSCRIPTION tap_sub" doing here? It seems
misplaced under this comment.

~~~

19.
+# Make sure that there is 0 prepared transaction on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is prepared on subscriber');

19a.
SUGGESTION
Make sure there are no prepared transactions on the subscriber

~~~

19b.
/'transaction is prepared on subscriber'/'should be no prepared
transactions on subscriber'/

~~~

20.
+# Made sure that the commited transaction is replicated.

/Made sure/Make sure/

/commited/committed/

~~~

21.
+# Make sure that the two-phase is enabled on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT subtwophasestate FROM pg_subscription WHERE subname = 'tap_sub_copy';"
+);
+is($result, qq(e), 'two-phase is disabled');

The 'two-phase is disabled' is the identical message used in the
opposite case earlier, so something is amiss. Maybe this one should
say 'two-phase should be enabled' and the earlier counterpart should
say 'two-phase should be disabled'.

======
Kind Regards,
Peter Smith
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.
Next
From: Michael Paquier
Date:
Subject: Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?