Re: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Date
Msg-id CAHut+Ptu_w_UCGR-5DbenA+y7wRiA8QPi_ZP=CCJ3SGdTn-==g@mail.gmail.com
Whole thread Raw
In response to RE: Slow catchup of 2PC (twophase) transactions on replica in LR  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: Slow catchup of 2PC (twophase) transactions on replica in LR
List pgsql-hackers
Hi, Here are some review comments for v7-0002

======
Commit message

1.
IIUC there is quite a lot of subtlety and details about why the slot
option needs to be changed only when altering "true" to "false", but
not when altering "false" to "true".

It also should explain why PreventInTransactionBlock is only needed
when altering two_phase "true" to "false".

Please include a commit message to describe all those tricky details.

======
src/backend/commands/subscriptioncmds.c

2. AlterSubscription

- PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
(two_phase)");
+ if (!opts.twophase)
+ PreventInTransactionBlock(isTopLevel,
+   "ALTER SUBSCRIPTION ... SET (two_phase = off)");

IMO this needs a comment to explain why PreventInTransactionBlock is
only needed when changing the 'two_phase' option from on to off.

~~~

3. AlterSubscription

/*
* Try to acquire the connection necessary for altering slot.
*
* This has to be at the end because otherwise if there is an error while
* doing the database operations we won't be able to rollback altered
* slot.
*/
if (replaces[Anum_pg_subscription_subfailover - 1] ||
replaces[Anum_pg_subscription_subtwophasestate - 1])
{
bool must_use_password;
char    *err;
WalReceiverConn *wrconn;
bool failover_needs_to_be_updated;
bool two_phase_needs_to_be_updated;

/* Load the library providing us libpq calls. */
load_file("libpqwalreceiver", false);

/* Try to connect to the publisher. */
must_use_password = sub->passwordrequired && !sub->ownersuperuser;
wrconn = walrcv_connect(sub->conninfo, true, true, must_use_password,
sub->name, &err);
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("could not connect to the publisher: %s", err)));

/*
* Consider which slot option must be altered.
*
* We must alter the failover option whenever subfailover is updated.
* Two_phase, however, is altered only when changing true to false.
*/
failover_needs_to_be_updated =
replaces[Anum_pg_subscription_subfailover - 1];
two_phase_needs_to_be_updated =
(replaces[Anum_pg_subscription_subtwophasestate - 1] &&
!opts.twophase);

PG_TRY();
{
if (two_phase_needs_to_be_updated || failover_needs_to_be_updated)
walrcv_alter_slot(wrconn, sub->slotname,
  failover_needs_to_be_updated ? &opts.failover : NULL,
  two_phase_needs_to_be_updated ? &opts.twophase : NULL);
}
PG_FINALLY();
{
walrcv_disconnect(wrconn);
}
PG_END_TRY();
}

3a.
The block comment "Consider which slot option must be altered..." says
WHEN those options need to be updated, but it doesn't say WHY. e.g.
why only update the 'two_phase' when it is being disabled but not when
it is being enabled? In other words, I think there needs to be more
background/reason details given in this comment.

~~~

3b.
Can't those 2 new variable assignments be done up-front and guard this
entire "if-block" instead of the current replaces[] guarding it? Then
the code is somewhat simplified.

SUGGESTION:
/*
 * <improved comment here to explain these variables>
 */
update_failover = replaces[Anum_pg_subscription_subfailover - 1];
update_two_phase = (replaces[Anum_pg_subscription_subtwophasestate -
1] && !opts.twophase);

/*
 * Try to acquire the connection necessary for altering slot.
 *
 * This has to be at the end because otherwise if there is an error while
 * doing the database operations we won't be able to rollback altered
 * slot.
 */
if (update_failover || update_two_phase)
{
  ...

  /* Load the library providing us libpq calls. */
  load_file("libpqwalreceiver", false);

  /* Try to connect to the publisher. */
  must_use_password = sub->passwordrequired && !sub->ownersuperuser;
  wrconn = walrcv_connect(sub->conninfo, true, true,
must_use_password, sub->name, &err);
  if (!wrconn)
    ereport(ERROR, ...);

  PG_TRY();
  {
    walrcv_alter_slot(wrconn, sub->slotname,
      update_failover ? &opts.failover : NULL,
      update_two_phase ? &opts.twophase : NULL);
  }
  PG_FINALLY();
  {
    walrcv_disconnect(wrconn);
  }
  PG_END_TRY();
}

======
.../libpqwalreceiver/libpqwalreceiver.c

4.
+ appendStringInfo(&cmd, "ALTER_REPLICATION_SLOT %s ( ",
+ quote_identifier(slotname));
+
+ if (failover)
+ appendStringInfo(&cmd, "FAILOVER %s ",
+ (*failover) ? "true" : "false");
+
+ if (two_phase)
+ appendStringInfo(&cmd, "TWO_PHASE %s%s ",
+ (*two_phase) ? "true" : "false",
+ failover ? ", " : "");
+
+ appendStringInfoString(&cmd, ");");

4a.
IIUC the comma logic here was broken in v7 when you swapped the order.
Anyway, IMO it will be better NOT to try combining that comma logic
with the existing appendStringInfo. Doing it separately is both easier
and less error-prone.

Furthermore, the parentheses like "(*two_phase)" instead of just
"*two_phase" seemed a bit overkill.

SUGGESTION:
+ if (failover)
+ appendStringInfo(&cmd, "FAILOVER %s",
+ *failover ? "true" : "false");
+
+   if (failover && two_phase)
+       appendStringInfo(&cmd, ", ");
+
+ if (two_phase)
+ appendStringInfo(&cmd, "TWO_PHASE %s",
+ *two_phase ? "true" : "false");
+
+ appendStringInfoString(&cmd, " );");

~~

4b.
Like I said above, IMO the current separator logic in v7 is broken. So
it is a bit concerning the tests all passed anyway. How did that
happen? I think this indicates that there needs to be an additional
test scenario where both 'failover' and 'two_phase' get altered at the
same time so this code gets exercised properly.

======
src/test/subscription/t/099_twophase_added.pl

5.
+# Define pre-existing tables on both nodes

Why say they are "pre-existing"? They are not pre-existing because you
are creating them right here!

~~~

6.
+######
+# Check the case that prepared transactions exist on publisher node
+######

I think this needs a slightly more detailed comment.

SUGGESTION (this is just an example, but you can surely improve it)

# Check the case that prepared transactions exist on the publisher node.
#
# Since two_phase is "off", then normally this PREPARE will do nothing until
# the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
# before the COMMIT PREPARED happens.

~~~

7.
Maybe this test case needs a few more one-line comments for each of
the sub-steps. e.g.:

# prepare a transaction to insert some rows to the table

# verify the prepared tx is not yet replicated to the subscriber
(because 'two_phase = off')

# toggle the two_phase to 'on' *before* the COMMIT PREPARED

# verify the inserted rows got replicated ok

~~~

8.
IIUC this test will behave the same even if you DON'T do the toggle
'two_phase = on'. So I wonder is there something more you can do to
test this scenario more convincingly?

======
Kind Regards,
Peter Smith
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Next
From: Richard Guo
Date:
Subject: Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.