RE: Slow catchup of 2PC (twophase) transactions on replica in LR - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Date
Msg-id OSBPR01MB2552D3618A0BAB99ECDDD589F5E62@OSBPR01MB2552.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Slow catchup of 2PC (twophase) transactions on replica in LR  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
Dear Peter,

> ======
> Commit message
> 
> 1.
> IIUC there is quite a lot of subtlety and details about why the slot
> option needs to be changed only when altering "true" to "false", but
> not when altering "false" to "true".
> 
> It also should explain why PreventInTransactionBlock is only needed
> when altering two_phase "true" to "false".
> 
> Please include a commit message to describe all those tricky details.

Added.

> ======
> src/backend/commands/subscriptioncmds.c
> 
> 2. AlterSubscription
> 
> - PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
> (two_phase)");
> + if (!opts.twophase)
> + PreventInTransactionBlock(isTopLevel,
> +   "ALTER SUBSCRIPTION ... SET (two_phase = off)");
> 
> IMO this needs a comment to explain why PreventInTransactionBlock is
> only needed when changing the 'two_phase' option from on to off.

Added. Thoutht?

> 3. AlterSubscription
> 
> /*
> * Try to acquire the connection necessary for altering slot.
> *
> * This has to be at the end because otherwise if there is an error while
> * doing the database operations we won't be able to rollback altered
> * slot.
> */
> if (replaces[Anum_pg_subscription_subfailover - 1] ||
> replaces[Anum_pg_subscription_subtwophasestate - 1])
> {
> bool must_use_password;
> char    *err;
> WalReceiverConn *wrconn;
> bool failover_needs_to_be_updated;
> bool two_phase_needs_to_be_updated;
> 
> /* Load the library providing us libpq calls. */
> load_file("libpqwalreceiver", false);
> 
> /* Try to connect to the publisher. */
> must_use_password = sub->passwordrequired && !sub->ownersuperuser;
> wrconn = walrcv_connect(sub->conninfo, true, true, must_use_password,
> sub->name, &err);
> if (!wrconn)
> ereport(ERROR,
> (errcode(ERRCODE_CONNECTION_FAILURE),
> errmsg("could not connect to the publisher: %s", err)));
> 
> /*
> * Consider which slot option must be altered.
> *
> * We must alter the failover option whenever subfailover is updated.
> * Two_phase, however, is altered only when changing true to false.
> */
> failover_needs_to_be_updated =
> replaces[Anum_pg_subscription_subfailover - 1];
> two_phase_needs_to_be_updated =
> (replaces[Anum_pg_subscription_subtwophasestate - 1] &&
> !opts.twophase);
> 
> PG_TRY();
> {
> if (two_phase_needs_to_be_updated || failover_needs_to_be_updated)
> walrcv_alter_slot(wrconn, sub->slotname,
>   failover_needs_to_be_updated ? &opts.failover : NULL,
>   two_phase_needs_to_be_updated ? &opts.twophase : NULL);
> }
> PG_FINALLY();
> {
> walrcv_disconnect(wrconn);
> }
> PG_END_TRY();
> }
> 
> 3a.
> The block comment "Consider which slot option must be altered..." says
> WHEN those options need to be updated, but it doesn't say WHY. e.g.
> why only update the 'two_phase' when it is being disabled but not when
> it is being enabled? In other words, I think there needs to be more
> background/reason details given in this comment.
> 
> ~~~
> 
> 3b.
> Can't those 2 new variable assignments be done up-front and guard this
> entire "if-block" instead of the current replaces[] guarding it? Then
> the code is somewhat simplified.
> 
> SUGGESTION:
> /*
>  * <improved comment here to explain these variables>
>  */
> update_failover = replaces[Anum_pg_subscription_subfailover - 1];
> update_two_phase = (replaces[Anum_pg_subscription_subtwophasestate -
> 1] && !opts.twophase);
> 
> /*
>  * Try to acquire the connection necessary for altering slot.
>  *
>  * This has to be at the end because otherwise if there is an error while
>  * doing the database operations we won't be able to rollback altered
>  * slot.
>  */
> if (update_failover || update_two_phase)
> {
>   ...
> 
>   /* Load the library providing us libpq calls. */
>   load_file("libpqwalreceiver", false);
> 
>   /* Try to connect to the publisher. */
>   must_use_password = sub->passwordrequired && !sub->ownersuperuser;
>   wrconn = walrcv_connect(sub->conninfo, true, true,
> must_use_password, sub->name, &err);
>   if (!wrconn)
>     ereport(ERROR, ...);
> 
>   PG_TRY();
>   {
>     walrcv_alter_slot(wrconn, sub->slotname,
>       update_failover ? &opts.failover : NULL,
>       update_two_phase ? &opts.twophase : NULL);
>   }
>   PG_FINALLY();
>   {
>     walrcv_disconnect(wrconn);
>   }
>   PG_END_TRY();
> }

Two variables were added and comments were updated.

> .../libpqwalreceiver/libpqwalreceiver.c
> 
> 4.
> + appendStringInfo(&cmd, "ALTER_REPLICATION_SLOT %s ( ",
> + quote_identifier(slotname));
> +
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s ",
> + (*failover) ? "true" : "false");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s%s ",
> + (*two_phase) ? "true" : "false",
> + failover ? ", " : "");
> +
> + appendStringInfoString(&cmd, ");");
> 
> 4a.
> IIUC the comma logic here was broken in v7 when you swapped the order.
> Anyway, IMO it will be better NOT to try combining that comma logic
> with the existing appendStringInfo. Doing it separately is both easier
> and less error-prone.
> 
> Furthermore, the parentheses like "(*two_phase)" instead of just
> "*two_phase" seemed a bit overkill.
> 
> SUGGESTION:
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s",
> + *failover ? "true" : "false");
> +
> +   if (failover && two_phase)
> +       appendStringInfo(&cmd, ", ");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s",
> + *two_phase ? "true" : "false");
> +
> + appendStringInfoString(&cmd, " );");

Fixed.

> 4b.
> Like I said above, IMO the current separator logic in v7 is broken. So
> it is a bit concerning the tests all passed anyway. How did that
> happen? I think this indicates that there needs to be an additional
> test scenario where both 'failover' and 'two_phase' get altered at the
> same time so this code gets exercised properly.

Right, it was added.

> ======
> src/test/subscription/t/099_twophase_added.pl
> 
> 5.
> +# Define pre-existing tables on both nodes
> 
> Why say they are "pre-existing"? They are not pre-existing because you
> are creating them right here!

Removed the word.

> 6.
> +######
> +# Check the case that prepared transactions exist on publisher node
> +######
> 
> I think this needs a slightly more detailed comment.
> 
> SUGGESTION (this is just an example, but you can surely improve it)
> 
> # Check the case that prepared transactions exist on the publisher node.
> #
> # Since two_phase is "off", then normally this PREPARE will do nothing until
> # the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
> # before the COMMIT PREPARED happens.

Changed with adjustments.

> 7.
> Maybe this test case needs a few more one-line comments for each of
> the sub-steps. e.g.:
> 
> # prepare a transaction to insert some rows to the table
> 
> # verify the prepared tx is not yet replicated to the subscriber
> (because 'two_phase = off')
> 
> # toggle the two_phase to 'on' *before* the COMMIT PREPARED
> 
> # verify the inserted rows got replicated ok

Modified like yours, but changed based on the suggestion by Grammarly.

> 8.
> IIUC this test will behave the same even if you DON'T do the toggle
> 'two_phase = on'. So I wonder is there something more you can do to
> test this scenario more convincingly?

I found an indicator. When the apply starts, it outputs the current status of
two_phase option. I added wait_for_log() to ensure below appeared. Thought?

```
    ereport(DEBUG1,
            (errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
                             MySubscription->name,
                             MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
                             MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
                             MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
                             "?")));
```
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Slow catchup of 2PC (twophase) transactions on replica in LR