Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Optionally automatically disable logical replication subscriptions on error
Date
Msg-id CAHut+PtrbRA_u8Z13mmA2b+PnUE0hUVkyx+q0dfc8CpBBzzKbw@mail.gmail.com
Whole thread Raw
In response to RE: Optionally automatically disable logical replication subscriptions on error  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
Responses Re: Optionally automatically disable logical replication subscriptions on error  (Masahiko Sawada <sawada.mshk@gmail.com>)
RE: Optionally automatically disable logical replication subscriptions on error  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
List pgsql-hackers
On Tue, Feb 22, 2022 at 3:11 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Monday, February 21, 2022 2:56 PM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > > Thanks for addressing my previous comments. Now I have looked at v19.
> > > >
> > > > On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > >
> > > > > On Friday, February 18, 2022 3:27 PM Peter Smith
> > > > <smithpb2250@gmail.com> wrote:
> > > > > > Hi. Below are my code review comments for v18.
> > > > > Thank you for your review !
> > > > ...
> > > > > > 5. src/backend/replication/logical/worker.c -
> > > > > > DisableSubscriptionOnError
> > > > > >
> > > > > > + /*
> > > > > > + * We would not be here unless this subscription's disableonerr
> > > > > > + field was
> > > > > > + * true when our worker began applying changes, but check
> > > > > > + whether that
> > > > > > + * field has changed in the interim.
> > > > > > + */
> > > > > >
> > > > > > Apparently, this function might just do nothing if it detects
> > > > > > some situation where the flag was changed somehow, but I'm not
> > > > > > 100% sure that the callers are properly catering for when nothing
> > happens.
> > > > > >
> > > > > > IMO it would be better if this function would return true/false
> > > > > > to mean "did disable subscription happen or not?" because that
> > > > > > will give the calling code the chance to check the function
> > > > > > return and do the right thing - e.g. if the caller first thought
> > > > > > it should be disabled but then
> > > > it turned out it did NOT disable...
> > > > > I don't think we need to do something more.
> > > > > After this function, table sync worker and the apply worker just exit.
> > > > > IMO, we don't need to do additional work for already-disabled
> > > > > subscription on the caller sides.
> > > > > It should be sufficient to fulfill the purpose of
> > > > > DisableSubscriptionOnError or confirm it has been fulfilled.
> > > >
> > > > Hmmm -  Yeah, it may be the workers might just exit soon after
> > > > anyhow as you say so everything comes out in the wash, but still, I
> > > > felt for this case when DisableSubscriptionOnError turned out to do
> > > > nothing it would be better to exit via the existing logic. And that is easy to do
> > if the function returns true/false.
> > > >
> > > > For example, changes like below seemed neater code to me. YMMV.
> > > >
> > > > BEFORE (SyncTableStartWrapper):
> > > > if (MySubscription->disableonerr)
> > > > {
> > > > DisableSubscriptionOnError();
> > > > proc_exit(0);
> > > > }
> > > > AFTER
> > > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > > proc_exit(0);
> > > >
> > > > BEFORE (ApplyLoopWrapper)
> > > > if (MySubscription->disableonerr)
> > > > {
> > > > /* Disable the subscription */
> > > > DisableSubscriptionOnError();
> > > > return;
> > > > }
> > > > AFTER
> > > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > > return;
> > > Okay, so this return value works for better readability.
> > > Fixed.
> > >
> > >
> > > > ~~~
> > > >
> > > > Here are a couple more comments:
> > > >
> > > > 1. src/backend/replication/logical/worker.c -
> > > > DisableSubscriptionOnError, Refactor error handling
> > > >
> > > > (this comment assumes the above gets changed too)
> > > I think those are independent.
> >
> > OK. I was only curious if the change #5 above might cause the error to be logged
> > 2x, if the DisableSubscriptionOnError returns false.
> > - firstly, when it logs errors within the function
> > - secondly, by normal error mechanism when the caller re-throws it.
> >
> > But, if you are sure that won't happen then it is good news.
> I didn't feel this would become a substantial issue.
>
> When we alter subscription with disable_on_error = false
> after we go into the DisableSubscriptionOnError,
> we don't disable the subscription in the same function.
> That means we launch new apply workers repeatedly after that
> until we solve the error cause or we set the disable_on_error = true again.
>
> So, if we confirm that the disable_on_error = false in the DisableSubscriptionOnError,
> it's highly possible that we'll get more same original errors from new apply workers.
>
> This leads to another question, we should suppress the 2nd error(if there is),
> even when it's highly possible that we'll get more same errors by new apply workers
> created repeatedly or not. I wasn't sure if the implementation complexity for this wins the log print.
>
> So, kindly let me keep the current code as is.
> If someone wants share his/her opinion on this, please let me know.

OK, but is it really correct that this scenario can happen "When we
alter subscription with disable_on_error = false after we go into the
DisableSubscriptionOnError". Actually, I thought this window may be
much bigger than that - e.g. maybe we changed the option to false at
*any* time after the worker was originally started and the original
option values were got by GetSubscription function (and that might be
hours/days/weeks ago since it started).

>
> > >
> > >
> > > > +static void
> > > > +DisableSubscriptionOnError(void)
> > > > +{
> > > > + Relation rel;
> > > > + bool nulls[Natts_pg_subscription];  bool
> > > > +replaces[Natts_pg_subscription];  Datum
> > > > +values[Natts_pg_subscription];  HeapTuple tup;
> > > > +Form_pg_subscription subform;
> > > > +
> > > > + /* Emit the error */
> > > > + EmitErrorReport();
> > > > + /* Abort any active transaction */ AbortOutOfAnyTransaction();
> > > > + /* Reset the ErrorContext */
> > > > + FlushErrorState();
> > > > +
> > > > + /* Disable the subscription in a fresh transaction */
> > > > + StartTransactionCommand();
> > > >
> > > > If this DisableSubscriptionOnError function decides later that
> > > > actually the 'disableonerr' flag is false (i.e. it's NOT going to
> > > > disable the subscription after
> > > > all) then IMO it make more sense that the error logging for that
> > > > case should just do whatever it is doing now by the normal error processing
> > mechanism.
> > > >
> > > > In other words, I thought perhaps the code to
> > > > EmitErrorReport/FlushError state etc be moved to be BELOW the if
> > > > (!subform->subdisableonerr) bail-out code?
> > > >
> > > > Please see what you think in my attached POC [1]. It seems neater to
> > > > me, and tests are all OK. Maybe I am mistaken...
> > > I had a concern that this order change of codes would have a negative
> > > impact when we have another new error during the call of
> > DisableSubscriptionOnError.
> > >
> > > With the debugger, I raised an error in this function before emitting the
> > original error.
> > > As a result, the original error that makes the apply worker go into
> > > the path of DisableSubscriptionOnError (in my test, duplication error) has
> > vanished.
> > > In this sense, v19 looks safer, and the current order to handle error
> > > recovery first looks better to me.
> > >
> > > FYI, after the 2nd debugger error,
> > > the next new apply worker created quickly met the same type of error,
> > > went into the same path, and disabled the subscription with the log.
> > > But, it won't be advisable to let the possibility left.
> >
> > OK - thanks for checking it.
> >
> > Will it be better to put some comments about that? Something like --
> >
> > BEFORE
> > /* Emit the error */
> > EmitErrorReport();
> > /* Abort any active transaction */
> > AbortOutOfAnyTransaction();
> > /* Reset the ErrorContext */
> > FlushErrorState();
> >
> > /* Disable the subscription in a fresh transaction */
> > StartTransactionCommand();
> >
> > AFTER
> > /* Disable the subscription in a fresh transaction */
> > AbortOutOfAnyTransaction(); StartTransactionCommand();
> >
> > /*
> > * Log the error that caused DisableSubscriptionOnError to be called.
> > * We do this immediately so that it won't be lost if some other internal
> > * error occurs in the following code,
> > */
> > EmitErrorReport();
> > FlushErrorState();
> I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
> The FlushErrorState's comment mentions we are not out of the error subsystem until
> we call this and starting a new transaction before it didn't sound a good idea.
> But, I've fixed the comments around this. The indentation for new comments
> are checked by pgindent.

OK.

======

Here are a couple more review comments for v21.

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that
is the only reason we are getting the subform again.

~~

2. worker.c - subform->oid

+ /* Notify the subscription will be no longer valid */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due
to an error",
+    MySubscription->name));
+
+ LockSharedObject(SubscriptionRelationId, subform->oid, 0,
AccessExclusiveLock);

Can't we just use MySubscription->oid here? We really only needed that
subform to get new option values.

~~

3. worker.c - whitespace

Your pg_indent has also changed some whitespace for parts of worker.c
that are completely unrelated to this patch. You might want to revert
those changes.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Design of pg_stat_subscription_workers vs pgstats
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Patch a potential memory leak in describeOneTableDetails()