Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Optionally automatically disable logical replication subscriptions on error |
Date | |
Msg-id | CAHut+PtrbRA_u8Z13mmA2b+PnUE0hUVkyx+q0dfc8CpBBzzKbw@mail.gmail.com Whole thread Raw |
In response to | RE: Optionally automatically disable logical replication subscriptions on error ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>) |
Responses |
Re: Optionally automatically disable logical replication subscriptions on error
RE: Optionally automatically disable logical replication subscriptions on error |
List | pgsql-hackers |
On Tue, Feb 22, 2022 at 3:11 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > > On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > On Monday, February 21, 2022 2:56 PM Peter Smith > > <smithpb2250@gmail.com> wrote: > > > > Thanks for addressing my previous comments. Now I have looked at v19. > > > > > > > > On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com > > > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > > > > > On Friday, February 18, 2022 3:27 PM Peter Smith > > > > <smithpb2250@gmail.com> wrote: > > > > > > Hi. Below are my code review comments for v18. > > > > > Thank you for your review ! > > > > ... > > > > > > 5. src/backend/replication/logical/worker.c - > > > > > > DisableSubscriptionOnError > > > > > > > > > > > > + /* > > > > > > + * We would not be here unless this subscription's disableonerr > > > > > > + field was > > > > > > + * true when our worker began applying changes, but check > > > > > > + whether that > > > > > > + * field has changed in the interim. > > > > > > + */ > > > > > > > > > > > > Apparently, this function might just do nothing if it detects > > > > > > some situation where the flag was changed somehow, but I'm not > > > > > > 100% sure that the callers are properly catering for when nothing > > happens. > > > > > > > > > > > > IMO it would be better if this function would return true/false > > > > > > to mean "did disable subscription happen or not?" because that > > > > > > will give the calling code the chance to check the function > > > > > > return and do the right thing - e.g. if the caller first thought > > > > > > it should be disabled but then > > > > it turned out it did NOT disable... > > > > > I don't think we need to do something more. > > > > > After this function, table sync worker and the apply worker just exit. > > > > > IMO, we don't need to do additional work for already-disabled > > > > > subscription on the caller sides. > > > > > It should be sufficient to fulfill the purpose of > > > > > DisableSubscriptionOnError or confirm it has been fulfilled. > > > > > > > > Hmmm - Yeah, it may be the workers might just exit soon after > > > > anyhow as you say so everything comes out in the wash, but still, I > > > > felt for this case when DisableSubscriptionOnError turned out to do > > > > nothing it would be better to exit via the existing logic. And that is easy to do > > if the function returns true/false. > > > > > > > > For example, changes like below seemed neater code to me. YMMV. > > > > > > > > BEFORE (SyncTableStartWrapper): > > > > if (MySubscription->disableonerr) > > > > { > > > > DisableSubscriptionOnError(); > > > > proc_exit(0); > > > > } > > > > AFTER > > > > if (MySubscription->disableonerr && DisableSubscriptionOnError()) > > > > proc_exit(0); > > > > > > > > BEFORE (ApplyLoopWrapper) > > > > if (MySubscription->disableonerr) > > > > { > > > > /* Disable the subscription */ > > > > DisableSubscriptionOnError(); > > > > return; > > > > } > > > > AFTER > > > > if (MySubscription->disableonerr && DisableSubscriptionOnError()) > > > > return; > > > Okay, so this return value works for better readability. > > > Fixed. > > > > > > > > > > ~~~ > > > > > > > > Here are a couple more comments: > > > > > > > > 1. src/backend/replication/logical/worker.c - > > > > DisableSubscriptionOnError, Refactor error handling > > > > > > > > (this comment assumes the above gets changed too) > > > I think those are independent. > > > > OK. I was only curious if the change #5 above might cause the error to be logged > > 2x, if the DisableSubscriptionOnError returns false. > > - firstly, when it logs errors within the function > > - secondly, by normal error mechanism when the caller re-throws it. > > > > But, if you are sure that won't happen then it is good news. > I didn't feel this would become a substantial issue. > > When we alter subscription with disable_on_error = false > after we go into the DisableSubscriptionOnError, > we don't disable the subscription in the same function. > That means we launch new apply workers repeatedly after that > until we solve the error cause or we set the disable_on_error = true again. > > So, if we confirm that the disable_on_error = false in the DisableSubscriptionOnError, > it's highly possible that we'll get more same original errors from new apply workers. > > This leads to another question, we should suppress the 2nd error(if there is), > even when it's highly possible that we'll get more same errors by new apply workers > created repeatedly or not. I wasn't sure if the implementation complexity for this wins the log print. > > So, kindly let me keep the current code as is. > If someone wants share his/her opinion on this, please let me know. OK, but is it really correct that this scenario can happen "When we alter subscription with disable_on_error = false after we go into the DisableSubscriptionOnError". Actually, I thought this window may be much bigger than that - e.g. maybe we changed the option to false at *any* time after the worker was originally started and the original option values were got by GetSubscription function (and that might be hours/days/weeks ago since it started). > > > > > > > > > > > +static void > > > > +DisableSubscriptionOnError(void) > > > > +{ > > > > + Relation rel; > > > > + bool nulls[Natts_pg_subscription]; bool > > > > +replaces[Natts_pg_subscription]; Datum > > > > +values[Natts_pg_subscription]; HeapTuple tup; > > > > +Form_pg_subscription subform; > > > > + > > > > + /* Emit the error */ > > > > + EmitErrorReport(); > > > > + /* Abort any active transaction */ AbortOutOfAnyTransaction(); > > > > + /* Reset the ErrorContext */ > > > > + FlushErrorState(); > > > > + > > > > + /* Disable the subscription in a fresh transaction */ > > > > + StartTransactionCommand(); > > > > > > > > If this DisableSubscriptionOnError function decides later that > > > > actually the 'disableonerr' flag is false (i.e. it's NOT going to > > > > disable the subscription after > > > > all) then IMO it make more sense that the error logging for that > > > > case should just do whatever it is doing now by the normal error processing > > mechanism. > > > > > > > > In other words, I thought perhaps the code to > > > > EmitErrorReport/FlushError state etc be moved to be BELOW the if > > > > (!subform->subdisableonerr) bail-out code? > > > > > > > > Please see what you think in my attached POC [1]. It seems neater to > > > > me, and tests are all OK. Maybe I am mistaken... > > > I had a concern that this order change of codes would have a negative > > > impact when we have another new error during the call of > > DisableSubscriptionOnError. > > > > > > With the debugger, I raised an error in this function before emitting the > > original error. > > > As a result, the original error that makes the apply worker go into > > > the path of DisableSubscriptionOnError (in my test, duplication error) has > > vanished. > > > In this sense, v19 looks safer, and the current order to handle error > > > recovery first looks better to me. > > > > > > FYI, after the 2nd debugger error, > > > the next new apply worker created quickly met the same type of error, > > > went into the same path, and disabled the subscription with the log. > > > But, it won't be advisable to let the possibility left. > > > > OK - thanks for checking it. > > > > Will it be better to put some comments about that? Something like -- > > > > BEFORE > > /* Emit the error */ > > EmitErrorReport(); > > /* Abort any active transaction */ > > AbortOutOfAnyTransaction(); > > /* Reset the ErrorContext */ > > FlushErrorState(); > > > > /* Disable the subscription in a fresh transaction */ > > StartTransactionCommand(); > > > > AFTER > > /* Disable the subscription in a fresh transaction */ > > AbortOutOfAnyTransaction(); StartTransactionCommand(); > > > > /* > > * Log the error that caused DisableSubscriptionOnError to be called. > > * We do this immediately so that it won't be lost if some other internal > > * error occurs in the following code, > > */ > > EmitErrorReport(); > > FlushErrorState(); > I appreciate your suggestion. Yet, I'd like to keep the current order of my patch. > The FlushErrorState's comment mentions we are not out of the error subsystem until > we call this and starting a new transaction before it didn't sound a good idea. > But, I've fixed the comments around this. The indentation for new comments > are checked by pgindent. OK. ====== Here are a couple more review comments for v21. ~~~ 1. worker.c - comment + subform = (Form_pg_subscription) GETSTRUCT(tup); + + /* + * We would not be here unless this subscription's disableonerr field was + * true, but check whether that field has changed in the interim. + */ + if (!subform->subdisableonerr) + { + heap_freetuple(tup); + table_close(rel, RowExclusiveLock); + CommitTransactionCommand(); + return false; + } I felt that comment belongs above the subform assignment because that is the only reason we are getting the subform again. ~~ 2. worker.c - subform->oid + /* Notify the subscription will be no longer valid */ + ereport(LOG, + errmsg("logical replication subscription \"%s\" will be disabled due to an error", + MySubscription->name)); + + LockSharedObject(SubscriptionRelationId, subform->oid, 0, AccessExclusiveLock); Can't we just use MySubscription->oid here? We really only needed that subform to get new option values. ~~ 3. worker.c - whitespace Your pg_indent has also changed some whitespace for parts of worker.c that are completely unrelated to this patch. You might want to revert those changes. ------ Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: