RE: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers
From | osumi.takamichi@fujitsu.com |
---|---|
Subject | RE: Optionally automatically disable logical replication subscriptions on error |
Date | |
Msg-id | TYCPR01MB83732011D83A027C32A2802CED6A9@TYCPR01MB8373.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | RE: Optionally automatically disable logical replication subscriptions on error ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>) |
Responses |
Re: Optionally automatically disable logical replication subscriptions on error
|
List | pgsql-hackers |
Thursday, December 2, 2021 4:41 PM I wrote: > On Thursday, December 2, 2021 1:49 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com > > <osumi.takamichi@fujitsu.com> wrote: > > > > > > On Wednesday, December 1, 2021 10:16 PM Amit Kapila > > <amit.kapila16@gmail.com> wrote: > > > Updated the patch to include the notification. > > > > > The patch disables the subscription for non-transient errors. I am not > > sure if we can easily make the call to decide whether any particular > > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY > might not rectify itself. > > Why not just allow to disable the subscription on any error? And then > > let the user check the error either in view or logs and decide whether > > it would like to enable the subscription or do something before it > > (like making space in disk, or fixing the network). > Agreed. I'll treat any errors as the trigger of the feature in the next version. > > > The other problem I see with this transient error stuff is maintaining > > the list of error codes that we think are transient. I think we need a > > discussion for each of the error_codes we are listing now and whatever > > new error_code we add in the future which doesn't seem like a good idea. > This is also true. The maintenance cost of my current implementation didn't > sound cheap. > > > I think the code to deal with apply worker errors and then disable the > > subscription has some flaws. Say, while disabling the subscription if > > it leads to another error then I think the original error won't be > > reported. Can't we simply emit the error via EmitErrorReport and then > > do AbortOutOfAnyTransaction, FlushErrorState, and any other memory > > context clean up if required and then disable the subscription after coming > out of catch? > You are right. I'll fix related parts accordingly. Hi, I've made a new patch v11 that incorporated suggestions described above. There are several notes to share regarding v11 modifications. 1. Modified the commit message a bit. 2. DisableSubscriptionOnError() doesn't return ErrData anymore, since now to emit error message is done in the error recovery area and the function purpose has become purely to run a transaction to disable the subscription. 3. In DisableSubscriptionOnError(), v10 rethrew the error if the disable_on_error flag became false in the interim, but v11 just closes the transaction and finishes the function. 4. If table sync worker detects an error during synchronization and needs to disable the subscription, the worker disables it and just exit by proc_exit. The processing after disabling the subscription didn't look necessary to me for disabled subscription. 5. Only when we succeed in the table synchronization, it's necessary to allocate slot name in long-lived context, after the table synchronization in ApplyWorkerMain(). Otherwise, we'll see junk value of syncslotname because it is the return value of LogicalRepSyncTableStart(). 6. There are 3 places for error recovery in ApplyWorkerMain(). All of those might look similar but I didn't make an united function for them. Those are slightly different from each other and I felt readability is reduced by grouping them into one type of function call. 7. In v11, I covered the case that apply worker failed with apply_error_callback_arg.command == 0, as one path to disable the subscription in order to cover all errors. 8. I changed one flag name from 'disable_subscription' to 'did_error' in ApplyWorkerMain(). 9. All chages in this version are C codes and checked by pgindent. Best Regards, Takamichi Osumi
Attachment
pgsql-hackers by date: