RE: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Optionally automatically disable logical replication subscriptions on error
Date
Msg-id TYCPR01MB83732011D83A027C32A2802CED6A9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Whole thread Raw
In response to RE: Optionally automatically disable logical replication subscriptions on error  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
Responses Re: Optionally automatically disable logical replication subscriptions on error
List pgsql-hackers
Thursday, December 2, 2021 4:41 PM I wrote:
> On Thursday, December 2, 2021 1:49 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Wednesday, December 1, 2021 10:16 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > Updated the patch to include the notification.
> > >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> might not rectify itself.
> > Why not just allow to disable the subscription on any error? And then
> > let the user check the error either in view or logs and decide whether
> > it would like to enable the subscription or do something before it
> > (like making space in disk, or fixing the network).
> Agreed. I'll treat any errors as the trigger of the feature in the next version.
> 
> > The other problem I see with this transient error stuff is maintaining
> > the list of error codes that we think are transient. I think we need a
> > discussion for each of the error_codes we are listing now and whatever
> > new error_code we add in the future which doesn't seem like a good idea.
> This is also true. The maintenance cost of my current implementation didn't
> sound cheap.
> 
> > I think the code to deal with apply worker errors and then disable the
> > subscription has some flaws. Say, while disabling the subscription if
> > it leads to another error then I think the original error won't be
> > reported.  Can't we simply emit the error via EmitErrorReport and then
> > do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
> > context clean up if required and then disable the subscription after coming
> out of catch?
> You are right. I'll fix related parts accordingly.
Hi, I've made a new patch v11 that incorporated suggestions described above.

There are several notes to share regarding v11 modifications.

1. Modified the commit message a bit.

2. DisableSubscriptionOnError() doesn't return ErrData anymore,
since now to emit error message is done in the error recovery area
and the function purpose has become purely to run a transaction to disable
the subscription.

3. In DisableSubscriptionOnError(), v10 rethrew the error if the disable_on_error
flag became false in the interim, but v11 just closes the transaction and
finishes the function.

4. If table sync worker detects an error during synchronization
and needs to disable the subscription, the worker disables it and just exit by proc_exit.
The processing after disabling the subscription didn't look necessary to me
for disabled subscription.

5. Only when we succeed in the table synchronization, it's necessary to
allocate slot name in long-lived context, after the table synchronization in
ApplyWorkerMain(). Otherwise, we'll see junk value of syncslotname
because it is the return value of LogicalRepSyncTableStart().

6. There are 3 places for error recovery in ApplyWorkerMain().
All of those might look similar but I didn't make an united function for them.
Those are slightly different from each other and I felt
readability is reduced by grouping them into one type of function call.

7. In v11, I covered the case that apply worker failed with
apply_error_callback_arg.command == 0, as one path to disable the subscription
in order to cover all errors.

8. I changed one flag name from 'disable_subscription' to 'did_error'
in ApplyWorkerMain().

9. All chages in this version are C codes and checked by pgindent.

Best Regards,
    Takamichi Osumi


Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Some RELKIND macro refactoring
Next
From: Amit Langote
Date:
Subject: Re: pg_get_publication_tables() output duplicate relid