Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Optionally automatically disable logical replication subscriptions on error
Date
Msg-id 0646E5E0-EDF2-4D74-89DD-1C9BF0D7EA67@enterprisedb.com
Whole thread Raw
In response to Re: Optionally automatically disable logical replication subscriptions on error  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers

> On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
> at some point in future and then going into a relaunch loop for
> days/weeks and causing 1000's of errors without the user noticing. In
> that case, this patch seems to be quite useful, but for this goal
> maybe you don't want to be checking the tablesync workers at all, but
> should only be checking the apply worker like your original v1 patch
> did.

Yeah, my motivation was preventing an infinite loop, and providing a clean way for the users to know that replication
theyare waiting for won't ever complete, rather than having to infer that it will never halt.  

> * Is the goal just to be a convenient way to disable the subscription
> during the CREATE SUBSCRIPTION phase so that the user can make
> corrections in peace without the workers re-launching and making more
> error logs?

No.  This is not and never was my motivation.  It's an interesting question, but that idea never crossed my mind.  I'm
notsure what changes somebody would want to make *after* creating the subscription.  Certainly, there may be problems
withhow they have things set up, but they won't know that until the first error happens. 

> Here the patch is helpful, but only for simple scenarios
> like 1 faulty table. Imagine if there are 10 tables (all with PK
> violations at DATASYNC copy) then you will encounter them one at a
> time and have to re-enable the subscription 10 times, after fixing
> each error in turn.

You are assuming disable_on_error=true.  It is false by default.  But ok, let's accept that assumption for the sake of
argument. Now, will you have to manually go through the process 10 times?  I'm not sure.  The user might figure out
theirmistake after seeing the first error. 

> So in this scenario the new option might be more
> of a hindrance than a help because it would be easier if the user just
> did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
> problems in one sitting before re-enabling.

Yeah, but since the new option is off by default, I don't see any sensible complaint.

>
> * etc
>
> //////////
>
> Finally, here is one last (crazy?) thought-bubble just for
> consideration. I might be wrong, but my gut feeling is that the Stats
> Collector is intended more for "tracking" and for "metrics" rather
> than for holding duplicates of logged error messages. At the same
> time, I felt that disabling an entire subscription due to a single
> rogue error might be overkill sometimes.

I'm happy to entertain criticism of the particulars of how my patch approaches this problem, but it is already making a
distinctionbetween transient errors (resources, network, etc.) vs. ones that are non-transient.  Again, I might not
havedrawn the line in the right place, but the patch is not intended to disable subscriptions in response to transient
errors.

> But I wonder if there is a
> way to combine those two ideas so that the Stats Collector gets some
> new counter for tracking the number of worker re-launches that have
> occurred, meanwhile there could be a subscription option which gives a
> threshold above which you would disable the subscription.
> e.g.
> "disable_on_error_threshold=0" default, relaunch forever
> "disable_on_error_threshold=1" disable upon first error encountered.
> (This is how your patch behaves now I think.)
> "disable_on_error_threshold=500" disable if the re-launch errors go
> unattended and happen 500 times.

That sounds like a misfeature to me.  You could have a subscription that works fine for a month, surviving numerous
shortnetwork outages, but then gets autodisabled after a longer network outage.  I'm not sure why anybody would want
that. You might argue for exponential backoff, where it never gets autodisabled on transient errors, but retries less
frequently. But I don't want to expand the scope of this patch to include that, at least not without a lot more
evidencethat it is needed. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Use simplehash.h instead of dynahash in SMgr
Next
From: Bruce Momjian
Date:
Subject: Re: Maintaining a list of pgindent commits for "git blame" to ignore