Thread: Optionally automatically disable logical replication subscriptions on error

Hackers,

Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a
change,triggering an error (such as a constraint violation), exiting with an error written to the subscription worker
log,and restarting. 

As things currently stand, only superusers can create subscriptions.  Ongoing work to delegate superuser tasks to
non-superuserscreates the potential for even more errors to be triggered, specifically, errors where the apply worker
doesnot have permission to make changes to the target table. 

The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error",
suchthat rather than going into an infinite loop, the apply worker will catch errors and automatically disable the
subscription,breaking the loop.  The new parameter defaults to false.  When false, the PG_TRY/PG_CATCH overhead is
avoided,so for subscriptions which do not use the feature, there shouldn't be any change.  Users can manually clear the
errorafter fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.  

In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler.  I
originallyran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker
logfile to determine whether permission failures were occurring and what they were.  It was also obnoxiously easy to
havea test get stuck waiting for a permanently stuck subscription to catch up.  This helps with both issues. 

I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design
changes.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment
On Fri, Jun 18, 2021 at 1:48 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply
achange, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker
log,and restarting. 
>
> As things currently stand, only superusers can create subscriptions.  Ongoing work to delegate superuser tasks to
non-superuserscreates the potential for even more errors to be triggered, specifically, errors where the apply worker
doesnot have permission to make changes to the target table. 
>
> The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error",
suchthat rather than going into an infinite loop, the apply worker will catch errors and automatically disable the
subscription,breaking the loop.  The new parameter defaults to false.  When false, the PG_TRY/PG_CATCH overhead is
avoided,so for subscriptions which do not use the feature, there shouldn't be any change.  Users can manually clear the
errorafter fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command. 
>

I see this idea has merits and it will help users to repair failing
subscriptions. Few points on a quick look at the patch: (a) The patch
seem to be assuming that the error can happen only by the apply worker
but I think the constraint violation can happen via one of the table
sync workers as well, (b) What happens if the error happens when you
are updating the error information in the catalog table. I think
instead of seeing the actual apply time error, the user might see some
other for which it won't be clear what is an appropriate action.

We are also discussing another action like skipping the apply of the
transaction on an error [1]. I think it is better to evaluate both the
proposals as one seems to be an extension of another. Adding
Sawada-San, as he is working on the other proposal.

[1] - https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com

--
With Regards,
Amit Kapila.



On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply
achange, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker
log,and restarting. 
>
> As things currently stand, only superusers can create subscriptions.  Ongoing work to delegate superuser tasks to
non-superuserscreates the potential for even more errors to be triggered, specifically, errors where the apply worker
doesnot have permission to make changes to the target table. 
>
> The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error",
suchthat rather than going into an infinite loop, the apply worker will catch errors and automatically disable the
subscription,breaking the loop.  The new parameter defaults to false.  When false, the PG_TRY/PG_CATCH overhead is
avoided,so for subscriptions which do not use the feature, there shouldn't be any change.  Users can manually clear the
errorafter fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command. 
>
> In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler.  I
originallyran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker
logfile to determine whether permission failures were occurring and what they were.  It was also obnoxiously easy to
havea test get stuck waiting for a permanently stuck subscription to catch up.  This helps with both issues. 
>
> I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design
changes.

I tried your patch.

It applied OK (albeit with whitespace warnings).

The code build and TAP tests are all OK.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs catalogs.sgml

Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?

------

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at same
time as as non-empty "suberrmsg"? This would remove any confusion for
having 2 booleans which both indicate disabled.

------

(3) New columns "disabled_by_error", "disabled_on_error".

All other columns of the pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
  char    *name; /* Name of the subscription */
  Oid owner; /* Oid of the subscription owner */
  bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char    *conninfo; /* Connection string to the publisher */
  char    *slotname; /* Name of the replication slot */
  char    *synccommit; /* Synchronous commit setting for worker */
+ char    *errmsg; /* Message from error which disabled */
+ char    *errhint; /* Hint from error which disabled */
  List    *publications; /* List of publication names to subscribe to */
 } Subscription;

I did not find any code using that newly added member "errhint".

------

(5) dump.c

i. No mention of new columns "disabled_on_error" and
"disabled_by_error". Is that right?

ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?

------

(6) Patch only handles errors only from the Apply worker.

Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disabled_on_error".
(confirmed by observations below)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side. So,

$node_subscriber->init();

////////////


Some Experiments/Observations
==============================

In general, I found this functionality is useful and it works as
advertised by your patch comment.

======

Test: Display pg_subscription with the new columns
Observation: As expected. But some new colnames are not prefixed like
their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | t
disabled_by_error | f
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}

======

Test: Cause a PK violation during normal Apply replication (when
"disabled_on_error=true")
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG:  edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG:  logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG:  background worker "logical
replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+---------------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | f
disable_on_error  | t
disabled_by_error | t
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications   | {tap_pub}

======

Test: Try to enable subscription (without fixing the PK violation problem).
Observation. OK. It just stops again

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG:  logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG:  edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG:  logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG:  background worker "logical
replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.

alter subscription tap_sub disable;

=====

Test: Turn off the disable_on_error
Observation: As expected, now the Apply worker goes into re-launch
forever loop every time it hits PK violation

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION

...

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesyn
re-launchs in a forever loop the same as current functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG:  logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...


-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16399
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | f
disabled_by_error | f
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}


------
Kind Regards,
Peter Smith.
Fujitsu Australia




> On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> (a) The patch
> seem to be assuming that the error can happen only by the apply worker
> but I think the constraint violation can happen via one of the table
> sync workers as well

You are right.  Peter mentioned the same thing, and it is clearly so.  I am working to repair this fault in v2 of the
patch.

> (b) What happens if the error happens when you
> are updating the error information in the catalog table.

I think that is an entirely different kind of error.  The patch attempts to catch errors caused by the user, not by
corefunctionality of the system failing.  If there is a fault that prevents the catalogs from being updated, it is
unclearwhat the patch can do about that. 

> I think
> instead of seeing the actual apply time error, the user might see some
> other for which it won't be clear what is an appropriate action.

Good point.

Before trying to do much of anything with the caught error, the v2 patch logs the error.  If the subsequent efforts to
disablethe subscription fail, at least the logs should contain the initial failure message.  The v1 patch emitted a log
messagemuch further down, and really just intended for debugging the patch itself, with many opportunities for
somethingelse to throw before the log is written. 

> We are also discussing another action like skipping the apply of the
> transaction on an error [1]. I think it is better to evaluate both the
> proposals as one seems to be an extension of another.

Thanks for the link.

I think they are two separate options.  For some users and data patterns, subscriber-side skipping of specific
problematiccommits will be fine.  For other usage patterns, skipping earlier commits will results in more and more data
integrityproblems (foreign key references, etc.) such that the failures will snowball with skipping becoming the norm
ratherthan the exception.  Users with those usage patterns would likely prefer the subscription to automatically be
disableduntil manual intervention can clean up the problem. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







> On Jun 17, 2021, at 11:34 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> I tried your patch.

Thanks for the quick and thorough review!

> (2) New column "disabled_by_error".
>
> I wondered if there was actually any need for this column. Isn't the
> same information conveyed by just having "subenabled" = false, at same
> time as as non-empty "suberrmsg"? This would remove any confusion for
> having 2 booleans which both indicate disabled.

Yeah, I wondered about that before posting v1.  I removed the disabled_by_error field for v2.

> (3) New columns "disabled_by_error", "disabled_on_error".
>
> All other columns of the pg_subscription have a "sub" prefix.

I don't feel strongly about this.  How about "subdisableonerr"?  I used that in v2.

> I did not find any code using that newly added member "errhint".

Thanks for catching that.  I had tried to remove all references to "errhint" before posting v1.  The original idea was
thatboth the message and hint of the error would be kept, but in testing I found the hint field was typically empty, so
Iremoved it.  Sorry that I left one mention of it lying around. 

> (5) dump.c

I didn't bother getting pg_dump working before posting v1, and I still have not done so, as I mainly want to solicit
feedbackon whether the basic direction I am going will work for the community. 

> (6) Patch only handles errors only from the Apply worker.
>
> Tablesync can give similar errors (e.g. PK violation during DATASYNC
> phase) which will trigger re-launch forever regardless of the setting
> of "disabled_on_error".
> (confirmed by observations below)

Yes, this is a good point, and also mentioned by Amit.  I have fixed it in v2 and adjusted the regression test to
triggeran automatic disabling for initial table sync as well as for change replication. 

> 2021-06-18 15:12:45.905 AEST [25904] LOG:  edata is true for
> subscription 'tap_sub': message = "duplicate key value violates unique
> constraint "test_tab_pkey"", hint = "<NONE>"

You didn't call this out, but FYI, I don't intend to leave this particular log message in the patch.  It was for
developmentonly.  I have removed it for v2 and have added a different log message much sooner after catching the error,
toavoid squashing the error in case some other action fails. 

The regression test shows this, if you open tmp_check/log/022_disable_on_error_subscriber.log:

2021-06-18 16:25:20.138 PDT [56926] LOG:  logical replication subscription "s1" will be disabled due to error:
duplicatekey value violates unique constraint "s1_tbl_unique" 
2021-06-18 16:25:20.139 PDT [56926] ERROR:  duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL:  Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT:  COPY tbl, line 2

The first line logs the error prior to attempting to disable the subscription, and the next three lines are due to
rethrowingthe error after committing the successful disabling of the subscription.  If the attempt to disable the
subscriptionitself throws, these additional three lines won't show up, but the first one should.  Amit mentioned this
upthread. Do you think this will be ok, or would you like to also have a suberrdetail field so that the detail doesn't
getlost?  I haven't added such an extra field, and am inclined to think it would be excessive, but maybe others feel
differently?


> ======
>
> Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
> (when disable_on_error = true)
> Observation: This patch changes nothing for this case. The Tablesyn
> re-launchs in a forever loop the same as current functionality.

In v2, tablesync copy errors should also be caught.  The test has been extended to cover this also.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment
On Sat, Jun 19, 2021 at 1:06 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > On Jun 17, 2021, at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > We are also discussing another action like skipping the apply of the
> > transaction on an error [1]. I think it is better to evaluate both the
> > proposals as one seems to be an extension of another.
>
> Thanks for the link.
>
> I think they are two separate options.
>

Right, but there are things that could be common from the design
perspective. For example, why is it mandatory to update this conflict
( error) information in the system catalog instead of displaying it
via some stats view? Also, why not also log the xid of the failed
transaction?

-- 
With Regards,
Amit Kapila.




> On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Right, but there are things that could be common from the design
> perspective.

I went to reconcile my patch with that from [1] only to discover there is no patch on that thread.  Is there one in
progressthat I can see? 

I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1]
becominga reality and I don't want to entirely hitch this patch to that effort.  This can be committed with or without
anysolution to the idea in [1].  The original motivation for this patch was that the TAP tests don't have a great way
todeal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on
[2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it
simple.

> For example, why is it mandatory to update this conflict
> ( error) information in the system catalog instead of displaying it
> via some stats view?

The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't
requireany independent touching of the catalogs.  Likewise, the catalog must be updated to re-enable the subscription,
soclearing the error from that same row doesn't require any independent touching of the catalogs. 

The error information does not *need* to be included in the catalog, but placing the information in any location that
won'tsurvive server restart leaves the user no information about why the subscription got disabled after a restart (or
crash+ restart) happens. 

Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to determine
ifthe subscription was automatically disabled, not having the information in the catalog would make it ambiguous
whetherthe subscription was manually or automatically disabled. 

> Also, why not also log the xid of the failed
> transaction?

We could also do that.  Reading [1], it seems you are overly focused on user-facing xids.  The errdetail in the
examplesI've been using for testing, and the one mentioned in [1], contain information about the conflicting data.  I
thinkusers are more likely to understand that a particular primary key value cannot be replicated because it is not
uniquethan to understand that a particular xid cannot be replicated.  (Likewise for permissions errors.)  For example: 

2021-06-18 16:25:20.139 PDT [56926] ERROR:  duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL:  Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT:  COPY tbl, line 2

This tells the user what they need to clean up before they can continue.  Telling them which xid tried to apply the
change,but not the change itself or the conflict itself, seems rather unhelpful.  So at best, the xid is an additional
pieceof information.  I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack
oneof those two fields.  Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog. 

For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1] to
usethe xid as a handle.  But that seems like an odd interface to me.  Imagine that a transaction on the publisher side
inserteda batch of data, and only a subset of that data conflicts on the subscriber side.  What advantage is there in
skippingthe entire transaction?  Wouldn't the user rather skip just the problematic rows?  I understand that on the
subscriberside it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to
allowthe user to filter out data that is problematic rather than filtering out xids that are problematic, and the
filtershouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or
rewriteit before inserting or change the pre-existing conflicting data prior to applying the problematic data or
whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have
alreadyadded an xid field which ultimately nobody wants. 

[1] - https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com

[2] -
https://www.postgresql.org/message-id/flat/915B995D-1D79-4E0A-BD8D-3B267925FCE9%40enterprisedb.com#dbbce39c9e460183b67ee44b647b1209
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







> On Jun 19, 2021, at 7:44 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Wouldn't the user rather skip just the problematic rows?  I understand that on the subscriber side it is difficult to
doso, but if you are going to implement this sort of thing, it makes more sense to allow the user to filter out data
thatis problematic rather than filtering out xids that are problematic, and the filter shouldn't just be an in-or-out
filter,but rather a mapping function that can redirect the data someplace else or rewrite it before inserting or change
thepre-existing conflicting data prior to applying the problematic data or whatever. 

Thinking about this some more, it seems my patch already sets the stage for this sort of thing.

We could extend the concept of triggers to something like ErrorTriggers that could be associated with subscriptions.  I
alreadyhave the code catching errors for subscriptions where disable_on_error is true.  We could use that same code
pathfor subscriptions that have one or more BEFORE or AFTER ErrorTriggers defined.  We could pass the trigger all the
errorcontext information along with the row and subscription information, and allow the trigger to either modify the
databeing replicated or make modifications to the table being changed.  I think having support for both BEFORE and
AFTERwould be important, as a common design pattern might be to move aside the conflicting rows in the BEFORE trigger,
thenreconcile and merge them back into the table in the AFTER trigger.  If the xid still cannot be replicated after one
attemptusing the triggers, the second attempt to disable the subscription instead. 

There are a lot of details to consider, but to my mind this idea is much more user friendly than the idea that users
shouldmuck about with xids for arbitrarily many conflicting transactions. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Sat, Jun 19, 2021 at 11:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Right, but there are things that could be common from the design
> > perspective.
>
> I went to reconcile my patch with that from [1] only to discover there is no patch on that thread.  Is there one in
progressthat I can see? 

I will submit the patch.

>
> I don't mind trying to reconcile this patch with what you're discussing in [1], but I am a bit skeptical about [1]
becominga reality and I don't want to entirely hitch this patch to that effort.  This can be committed with or without
anysolution to the idea in [1].  The original motivation for this patch was that the TAP tests don't have a great way
todeal with a subscription getting into a fail-retry infinite loop, which makes it harder for me to make progress on
[2]. That doesn't absolve me of the responsibility of making this patch a good one, but it does motivate me to keep it
simple.

There was a discussion that the skipping transaction patch would also
need to have a feature that tells users the details of the last
failure transaction such as its XID, timestamp, action etc. In that
sense, those two patches might need the common infrastructure that the
apply workers leave the error details somewhere so that the users can
see it.

> > For example, why is it mandatory to update this conflict
> > ( error) information in the system catalog instead of displaying it
> > via some stats view?
>
> The catalog must be updated to disable the subscription, so placing the error information in the same row doesn't
requireany independent touching of the catalogs.  Likewise, the catalog must be updated to re-enable the subscription,
soclearing the error from that same row doesn't require any independent touching of the catalogs. 
>
> The error information does not *need* to be included in the catalog, but placing the information in any location that
won'tsurvive server restart leaves the user no information about why the subscription got disabled after a restart (or
crash+ restart) happens. 
>
> Furthermore, since v2 removed the "disabled_by_error" field in favor of just using subenabled + suberrmsg to
determineif the subscription was automatically disabled, not having the information in the catalog would make it
ambiguouswhether the subscription was manually or automatically disabled. 

Is it really useful to write only error message to the system catalog?
Even if we see the error message like "duplicate key value violates
unique constraint “test_tab_pkey”” on the system catalog, we will end
up needing to check the server log for details to properly resolve the
conflict. If the user wants to know whether the subscription is
disabled manually or automatically, the error message on the system
catalog might not necessarily be necessary.

> For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1]
touse the xid as a handle.  But that seems like an odd interface to me.  Imagine that a transaction on the publisher
sideinserted a batch of data, and only a subset of that data conflicts on the subscriber side.  What advantage is there
inskipping the entire transaction?  Wouldn't the user rather skip just the problematic rows?  I understand that on the
subscriberside it is difficult to do so, but if you are going to implement this sort of thing, it makes more sense to
allowthe user to filter out data that is problematic rather than filtering out xids that are problematic, and the
filtershouldn't just be an in-or-out filter, but rather a mapping function that can redirect the data someplace else or
rewriteit before inserting or change the pre-existing conflicting data prior to applying the problematic data or
whatever. That's a huge effort, of course, but if the idea in [1] goes in that direction, I don't want my patch to have
alreadyadded an xid field which ultimately nobody wants. 
>

The feature discussed in that thread is meant to be a repair tool for
the subscription in emergency cases when something that should not
have happened happened. I guess that resolving row (or column) level
conflict should be done in another way, for example, by defining
policies for each type of conflict.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




> On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I will submit the patch.

Great, thanks!

> There was a discussion that the skipping transaction patch would also
> need to have a feature that tells users the details of the last
> failure transaction such as its XID, timestamp, action etc. In that
> sense, those two patches might need the common infrastructure that the
> apply workers leave the error details somewhere so that the users can
> see it.

Right.  Subscription on error triggers would need that, too, if we wrote them.

> Is it really useful to write only error message to the system catalog?
> Even if we see the error message like "duplicate key value violates
> unique constraint “test_tab_pkey”” on the system catalog, we will end
> up needing to check the server log for details to properly resolve the
> conflict. If the user wants to know whether the subscription is
> disabled manually or automatically, the error message on the system
> catalog might not necessarily be necessary.

We can put more information in there.  I don't feel strongly about it.  I'll wait for your patch to see what
infrastructureyou need. 

> The feature discussed in that thread is meant to be a repair tool for
> the subscription in emergency cases when something that should not
> have happened happened. I guess that resolving row (or column) level
> conflict should be done in another way, for example, by defining
> policies for each type of conflict.

I understand that is the idea, but I'm having trouble believing it will work that way in practice.  If somebody has a
subscriptionthat has gone awry, what reason do we have to believe there will only be one transaction that will need to
bemanually purged?  It seems just as likely that there would be a million transactions that need to be purged, and
creatingan interface for users to manually review them and keep or discard on a case by case basis seems unworkable.
Sure,you might have specific cases where the number of transactions to purge is small, but I don't like designing the
featurearound that assumption. 

All the same, I'm looking forward to seeing your patch!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






On Sat, Jun 19, 2021 at 8:14 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 19, 2021, at 3:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Also, why not also log the xid of the failed
> > transaction?
>
> We could also do that.  Reading [1], it seems you are overly focused on user-facing xids.  The errdetail in the
examplesI've been using for testing, and the one mentioned in [1], contain information about the conflicting data.  I
thinkusers are more likely to understand that a particular primary key value cannot be replicated because it is not
uniquethan to understand that a particular xid cannot be replicated.  (Likewise for permissions errors.)  For example: 
>
> 2021-06-18 16:25:20.139 PDT [56926] ERROR:  duplicate key value violates unique constraint "s1_tbl_unique"
> 2021-06-18 16:25:20.139 PDT [56926] DETAIL:  Key (i)=(1) already exists.
> 2021-06-18 16:25:20.139 PDT [56926] CONTEXT:  COPY tbl, line 2
>
> This tells the user what they need to clean up before they can continue.  Telling them which xid tried to apply the
change,but not the change itself or the conflict itself, seems rather unhelpful.  So at best, the xid is an additional
pieceof information.  I'd rather have both the ERROR and DETAIL fields above and not the xid than have the xid and lack
oneof those two fields.  Even so, I have not yet included the DETAIL field because I didn't want to bloat the catalog. 
>

I never said that we don't need the error information. I think we need
xid along with other things.

> For the problem in [1], having the xid is more important than it is in my patch, because the user is expected in [1]
touse the xid as a handle.  But that seems like an odd interface to me.  Imagine that a transaction on the publisher
sideinserted a batch of data, and only a subset of that data conflicts on the subscriber side.  What advantage is there
inskipping the entire transaction?  Wouldn't the user rather skip just the problematic rows? 
>

I think skipping some changes but not others can make the final
transaction data inconsistent. Say, we have a case where, in a
transaction after insert, there is an update or delete on the same
row, then we might silently skip such updates/deletes unless the same
row is already present in the subscriber. I think skipping the entire
transaction based on user instruction would be safer than skipping
some changes that lead to an error.

--
With Regards,
Amit Kapila.



On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I will submit the patch.
>
> Great, thanks!
>
> > There was a discussion that the skipping transaction patch would also
> > need to have a feature that tells users the details of the last
> > failure transaction such as its XID, timestamp, action etc. In that
> > sense, those two patches might need the common infrastructure that the
> > apply workers leave the error details somewhere so that the users can
> > see it.
>
> Right.  Subscription on error triggers would need that, too, if we wrote them.
>
> > Is it really useful to write only error message to the system catalog?
> > Even if we see the error message like "duplicate key value violates
> > unique constraint “test_tab_pkey”” on the system catalog, we will end
> > up needing to check the server log for details to properly resolve the
> > conflict. If the user wants to know whether the subscription is
> > disabled manually or automatically, the error message on the system
> > catalog might not necessarily be necessary.
> >

I think the two key points are (a) to define exactly what all
information is required to be logged on error, (b) where do we want to
store the information based on requirements. I see that for (b) Mark
is inclined to use the existing catalog table. I feel that is worth
considering but not sure if that is the best way to deal with it. For
example, if we store that information in the catalog, we might need to
consider storing it both in pg_subscription and pg_subscription_rel,
otherwise, we might overwrite the errors as I think what is happening
in the currently proposed patch. The other possibilities could be to
define a new catalog table to capture the error information or log the
required information via stats collector and then the user can see
that info via some stats view.

>
> We can put more information in there.  I don't feel strongly about it.  I'll wait for your patch to see what
infrastructureyou need. 
>
> > The feature discussed in that thread is meant to be a repair tool for
> > the subscription in emergency cases when something that should not
> > have happened happened. I guess that resolving row (or column) level
> > conflict should be done in another way, for example, by defining
> > policies for each type of conflict.
>
> I understand that is the idea, but I'm having trouble believing it will work that way in practice.  If somebody has a
subscriptionthat has gone awry, what reason do we have to believe there will only be one transaction that will need to
bemanually purged? 
>

Because currently, we don't proceed after an error unless it is
resolved. Why do you think there could be multiple such transactions?

--
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> >
> > > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I will submit the patch.
> >
> > Great, thanks!
> >
> > > There was a discussion that the skipping transaction patch would also
> > > need to have a feature that tells users the details of the last
> > > failure transaction such as its XID, timestamp, action etc. In that
> > > sense, those two patches might need the common infrastructure that the
> > > apply workers leave the error details somewhere so that the users can
> > > see it.
> >
> > Right.  Subscription on error triggers would need that, too, if we wrote them.
> >
> > > Is it really useful to write only error message to the system catalog?
> > > Even if we see the error message like "duplicate key value violates
> > > unique constraint “test_tab_pkey”” on the system catalog, we will end
> > > up needing to check the server log for details to properly resolve the
> > > conflict. If the user wants to know whether the subscription is
> > > disabled manually or automatically, the error message on the system
> > > catalog might not necessarily be necessary.
> > >
>
> I think the two key points are (a) to define exactly what all
> information is required to be logged on error,

When it comes to the patch for skipping transactions, it would
somewhat depend on how users specify transactions to skip. On the
other hand, for this patch, the minimal information would be whether
the subscription is disabled automatically by the server.

> (b) where do we want to
> store the information based on requirements. I see that for (b) Mark
> is inclined to use the existing catalog table. I feel that is worth
> considering but not sure if that is the best way to deal with it. For
> example, if we store that information in the catalog, we might need to
> consider storing it both in pg_subscription and pg_subscription_rel,
> otherwise, we might overwrite the errors as I think what is happening
> in the currently proposed patch. The other possibilities could be to
> define a new catalog table to capture the error information or log the
> required information via stats collector and then the user can see
> that info via some stats view.

This point is also related to the point whether or not that
information needs to last after the server crash (and restart). When
it comes to the patch for skipping transactions, there was a
discussion that we don’t necessarily need it since the tools will be
used in rare cases. But for this proposed patch, I guess it would be
useful if it does. It might be worth considering doing a different way
for each patch. For example, we send the details of last failure
transaction to the stats collector while updating subenabled to
something like “automatically-disabled” instead of to just “false” (or
using another column to show the subscriber is disabled automatically
by the server).

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




> On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Because currently, we don't proceed after an error unless it is
> resolved. Why do you think there could be multiple such transactions?

Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could
addnon-unique data that then fails to apply on the subscriber.  My patch took the view that the user should figure out
howto get the subscriber side consistent with the publisher side, but if you instead take the approach that problematic
commitsshould be skipped, it would seem that arbitrarily many such transactions could be committed on the publisher
side.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






On Mon, Jun 21, 2021 at 10:24 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Because currently, we don't proceed after an error unless it is
> > resolved. Why do you think there could be multiple such transactions?
>
> Just as one example, if the subscriber has a unique index that the publisher lacks, any number of transactions could
addnon-unique data that then fails to apply on the subscriber.
 
>

Then also it will fail on the first such conflict, so even without
your patch, the apply worker corresponding to the subscription won't
be able to proceed after the first error, it won't lead to multiple
failing xids. However, I see a different case where there could be
multiple failing xids and that can happen during initial table sync
where multiple workers failed due to some error. I am not sure your
patch would be able to capture all such failed transactions because
you are recording this information in pg_subscription and not in
pg_subscription_rel.

-- 
With Regards,
Amit Kapila.




> On Jun 20, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> (a) to define exactly what all
> information is required to be logged on error, (b) where do we want to
> store the information based on requirements.

I'm not sure it has to be stored anywhere durable.  I have a patch in the works to do something like:

create function foreign_key_insert_violation_before() returns conflict_trigger as $$
BEGIN
    RAISE NOTICE 'elevel: %', TG_ELEVEL:
    RAISE NOTICE 'sqlerrcode: %', TG_SQLERRCODE:
    RAISE NOTICE 'message: %', TG_MESSAGE:
    RAISE NOTICE 'detail: %', TG_DETAIL:
    RAISE NOTICE 'detail_log: %', TG_DETAIL_LOG:
    RAISE NOTICE 'hint: %', TG_HINT:
    RAISE NOTICE 'schema: %', TG_SCHEMA_NAME:
    RAISE NOTICE 'table: %', TG_TABLE_NAME:
    RAISE NOTICE 'column: %', TG_COLUMN_NAME:
    RAISE NOTICE 'datatype: %', TG_DATATYPE_NAME:
    RAISE NOTICE 'constraint: %', TG_CONSTRAINT_NAME:

        -- do something useful to prepare for retry of transaction
        -- which raised a foreign key violation
END
$$ language plpgsql;

create function foreign_key_insert_violation_after() returns conflict_trigger as $$
BEGIN
        -- do something useful to cleanup after retry of transaction
        -- which raised a foreign key violation
END
$$ language plpgsql;

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
   before foreign_key_violation
   when tag in ('INSERT')
   execute procedure foreign_key_insert_violation_before();

create conflict trigger regress_conflict_trigger_insert on regress_conflictsub
   after foreign_key_violation
   when tag in ('INSERT')
   execute procedure foreign_key_insert_violation_after();

The idea is that, for subscriptions that have conflict triggers defined, the apply will be wrapped in a
PG_TRY()/PG_CATCH()block.  If it fails, the ErrorData will be copied in the ConflictTriggerContext, and then the
transactionwill be attempted again, but this time with any BEFORE and AFTER triggers applied.  The triggers could then
returna special result indicating whether the transaction should be permanently skipped, applied, or whatever.  None of
thedata needs to be stored anywhere non-transient, as it just gets handed to the triggers. 

I think the other patch is a subset of this functionality, as using this system to create triggers which query a table
containingtransactions to be skipped would be enough to get the functionality you've been discussing.  But this system
couldalso do other things, like modify data.  Admittedly, this is akin to a statement level trigger and not a row level
trigger,so a number of things you might want to do would be hard to do from this.  But perhaps the equivalent of row
leveltriggers could also be written? 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







> On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Then also it will fail on the first such conflict, so even without
> your patch, the apply worker corresponding to the subscription won't
> be able to proceed after the first error, it won't lead to multiple
> failing xids.

I'm not sure we're talking about the same thing.  I'm saying that if the user is expected to clear each error manually,
therecould be many such errors for them to clear.  It may be true that the second error doesn't occur on the subscriber
sideuntil after the first is cleared, but that still leaves the user having to clear one after the next until
arbitrarilymany of them coming from the publisher side are cleared. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







> On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> However, I see a different case where there could be
> multiple failing xids and that can happen during initial table sync
> where multiple workers failed due to some error. I am not sure your
> patch would be able to capture all such failed transactions because
> you are recording this information in pg_subscription and not in
> pg_subscription_rel.

Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong.
Mypatch was designed around the idea that the user would need to figure out how to fix the subscriber side prior to
re-enablingthe subscription.  As such, I wasn't bothered with trying to store everything, just enough to give the user
aclue where to look.  I don't mind if you want to store more information, and maybe that needs to be stored somewhere
else. Do you believe pg_subscription_rel is a suitable location?  

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






On Mon, Jun 21, 2021 at 9:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jun 21, 2021 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 21, 2021 at 7:56 AM Mark Dilger
> > <mark.dilger@enterprisedb.com> wrote:
> > >
> > > > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I will submit the patch.
> > >
> > > Great, thanks!
> > >
> > > > There was a discussion that the skipping transaction patch would also
> > > > need to have a feature that tells users the details of the last
> > > > failure transaction such as its XID, timestamp, action etc. In that
> > > > sense, those two patches might need the common infrastructure that the
> > > > apply workers leave the error details somewhere so that the users can
> > > > see it.
> > >
> > > Right.  Subscription on error triggers would need that, too, if we wrote them.
> > >
> > > > Is it really useful to write only error message to the system catalog?
> > > > Even if we see the error message like "duplicate key value violates
> > > > unique constraint “test_tab_pkey”” on the system catalog, we will end
> > > > up needing to check the server log for details to properly resolve the
> > > > conflict. If the user wants to know whether the subscription is
> > > > disabled manually or automatically, the error message on the system
> > > > catalog might not necessarily be necessary.
> > > >
> >
> > I think the two key points are (a) to define exactly what all
> > information is required to be logged on error,
>
> When it comes to the patch for skipping transactions, it would
> somewhat depend on how users specify transactions to skip. On the
> other hand, for this patch, the minimal information would be whether
> the subscription is disabled automatically by the server.
>

True, but still there will be some information related to ERROR which
we wanted the user to see unless we ask them to refer to logs for
that.

> > (b) where do we want to
> > store the information based on requirements. I see that for (b) Mark
> > is inclined to use the existing catalog table. I feel that is worth
> > considering but not sure if that is the best way to deal with it. For
> > example, if we store that information in the catalog, we might need to
> > consider storing it both in pg_subscription and pg_subscription_rel,
> > otherwise, we might overwrite the errors as I think what is happening
> > in the currently proposed patch. The other possibilities could be to
> > define a new catalog table to capture the error information or log the
> > required information via stats collector and then the user can see
> > that info via some stats view.
>
> This point is also related to the point whether or not that
> information needs to last after the server crash (and restart). When
> it comes to the patch for skipping transactions, there was a
> discussion that we don’t necessarily need it since the tools will be
> used in rare cases. But for this proposed patch, I guess it would be
> useful if it does. It might be worth considering doing a different way
> for each patch. For example, we send the details of last failure
> transaction to the stats collector while updating subenabled to
> something like “automatically-disabled” instead of to just “false” (or
> using another column to show the subscriber is disabled automatically
> by the server).
>

I agree that it is worth considering to have subenabled to have a
tri-state (enable, disabled, automatically-disabled) value instead of
just a boolean. But in this case, if the stats collector missed
updating the information, the user may have to manually update the
subscription and let the error happen again to see it.

--
With Regards,
Amit Kapila.



On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 20, 2021, at 10:11 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > However, I see a different case where there could be
> > multiple failing xids and that can happen during initial table sync
> > where multiple workers failed due to some error. I am not sure your
> > patch would be able to capture all such failed transactions because
> > you are recording this information in pg_subscription and not in
> > pg_subscription_rel.
>
> Right, I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went
wrong. My patch was designed around the idea that the user would need to figure out how to fix the subscriber side
priorto re-enabling the subscription.  As such, I wasn't bothered with trying to store everything, just enough to give
theuser a clue where to look. 
>

Okay, but the clue will be pretty random because you might end up just
logging one out of several errors.

>  I don't mind if you want to store more information, and maybe that needs to be stored somewhere else.  Do you
believepg_subscription_rel is a suitable location? 
>
It won't be sufficient to store information in either
pg_subscription_rel or pg_susbscription. I think if we want to store
the required information in a catalog then we need to define a new
catalog (pg_subscription_conflicts or something like that) with
information corresponding to each rel in subscription (srsubid oid
(Reference to subscription), srrelid oid (Reference to relation),
<columns for error_info>). OTOH, we can choose to send the error
information to stats collector which will then be available via stat
view and update system catalog to disable the subscription but there
will be a risk that we might send info of failed transaction to stats
collector but then fail to update system catalog to disable the
subscription.

--
With Regards,
Amit Kapila.



On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>
> >  I don't mind if you want to store more information, and maybe that needs to be stored somewhere else.  Do you
believepg_subscription_rel is a suitable location?
 
> >
> It won't be sufficient to store information in either
> pg_subscription_rel or pg_susbscription. I think if we want to store
> the required information in a catalog then we need to define a new
> catalog (pg_subscription_conflicts or something like that) with
> information corresponding to each rel in subscription (srsubid oid
> (Reference to subscription), srrelid oid (Reference to relation),
> <columns for error_info>). OTOH, we can choose to send the error
> information to stats collector which will then be available via stat
> view and update system catalog to disable the subscription but there
> will be a risk that we might send info of failed transaction to stats
> collector but then fail to update system catalog to disable the
> subscription.
>

I think we should store the input from the user (like disable_on_error
flag or xid to skip) in the system catalog pg_subscription and send
the error information (subscrtion_id, rel_id, xid of failed xact,
error_code, error_message, etc.) to the stats collector which can be
used to display such information via a stat view.

The disable_on_error flag handling could be that on error it sends the
required error info to stats collector and then updates the subenabled
in pg_subscription. In rare conditions, where we are able to send the
message but couldn't update the subenabled info in pg_subscription
either due to some error or server restart, the apply worker would
again try to apply the same change and would hit the same error again
which I think should be fine because it will ultimately succeed.

The skip xid handling will also be somewhat similar where on an error,
we will send the error information to stats collector which will be
displayed via stats view. Then the user is expected to ask for skip
xid (Alter Subscription ... SKIP <xid_value>) based on information
displayed via stat view. Now, the apply worker can skip changes from
such a transaction, and then during processing of commit record of the
skipped transaction, it should update xid to invalid value, so that
next time that shouldn't be used. I think it is important to update
xid to an invalid value as part of the skipped transaction because
otherwise, after the restart, we won't be able to decide whether we
still want to skip the xid stored for a subscription.

-- 
With Regards,
Amit Kapila.



Much of the discussion above seems to be related to where to store the
error information and how much information is needed to be useful.

As a summary, the 5 alternatives I have seen mentioned are:

#1. Store some simple message in the pg_subscription ("I wasn't trying
to capture everything, just enough to give the user a reasonable
indication of what went wrong" [Mark-1]). Storing the error message
was also seen as a convenience for writing TAP tests ("I originally
ran into the motivation to write this patch when frustrated that TAP
tests needed to parse the apply worker log file" [Mark-2}). It also
can sometimes provide a simple clue for the error (e.g. PK violation
for table TBL) but still the user will have to look elsewhere for
details to resolve the error. So while this implementation seems good
for simple scenarios, it appears to have been shot down because the
non-trivial scenarios either have insufficient or wrong information in
the error message. Some DETAILS could have been added to give more
information but that would maybe bloat the catalog ("I have not yet
included the DETAIL field because I didn't want to bloat the catalog."
[Mark-3])

#2. Similarly another idea was to use another existing catalog table
pg_subscription_rel. This could have the same problems ("It won't be
sufficient to store information in either pg_subscription_rel or
pg_susbscription." [Amit-1])

#3. There is another suggestion to use the Stats Collector to hold the
error message [Amit-2]. For me, this felt like blurring too much the
distinction between "stats tracking/metrics" and "logs". ERROR logs
must be flushed, whereas for stats (IIUC) there is no guarantee that
everything you need to see would be present. Indeed Amit wrote "But in
this case, if the stats collector missed updating the information, the
user may have to manually update the subscription and let the error
happen again to see it." [Amit-3]. Requesting the user to cause the
same error again just in case it was not captured a first time seems
too strange to me.

#4. The next idea was to have an entirely new catalog for holding the
subscription error information. I feel that storing/duplicating lots
of error information in another table seems like a bridge too far.
What about the risks of storing incorrect or sufficient information?
What is the advantage of duplicating errors over just referring to the
log files for ERROR details?

#5. Document to refer to the logs. All ERROR details are already in
the logs, and this seems to me the intuitive place to look for them.
Searching for specific errors becomes difficult programmatically (is
this really a problem other than complex TAP tests?). But here there
is no risk of missing or insufficient information captured in the log
files ("but still there will be some information related to ERROR
which we wanted the user to see unless we ask them to refer to logs
for that." [Amit-4}).

---

My preferred alternative is #5. ERRORs are logged in the log file, so
there is nothing really for this patch to do in this regard (except
documentation), and there is no risk of missing any information, no
ambiguity of having duplicated errors, and it is the intuitive place
the user would look.

So I felt current best combination is just this:
a) A tri-state indicating the state of the subscription: e.g.
something like "enabled" ('e')/ "disabled" ('d') / "auto-disabled"
('a') [Amit-5]
b) For "auto-disabled" the PG docs would be updated tell the user to
check the logs to resolve the problem before re-enabling the
subscription

//////////

IMO it is not made exactly clear to me what is the main goal of this
patch. Because of this, I feel that you can't really judge if this new
option is actually useful or not except only in hindsight. It seems
like whatever you implement can be made to look good or bad, just by
citing different test scenarios.

e.g.

* Is the goal mainly to help automated (TAP) testing? In that case,
then maybe you do want to store the error message somewhere other than
the log files. But still I wonder if results would be unpredictable
anyway - e.g if there are multiple tables all with errors then it
depends on the tablesync order of execution which error you see caused
the auto-disable, right? And if it is not predictable maybe it is less
useful.

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
at some point in future and then going into a relaunch loop for
days/weeks and causing 1000's of errors without the user noticing. In
that case, this patch seems to be quite useful, but for this goal
maybe you don't want to be checking the tablesync workers at all, but
should only be checking the apply worker like your original v1 patch
did.

* Is the goal just to be a convenient way to disable the subscription
during the CREATE SUBSCRIPTION phase so that the user can make
corrections in peace without the workers re-launching and making more
error logs? Here the patch is helpful, but only for simple scenarios
like 1 faulty table. Imagine if there are 10 tables (all with PK
violations at DATASYNC copy) then you will encounter them one at a
time and have to re-enable the subscription 10 times, after fixing
each error in turn. So in this scenario the new option might be more
of a hindrance than a help because it would be easier if the user just
did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
problems in one sitting before re-enabling.

* etc

//////////

Finally, here is one last (crazy?) thought-bubble just for
consideration. I might be wrong, but my gut feeling is that the Stats
Collector is intended more for "tracking" and for "metrics" rather
than for holding duplicates of logged error messages. At the same
time, I felt that disabling an entire subscription due to a single
rogue error might be overkill sometimes. But I wonder if there is a
way to combine those two ideas so that the Stats Collector gets some
new counter for tracking the number of worker re-launches that have
occurred, meanwhile there could be a subscription option which gives a
threshold above which you would disable the subscription.
e.g.
"disable_on_error_threshold=0" default, relaunch forever
"disable_on_error_threshold=1" disable upon first error encountered.
(This is how your patch behaves now I think.)
"disable_on_error_threshold=500" disable if the re-launch errors go
unattended and happen 500 times.

------
[Mark-1] https://www.postgresql.org/message-id/A539C848-670E-454F-B31C-82D3CBE9F5AC%40enterprisedb.com
[Mark-2] https://www.postgresql.org/message-id/DB35438F-9356-4841-89A0-412709EBD3AB%40enterprisedb.com
[Mark-3] https://www.postgresql.org/message-id/DE7E13B7-DC76-416A-A98F-3BC3F80E6BE9%40enterprisedb.com
[Amit-1]
https://www.postgresql.org/message-id/CAA4eK1K_JFSFrAkr_fgp3VX6hTSmjK%3DwNs4Tw8rUWHGp0%2BBsaw%40mail.gmail.com
[Amit-2] https://www.postgresql.org/message-id/CAA4eK1%2BNoRbYSH1J08zi4OJ_EUMcjmxTwnmwVqZ6e_xzS0D6VA%40mail.gmail.com
[Amit-3]
https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-4]
https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-5]
https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




> On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> #5. Document to refer to the logs. All ERROR details are already in
> the logs, and this seems to me the intuitive place to look for them.

My original motivation came from writing TAP tests to check that the permissions systems would properly deny the apply
workerwhen running under a non-superuser role.  The idea is that the user with the responsibility for managing
subscriptionswon't have enough privilege to read the logs.  Whatever information that user needs (if any) must be
someplaceelse. 

> Searching for specific errors becomes difficult programmatically (is
> this really a problem other than complex TAP tests?).

I believe there is a problem, because I remain skeptical that these errors will be both existent and rare.  Either
you'veconfigured your system correctly and you get zero of these, or you've misconfigured it and you get some non-zero
numberof them.  I don't see any reason to assume that number will be small. 

The best way to deal with that is to be able to tell the system what to do with them, like "if the error has this error
codeand the error message matches this regular expression, then do this, else do that."  That's why I think allowing
triggersto be created on subscriptions makes the most sense (though is probably the hardest system being proposed so
far.)

> But here there
> is no risk of missing or insufficient information captured in the log
> files ("but still there will be some information related to ERROR
> which we wanted the user to see unless we ask them to refer to logs
> for that." [Amit-4}).

Not only is there a problem if the user doesn't have permission to view the logs, but also, if we automatically disable
thesubscription until the error is manually cleared, the logs might be rotated out of existence before the user takes
anyaction.  In that case, the logs will be entirely missing, and not even the error message will remain.  At least with
thepatch I submitted, the error message will remain, though I take Amit's point that there are deficiencies in handling
paralleltablesync workers, etc. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







> On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> * Is the goal mainly to help automated (TAP) testing?

Absolutely, that was my original motivation.  But I don't think that is the primary reason the patch would be accepted.
There is a cost to having the logical replication workers attempt ad infinitum to apply a transaction that will never
apply.

Also, if you are waiting for a subscription to catch up, it is far from obvious that you will wait forever.

> In that case,
> then maybe you do want to store the error message somewhere other than
> the log files. But still I wonder if results would be unpredictable
> anyway - e.g if there are multiple tables all with errors then it
> depends on the tablesync order of execution which error you see caused
> the auto-disable, right? And if it is not predictable maybe it is less
> useful.

But if you are writing a TAP test, you should be the one controlling whether that is the case.  I don't think it would
beunpredictable from the point of view of the test author. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Mon, Jun 21, 2021 at 7:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger
> > <mark.dilger@enterprisedb.com> wrote:
> >
> > >  I don't mind if you want to store more information, and maybe that needs to be stored somewhere else.  Do you
believepg_subscription_rel is a suitable location?
 
> > >
> > It won't be sufficient to store information in either
> > pg_subscription_rel or pg_susbscription. I think if we want to store
> > the required information in a catalog then we need to define a new
> > catalog (pg_subscription_conflicts or something like that) with
> > information corresponding to each rel in subscription (srsubid oid
> > (Reference to subscription), srrelid oid (Reference to relation),
> > <columns for error_info>). OTOH, we can choose to send the error
> > information to stats collector which will then be available via stat
> > view and update system catalog to disable the subscription but there
> > will be a risk that we might send info of failed transaction to stats
> > collector but then fail to update system catalog to disable the
> > subscription.
> >
>
> I think we should store the input from the user (like disable_on_error
> flag or xid to skip) in the system catalog pg_subscription and send
> the error information (subscrtion_id, rel_id, xid of failed xact,
> error_code, error_message, etc.) to the stats collector which can be
> used to display such information via a stat view.
>
> The disable_on_error flag handling could be that on error it sends the
> required error info to stats collector and then updates the subenabled
> in pg_subscription. In rare conditions, where we are able to send the
> message but couldn't update the subenabled info in pg_subscription
> either due to some error or server restart, the apply worker would
> again try to apply the same change and would hit the same error again
> which I think should be fine because it will ultimately succeed.
>
> The skip xid handling will also be somewhat similar where on an error,
> we will send the error information to stats collector which will be
> displayed via stats view. Then the user is expected to ask for skip
> xid (Alter Subscription ... SKIP <xid_value>) based on information
> displayed via stat view. Now, the apply worker can skip changes from
> such a transaction, and then during processing of commit record of the
> skipped transaction, it should update xid to invalid value, so that
> next time that shouldn't be used. I think it is important to update
> xid to an invalid value as part of the skipped transaction because
> otherwise, after the restart, we won't be able to decide whether we
> still want to skip the xid stored for a subscription.

Sounds reasonable.

The feature that sends the error information to the stats collector is
a common feature for both and itself is also useful. As discussed in
that skip transaction patch thread, it would also be good if we write
error information (relation, action, xid, etc) to the server log too.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Tue, Jun 22, 2021 at 6:27 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> #3. There is another suggestion to use the Stats Collector to hold the
> error message [Amit-2]. For me, this felt like blurring too much the
> distinction between "stats tracking/metrics" and "logs". ERROR logs
> must be flushed, whereas for stats (IIUC) there is no guarantee that
> everything you need to see would be present. Indeed Amit wrote "But in
> this case, if the stats collector missed updating the information, the
> user may have to manually update the subscription and let the error
> happen again to see it." [Amit-3]. Requesting the user to cause the
> same error again just in case it was not captured a first time seems
> too strange to me.
>

I don't think it will often be the case that the stats collector will
miss updating the information. I am not feeling comfortable storing
error information in system catalogs. We have some other views which
capture somewhat similar conflict information
(pg_stat_database_conflicts) or failed transactions information. So, I
thought here we are extending the similar concept by storing some
additional information about errors.

-- 
With Regards,
Amit Kapila.




> On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
> at some point in future and then going into a relaunch loop for
> days/weeks and causing 1000's of errors without the user noticing. In
> that case, this patch seems to be quite useful, but for this goal
> maybe you don't want to be checking the tablesync workers at all, but
> should only be checking the apply worker like your original v1 patch
> did.

Yeah, my motivation was preventing an infinite loop, and providing a clean way for the users to know that replication
theyare waiting for won't ever complete, rather than having to infer that it will never halt.  

> * Is the goal just to be a convenient way to disable the subscription
> during the CREATE SUBSCRIPTION phase so that the user can make
> corrections in peace without the workers re-launching and making more
> error logs?

No.  This is not and never was my motivation.  It's an interesting question, but that idea never crossed my mind.  I'm
notsure what changes somebody would want to make *after* creating the subscription.  Certainly, there may be problems
withhow they have things set up, but they won't know that until the first error happens. 

> Here the patch is helpful, but only for simple scenarios
> like 1 faulty table. Imagine if there are 10 tables (all with PK
> violations at DATASYNC copy) then you will encounter them one at a
> time and have to re-enable the subscription 10 times, after fixing
> each error in turn.

You are assuming disable_on_error=true.  It is false by default.  But ok, let's accept that assumption for the sake of
argument. Now, will you have to manually go through the process 10 times?  I'm not sure.  The user might figure out
theirmistake after seeing the first error. 

> So in this scenario the new option might be more
> of a hindrance than a help because it would be easier if the user just
> did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
> problems in one sitting before re-enabling.

Yeah, but since the new option is off by default, I don't see any sensible complaint.

>
> * etc
>
> //////////
>
> Finally, here is one last (crazy?) thought-bubble just for
> consideration. I might be wrong, but my gut feeling is that the Stats
> Collector is intended more for "tracking" and for "metrics" rather
> than for holding duplicates of logged error messages. At the same
> time, I felt that disabling an entire subscription due to a single
> rogue error might be overkill sometimes.

I'm happy to entertain criticism of the particulars of how my patch approaches this problem, but it is already making a
distinctionbetween transient errors (resources, network, etc.) vs. ones that are non-transient.  Again, I might not
havedrawn the line in the right place, but the patch is not intended to disable subscriptions in response to transient
errors.

> But I wonder if there is a
> way to combine those two ideas so that the Stats Collector gets some
> new counter for tracking the number of worker re-launches that have
> occurred, meanwhile there could be a subscription option which gives a
> threshold above which you would disable the subscription.
> e.g.
> "disable_on_error_threshold=0" default, relaunch forever
> "disable_on_error_threshold=1" disable upon first error encountered.
> (This is how your patch behaves now I think.)
> "disable_on_error_threshold=500" disable if the re-launch errors go
> unattended and happen 500 times.

That sounds like a misfeature to me.  You could have a subscription that works fine for a month, surviving numerous
shortnetwork outages, but then gets autodisabled after a longer network outage.  I'm not sure why anybody would want
that. You might argue for exponential backoff, where it never gets autodisabled on transient errors, but retries less
frequently. But I don't want to expand the scope of this patch to include that, at least not without a lot more
evidencethat it is needed. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






On Mon, Jun 21, 2021 at 4:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> I think we should store the input from the user (like disable_on_error
> flag or xid to skip) in the system catalog pg_subscription and send
> the error information (subscrtion_id, rel_id, xid of failed xact,
> error_code, error_message, etc.) to the stats collector which can be
> used to display such information via a stat view.
>
> The disable_on_error flag handling could be that on error it sends the
> required error info to stats collector and then updates the subenabled
> in pg_subscription. In rare conditions, where we are able to send the
> message but couldn't update the subenabled info in pg_subscription
> either due to some error or server restart, the apply worker would
> again try to apply the same change and would hit the same error again
> which I think should be fine because it will ultimately succeed.
>
> The skip xid handling will also be somewhat similar where on an error,
> we will send the error information to stats collector which will be
> displayed via stats view. Then the user is expected to ask for skip
> xid (Alter Subscription ... SKIP <xid_value>) based on information
> displayed via stat view. Now, the apply worker can skip changes from
> such a transaction, and then during processing of commit record of the
> skipped transaction, it should update xid to invalid value, so that
> next time that shouldn't be used. I think it is important to update
> xid to an invalid value as part of the skipped transaction because
> otherwise, after the restart, we won't be able to decide whether we
> still want to skip the xid stored for a subscription.
>

One minor detail I missed in the above sketch for skipped transaction
feature was that actually we only need replication origin state from
the commit record of the skipped transaction and then I think we need
to start a transaction, update the xid value to invalid, set the
replication origin state and commit that transaction.

-- 
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I will submit the patch.
>
> Great, thanks!

I've submitted the patches on that thread[1]. There are three patches:
skipping the transaction on the subscriber side, reporting error
details in the errcontext, and reporting the error details to the
stats collector. Feedback is very welcome.

[1] https://www.postgresql.org/message-id/CAD21AoBU4jGEO6AXcykQ9y7tat0RrB5--8ZoJgfcj%2BLPs7nFZQ%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, June 28, 2021 1:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> > > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > I will submit the patch.
> >
> > Great, thanks!
> 
> I've submitted the patches on that thread[1]. There are three patches:
> skipping the transaction on the subscriber side, reporting error details in the
> errcontext, and reporting the error details to the stats collector. Feedback is
> very welcome.
> 
> [1]
> https://www.postgresql.org/message-id/CAD21AoBU4jGEO6AXcykQ9y7tat0R
> rB5--8ZoJgfcj%2BLPs7nFZQ%40mail.gmail.com
Hi, thanks Sawada-san for keep updating the skip xid patch in the thread.

This thread has stopped since the patch submission.
I've rebased the 'disable_on_error' option
so that it can be applied on top of skip xid shared in [1].
I've written Mark Dilger as the original author in the commit message.

This patch is simply rebased to reactive this thread.
So there are still pending item to discuss for example,
how we should deal with multiple errors of several table sync workers.

I extracted only 'disable_on_error' option
because the skip xid and the latest error message fulfill the motivation
to make it easy to write TAP tests already I felt.

[1] - https://www.postgresql.org/message-id/CAD21AoDY-9_x819F_m1_wfCVXXFJrGiSmR2MfC9Nw4nW8Om0qA%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Attachment
On Tue, Nov 2, 2021 at 4:12 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, June 28, 2021 1:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Mon, Jun 21, 2021 at 11:26 AM Mark Dilger
> > <mark.dilger@enterprisedb.com> wrote:
> > > > On Jun 20, 2021, at 7:17 PM, Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I will submit the patch.
> > >
> > > Great, thanks!
> >
> > I've submitted the patches on that thread[1]. There are three patches:
> > skipping the transaction on the subscriber side, reporting error details in the
> > errcontext, and reporting the error details to the stats collector. Feedback is
> > very welcome.
> >
> > [1]
> > https://www.postgresql.org/message-id/CAD21AoBU4jGEO6AXcykQ9y7tat0R
> > rB5--8ZoJgfcj%2BLPs7nFZQ%40mail.gmail.com
> Hi, thanks Sawada-san for keep updating the skip xid patch in the thread.
>
> This thread has stopped since the patch submission.
> I've rebased the 'disable_on_error' option
> so that it can be applied on top of skip xid shared in [1].
> I've written Mark Dilger as the original author in the commit message.
>
> This patch is simply rebased to reactive this thread.
> So there are still pending item to discuss for example,
> how we should deal with multiple errors of several table sync workers.
>
> I extracted only 'disable_on_error' option
> because the skip xid and the latest error message fulfill the motivation
> to make it easy to write TAP tests already I felt.
>

Thanks for the updated patch. Please create a Commitfest entry for
this. It will help to have a look at CFBot results for the patch, also
if required rebase and post a patch on top of Head.

Regards,
Vignesh



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com> wrote:
> Thanks for the updated patch. Please create a Commitfest entry for this. It will
> help to have a look at CFBot results for the patch, also if required rebase and
> post a patch on top of Head.
As requested, created a new entry for this - [1]

FYI: the skip xid patch has been updated to v20 in [2]
but the v3 for disable_on_error is not affected by this update
and still applicable with no regression.

[1] - https://commitfest.postgresql.org/36/3407/
[2] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com> wrote:
> > Thanks for the updated patch. Please create a Commitfest entry for this. It will
> > help to have a look at CFBot results for the patch, also if required rebase and
> > post a patch on top of Head.
> As requested, created a new entry for this - [1]
>
> FYI: the skip xid patch has been updated to v20 in [2]
> but the v3 for disable_on_error is not affected by this update
> and still applicable with no regression.
>
> [1] - https://commitfest.postgresql.org/36/3407/
> [2] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com
>

I had a look at this patch and have a couple of initial review
comments for some issues I spotted:

src/backend/commands/subscriptioncmds.c
(1) bad array entry assignment
The following code block added by the patch assigns
"values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in
it being always set to true, rather than the specified option value:

+  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))
+  {
+    values[Anum_pg_subscription_subdisableonerr - 1]
+       = BoolGetDatum(opts.disableonerr);
+     values[Anum_pg_subscription_subdisableonerr - 1]
+       = true;
+  }

The 2nd line is meant to instead be
"replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
(compare to handling for other similar options)

src/backend/replication/logical/worker.c
(2) unreachable code?
In the patch code there seems to be some instances of unreachable code
after re-throwing errors:

e.g.

+ /* If we caught an error above, disable the subscription */
+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }

+ else
+ {
+   PG_RE_THROW();
+   MemoryContextSwitchTo(ecxt);
+ }


+ if (disable_subscription)
+ {
+   ReThrowError(DisableSubscriptionOnError(cctx));
+   MemoryContextSwitchTo(ecxt);
+ }

I'm guessing it was intended to do the "MemoryContextSwitch(ecxt);"
before re-throwing (?), but it's not really clear, as in the 1st and
3rd cases, the DisableSubscriptionOnError() calls anyway immediately
switch the memory context to cctx.


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:

On Wed, Nov 10, 2021 at 3:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I had a look at this patch and have a couple of initial review
> comments for some issues I spotted:
>

Incidentally, I found that the v3 patch only applies after the skip xid v20 patch [1] has been applied.

[2] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com


Regards,
Greg Nancarrow
Fujitsu Australia

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, November 10, 2021 1:23 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com>
> wrote:
> > > Thanks for the updated patch. Please create a Commitfest entry for
> > > this. It will help to have a look at CFBot results for the patch,
> > > also if required rebase and post a patch on top of Head.
> > As requested, created a new entry for this - [1]
> >
> > FYI: the skip xid patch has been updated to v20 in [2] but the v3 for
> > disable_on_error is not affected by this update and still applicable
> > with no regression.
> >
> > [1] - https://commitfest.postgresql.org/36/3407/
> > [2] -
> >
> https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2B
> EUHbZ
> > k8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com
> 
> I had a look at this patch and have a couple of initial review comments for some
> issues I spotted:
Thank you for checking it.


> src/backend/commands/subscriptioncmds.c
> (1) bad array entry assignment
> The following code block added by the patch assigns
> "values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in it
> being always set to true, rather than the specified option value:
> 
> +  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))  {
> +    values[Anum_pg_subscription_subdisableonerr - 1]
> +       = BoolGetDatum(opts.disableonerr);
> +     values[Anum_pg_subscription_subdisableonerr - 1]
> +       = true;
> +  }
> 
> The 2nd line is meant to instead be
> "replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
> (compare to handling for other similar options)
Oops, fixed.
 
> src/backend/replication/logical/worker.c
> (2) unreachable code?
> In the patch code there seems to be some instances of unreachable code after
> re-throwing errors:
> 
> e.g.
> 
> + /* If we caught an error above, disable the subscription */ if
> + (disable_subscription) {
> +   ReThrowError(DisableSubscriptionOnError(cctx));
> +   MemoryContextSwitchTo(ecxt);
> + }
> 
> + else
> + {
> +   PG_RE_THROW();
> +   MemoryContextSwitchTo(ecxt);
> + }
> 
> 
> + if (disable_subscription)
> + {
> +   ReThrowError(DisableSubscriptionOnError(cctx));
> +   MemoryContextSwitchTo(ecxt);
> + }
> 
> I'm guessing it was intended to do the "MemoryContextSwitch(ecxt);"
> before re-throwing (?), but it's not really clear, as in the 1st and 3rd cases, the
> DisableSubscriptionOnError() calls anyway immediately switch the memory
> context to cctx.
You are right I think.
Fixed based on an idea below.

After an error happens, for some additional work
(e.g. to report the stats of table sync/apply worker
by pgstat_report_subworker_error() or
to update the catalog by DisableSubscriptionOnError())
restore the memory context that is used before the error (cctx)
and save the old memory context of error (ecxt). Then,
do the additional work and switch the memory context to the ecxt
just before the rethrow. As you described, 
in contrast to PG_RE_THROW, DisableSubscriptionOnError() changes
the memory context immediatedly at the top of it,
so for this case, I don't call the MemoryContextSwitchTo().

Another important thing as my modification
is a case when LogicalRepApplyLoop failed and
apply_error_callback_arg.command == 0. In the original
patch of skip xid, it just calls PG_RE_THROW()
but my previous v3 codes missed this macro in this case.
Therefore, I've fixed this part as well.

C codes are checked by pgindent.

Note that this depends on the v20 skip xide patch in [1]

[1] - https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2BEUHbZk8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Attachment
On Thu, Nov 11, 2021 at 2:50 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, November 10, 2021 1:23 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > On Wed, Nov 10, 2021 at 12:26 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Monday, November 8, 2021 10:15 PM vignesh C <vignesh21@gmail.com>
> > wrote:
> > > > Thanks for the updated patch. Please create a Commitfest entry for
> > > > this. It will help to have a look at CFBot results for the patch,
> > > > also if required rebase and post a patch on top of Head.
> > > As requested, created a new entry for this - [1]
> > >
> > > FYI: the skip xid patch has been updated to v20 in [2] but the v3 for
> > > disable_on_error is not affected by this update and still applicable
> > > with no regression.
> > >
> > > [1] - https://commitfest.postgresql.org/36/3407/
> > > [2] -
> > >
> > https://www.postgresql.org/message-id/CAD21AoAT42mhcqeB1jPfRL1%2B
> > EUHbZ
> > > k8MMY_fBgsyZvJeKNpG%2Bw%40mail.gmail.com
> >
> > I had a look at this patch and have a couple of initial review comments for some
> > issues I spotted:
> Thank you for checking it.
>
>
> > src/backend/commands/subscriptioncmds.c
> > (1) bad array entry assignment
> > The following code block added by the patch assigns
> > "values[Anum_pg_subscription_subdisableonerr - 1]" twice, resulting in it
> > being always set to true, rather than the specified option value:
> >
> > +  if (IsSet(opts.specified_opts, SUBOPT_DISABLE_ON_ERR))  {
> > +    values[Anum_pg_subscription_subdisableonerr - 1]
> > +       = BoolGetDatum(opts.disableonerr);
> > +     values[Anum_pg_subscription_subdisableonerr - 1]
> > +       = true;
> > +  }
> >
> > The 2nd line is meant to instead be
> > "replaces[Anum_pg_subscription_subdisableonerr - 1] = true".
> > (compare to handling for other similar options)
> Oops, fixed.
>
> > src/backend/replication/logical/worker.c
> > (2) unreachable code?
> > In the patch code there seems to be some instances of unreachable code after
> > re-throwing errors:
> >
> > e.g.
> >
> > + /* If we caught an error above, disable the subscription */ if
> > + (disable_subscription) {
> > +   ReThrowError(DisableSubscriptionOnError(cctx));
> > +   MemoryContextSwitchTo(ecxt);
> > + }
> >
> > + else
> > + {
> > +   PG_RE_THROW();
> > +   MemoryContextSwitchTo(ecxt);
> > + }
> >
> >
> > + if (disable_subscription)
> > + {
> > +   ReThrowError(DisableSubscriptionOnError(cctx));
> > +   MemoryContextSwitchTo(ecxt);
> > + }
> >
> > I'm guessing it was intended to do the "MemoryContextSwitch(ecxt);"
> > before re-throwing (?), but it's not really clear, as in the 1st and 3rd cases, the
> > DisableSubscriptionOnError() calls anyway immediately switch the memory
> > context to cctx.
> You are right I think.
> Fixed based on an idea below.
>
> After an error happens, for some additional work
> (e.g. to report the stats of table sync/apply worker
> by pgstat_report_subworker_error() or
> to update the catalog by DisableSubscriptionOnError())
> restore the memory context that is used before the error (cctx)
> and save the old memory context of error (ecxt). Then,
> do the additional work and switch the memory context to the ecxt
> just before the rethrow. As you described,
> in contrast to PG_RE_THROW, DisableSubscriptionOnError() changes
> the memory context immediatedly at the top of it,
> so for this case, I don't call the MemoryContextSwitchTo().
>
> Another important thing as my modification
> is a case when LogicalRepApplyLoop failed and
> apply_error_callback_arg.command == 0. In the original
> patch of skip xid, it just calls PG_RE_THROW()
> but my previous v3 codes missed this macro in this case.
> Therefore, I've fixed this part as well.
>
> C codes are checked by pgindent.
>
> Note that this depends on the v20 skip xide patch in [1]
>

Thanks for the updated patch, Few comments:
1)  tab completion should be added for disable_on_error:
/* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
  "enabled", "slot_name", "streaming",
  "synchronous_commit", "two_phase");

2) disable_on_error is supported by alter subscription, the same
should be documented:
@ -871,11 +886,19 @@ AlterSubscription(ParseState *pstate,
AlterSubscriptionStmt *stmt,
                        {
                                supported_opts = (SUBOPT_SLOT_NAME |

SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
-
SUBOPT_STREAMING);
+
SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR);

                                parse_subscription_options(pstate,
stmt->options,

            supported_opts, &opts);

+                               if (IsSet(opts.specified_opts,
SUBOPT_DISABLE_ON_ERR))
+                               {
+
values[Anum_pg_subscription_subdisableonerr - 1]
+                                               =
BoolGetDatum(opts.disableonerr);
+
replaces[Anum_pg_subscription_subdisableonerr - 1]
+                                               = true;
+                               }
+

3) Describe subscriptions (dRs+) should include displaying of disableonerr:
\dRs+ sub1
                                                      List of subscriptions
 Name |  Owner  | Enabled | Publication | Binary | Streaming | Two
phase commit | Synchronous commit |         Conninfo

------+---------+---------+-------------+--------+-----------+------------------+--------------------+---------------------------
 sub1 | vignesh | t       | {pub1}      | f      | f         | d
         | off                | dbname=postgres port=5432
(1 row)

4) I felt transicent should be transient, might be a typo:
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transicent errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.

5) The commented use PostgresNode and use TestLib can be removed:
+# Test of logical replication subscription self-disabling feature
+use strict;
+use warnings;
+# use PostgresNode;
+# use TestLib;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 10;

Regards,
Vignesh



Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Thu, Nov 11, 2021 at 8:20 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> C codes are checked by pgindent.
>
> Note that this depends on the v20 skip xide patch in [1]
>

Some comments on the v4 patch:

(1) Patch subject
I think the patch subject should say "disable" instead of "disabling":
   Optionally disable subscriptions on error

doc/src/sgml/ref/create_subscription.sgml
(2) spelling mistake
+          if replicating data from the publisher triggers non-transicent errors

non-transicent -> non-transient

(I notice Vignesh also pointed this out)

src/backend/replication/logical/worker.c
(3) calling geterrcode()
The new IsSubscriptionDisablingError() function calls geterrcode().
According to the function comment for geterrcode(), it is only
intended for use in error callbacks.
Instead of calling geterrcode(), couldn't the ErrorData from PG_CATCH
block be passed to IsSubscriptionDisablingError() instead (from which
it can get the sqlerrcode)?

(4) DisableSubscriptionOnError
DisableSubscriptionOnError() is again calling MemoryContextSwitch()
and CopyErrorData().
I think the ErrorData from the PG_CATCH block could simply be passed
to DisableSubscriptionOnError() instead of the memory-context, and the
existing MemoryContextSwitch() and CopyErrorData() calls could be
removed from it.

AFAICS, applying (3) and (4) above would make the code a lot cleaner.


Regards,
Greg Nancarrow
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, November 12, 2021 1:09 PM vignesh C <vignesh21@gmail.com> wrote:
> On Thu, Nov 11, 2021 at 2:50 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> Thanks for the updated patch, Few comments:
> 1)  tab completion should be added for disable_on_error:
> /* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */ else if
> (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
> COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
>   "enabled", "slot_name", "streaming",
>   "synchronous_commit", "two_phase");
Fixed.

> 2) disable_on_error is supported by alter subscription, the same should be
> documented:
> @ -871,11 +886,19 @@ AlterSubscription(ParseState *pstate,
> AlterSubscriptionStmt *stmt,
>                         {
>                                 supported_opts = (SUBOPT_SLOT_NAME |
> 
> SUBOPT_SYNCHRONOUS_COMMIT | SUBOPT_BINARY |
> -
> SUBOPT_STREAMING);
> +
> SUBOPT_STREAMING | SUBOPT_DISABLE_ON_ERR);
> 
>                                 parse_subscription_options(pstate,
> stmt->options,
> 
>             supported_opts, &opts);
> 
> +                               if (IsSet(opts.specified_opts,
> SUBOPT_DISABLE_ON_ERR))
> +                               {
> +
> values[Anum_pg_subscription_subdisableonerr - 1]
> +                                               =
> BoolGetDatum(opts.disableonerr);
> +
> replaces[Anum_pg_subscription_subdisableonerr - 1]
> +                                               = true;
> +                               }
> +
Fixed the documentation. Also, add one test for alter subscription.

 
> 3) Describe subscriptions (dRs+) should include displaying of disableonerr:
> \dRs+ sub1
>                                                       List of subscriptions
> Name |  Owner  | Enabled | Publication | Binary | Streaming | Two
> phase commit | Synchronous commit |         Conninfo
> ------+---------+---------+-------------+--------+-----------+----------
> --------+--------------------+---------------------------
>  sub1 | vignesh | t       | {pub1}      | f      | f         | d
>          | off                | dbname=postgres port=5432
> (1 row)
Fixed.


> 4) I felt transicent should be transient, might be a typo:
> +          Specifies whether the subscription should be automatically
> disabled
> +          if replicating data from the publisher triggers non-transicent errors
> +          such as referential integrity or permissions errors. The default is
> +          <literal>false</literal>.
Fixed.

> 5) The commented use PostgresNode and use TestLib can be removed:
> +# Test of logical replication subscription self-disabling feature use
> +strict; use warnings; # use PostgresNode; # use TestLib; use
> +PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More
> +tests => 10;
Removed.


Also, my colleague Greg provided an offlist patch to me and
I've incorporated his suggested modifications into this version.
So, I noted his name as a coauthor.

C codes are checked by pgindent again.

This v5 depends on v23 skip xid in [1].

[1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
Thank you for checking the patch !

On Friday, November 12, 2021 1:49 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Thu, Nov 11, 2021 at 8:20 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> Some comments on the v4 patch:
> 
> (1) Patch subject
> I think the patch subject should say "disable" instead of "disabling":
>    Optionally disable subscriptions on error
Fixed.

 
> doc/src/sgml/ref/create_subscription.sgml
> (2) spelling mistake
> +          if replicating data from the publisher triggers
> + non-transicent errors
> 
> non-transicent -> non-transient
Fixed.

 
> (I notice Vignesh also pointed this out)
> 
> src/backend/replication/logical/worker.c
> (3) calling geterrcode()
> The new IsSubscriptionDisablingError() function calls geterrcode().
> According to the function comment for geterrcode(), it is only intended for use
> in error callbacks.
> Instead of calling geterrcode(), couldn't the ErrorData from PG_CATCH block be
> passed to IsSubscriptionDisablingError() instead (from which it can get the
> sqlerrcode)?
> 
> (4) DisableSubscriptionOnError
> DisableSubscriptionOnError() is again calling MemoryContextSwitch() and
> CopyErrorData().
> I think the ErrorData from the PG_CATCH block could simply be passed to
> DisableSubscriptionOnError() instead of the memory-context, and the existing
> MemoryContextSwitch() and CopyErrorData() calls could be removed from it.
> 
> AFAICS, applying (3) and (4) above would make the code a lot cleaner.
Fixed.

The updated patch is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373771371B31E1E6CC74B0AED999%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Tue, Nov 16, 2021 at 6:53 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> This v5 depends on v23 skip xid in [1].
>

A minor comment:

doc/src/sgml/ref/alter_subscription.sgml
(1) disable_on_err?

+      <literal>disable_on_err</literal>.

This doc update names the new parameter as "disable_on_err" instead of
"disable_on_error".
Also "disable_on_err" appears in a couple of the test case comments.

Regards,
Greg Nancarrow
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, November 18, 2021 2:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> A minor comment:
Thanks for your comments !
 
> doc/src/sgml/ref/alter_subscription.sgml
> (1) disable_on_err?
> 
> +      <literal>disable_on_err</literal>.
> 
> This doc update names the new parameter as "disable_on_err" instead of
> "disable_on_error".
> Also "disable_on_err" appears in a couple of the test case comments.
Fixed all 3 places.

At the same time, I changed one function name
from IsSubscriptionDisablingError() to IsTransientError()
so that it can express what it checks correctly.
Of course, the return value of true or false
becomes reverse by this name change, but
This would make the function more general.
Also, its comments were fixed.

This version also depends on the v23 of skip xid [1]


[1] - https://www.postgresql.org/message-id/CAD21AoA5jupM6O%3DpYsyfaxQ1aMX-en8%3DQNgpW6KfXsg7_CS0CQ%40mail.gmail.com


Best Regards,
    Takamichi Osumi


Attachment
On Thu, Nov 18, 2021 at 12:52 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Thursday, November 18, 2021 2:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > A minor comment:
> Thanks for your comments !
>
> > doc/src/sgml/ref/alter_subscription.sgml
> > (1) disable_on_err?
> >
> > +      <literal>disable_on_err</literal>.
> >
> > This doc update names the new parameter as "disable_on_err" instead of
> > "disable_on_error".
> > Also "disable_on_err" appears in a couple of the test case comments.
> Fixed all 3 places.
>
> At the same time, I changed one function name
> from IsSubscriptionDisablingError() to IsTransientError()
> so that it can express what it checks correctly.
> Of course, the return value of true or false
> becomes reverse by this name change, but
> This would make the function more general.
> Also, its comments were fixed.
>
> This version also depends on the v23 of skip xid [1]

Few comments:
1) Changes to handle pg_dump are missing. It should be done in
dumpSubscription and getSubscriptions

2) "And" is missing
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>
+      <literal>disable_on_error</literal>.
Should be:
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.

3) Should we change this :
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
to:
+          Specifies whether the subscription should be automatically disabled
+          while replicating data from the publisher triggers
non-transient errors
+          such as referential integrity, permissions errors, etc. The
default is
+          <literal>false</literal>.

Regards,
Vignesh



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, November 22, 2021 3:53 PM vignesh C <vignesh21@gmail.com> wrote:
> Few comments:
Thank you so much for your review !

> 1) Changes to handle pg_dump are missing. It should be done in
> dumpSubscription and getSubscriptions
Fixed.

> 2) "And" is missing
> --- a/doc/src/sgml/ref/alter_subscription.sgml
> +++ b/doc/src/sgml/ref/alter_subscription.sgml
> @@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
> class="parameter">name</replaceable> RENAME TO <
>        information.  The parameters that can be altered
>        are <literal>slot_name</literal>,
>        <literal>synchronous_commit</literal>,
> -      <literal>binary</literal>, and
> -      <literal>streaming</literal>.
> +      <literal>binary</literal>,<literal>streaming</literal>
> +      <literal>disable_on_error</literal>.
> Should be:
> -      <literal>binary</literal>, and
> -      <literal>streaming</literal>.
> +      <literal>binary</literal>,<literal>streaming</literal>, and
> +      <literal>disable_on_error</literal>.
Fixed.

> 3) Should we change this :
> +          Specifies whether the subscription should be automatically
> disabled
> +          if replicating data from the publisher triggers non-transient errors
> +          such as referential integrity or permissions errors. The default is
> +          <literal>false</literal>.
> to:
> +          Specifies whether the subscription should be automatically
> disabled
> +          while replicating data from the publisher triggers
> non-transient errors
> +          such as referential integrity, permissions errors, etc. The
> default is
> +          <literal>false</literal>.
I preferred the previous description. The option
"disable_on_error" works with even one error.
If we use "while", the nuance would be like
we keep disabling a subscription more than once.
This situation happens only when user makes
the subscription enable without resolving the non-transient error,
which sounds a bit unnatural. So, I wanna keep the previous description.
If you are not satisfied with this, kindly let me know.

This v7 uses v26 of skip xid patch [1]

[1] - https://www.postgresql.org/message-id/CAD21AoDNe_O%2BCPucd_jQPu3gGGaCLNP%2BJ_kSPNecTdAM8HFPww%40mail.gmail.com

Best Regards,
    Takamichi Osumi


Attachment
On Fri, Nov 26, 2021 at 8:06 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, November 22, 2021 3:53 PM vignesh C <vignesh21@gmail.com> wrote:
> > Few comments:
> Thank you so much for your review !
>
> > 1) Changes to handle pg_dump are missing. It should be done in
> > dumpSubscription and getSubscriptions
> Fixed.
>
> > 2) "And" is missing
> > --- a/doc/src/sgml/ref/alter_subscription.sgml
> > +++ b/doc/src/sgml/ref/alter_subscription.sgml
> > @@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
> > class="parameter">name</replaceable> RENAME TO <
> >        information.  The parameters that can be altered
> >        are <literal>slot_name</literal>,
> >        <literal>synchronous_commit</literal>,
> > -      <literal>binary</literal>, and
> > -      <literal>streaming</literal>.
> > +      <literal>binary</literal>,<literal>streaming</literal>
> > +      <literal>disable_on_error</literal>.
> > Should be:
> > -      <literal>binary</literal>, and
> > -      <literal>streaming</literal>.
> > +      <literal>binary</literal>,<literal>streaming</literal>, and
> > +      <literal>disable_on_error</literal>.
> Fixed.
>
> > 3) Should we change this :
> > +          Specifies whether the subscription should be automatically
> > disabled
> > +          if replicating data from the publisher triggers non-transient errors
> > +          such as referential integrity or permissions errors. The default is
> > +          <literal>false</literal>.
> > to:
> > +          Specifies whether the subscription should be automatically
> > disabled
> > +          while replicating data from the publisher triggers
> > non-transient errors
> > +          such as referential integrity, permissions errors, etc. The
> > default is
> > +          <literal>false</literal>.
> I preferred the previous description. The option
> "disable_on_error" works with even one error.
> If we use "while", the nuance would be like
> we keep disabling a subscription more than once.
> This situation happens only when user makes
> the subscription enable without resolving the non-transient error,
> which sounds a bit unnatural. So, I wanna keep the previous description.
> If you are not satisfied with this, kindly let me know.
>
> This v7 uses v26 of skip xid patch [1]

Thanks for the updated patch, Few comments:
1) Since this function is used only from 027_disable_on_error and not
used by others, this can be moved to 027_disable_on_error:
+sub wait_for_subscriptions
+{
+       my ($self, $dbname, @subscriptions) = @_;
+
+       # Unique-ify the subscriptions passed by the caller
+       my %unique = map { $_ => 1 } @subscriptions;
+       my @unique = sort keys %unique;
+       my $unique_count = scalar(@unique);
+
+       # Construct a SQL list from the unique subscription names
+       my @escaped = map { s/'/''/g; s/\\/\\\\/g; $_ } @unique;
+       my $sublist = join(', ', map { "'$_'" } @escaped);
+
+       my $polling_sql = qq(
+               SELECT COUNT(1) = $unique_count FROM
+                       (SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               LEFT JOIN pg_catalog.pg_subscription_rel sr
+                               ON sr.srsubid = s.oid
+                               WHERE (sr IS NULL OR sr.srsubstate IN
('s', 'r'))
+                                 AND s.subname IN ($sublist)
+                                 AND s.subenabled IS TRUE
+                        UNION
+                        SELECT s.oid
+                               FROM pg_catalog.pg_subscription s
+                               WHERE s.subname IN ($sublist)
+                                 AND s.subenabled IS FALSE
+                       ) AS synced_or_disabled
+               );
+       return $self->poll_query_until($dbname, $polling_sql);
+}

2) The empty line after comment is not required, it can be removed
+# Create non-unique data in both schemas on the publisher.
+#
+for $schema (@schemas)
+{

3) Similarly it can be changed across the file
+# Wait for the initial subscription synchronizations to finish or fail.
+#
+$node_subscriber->wait_for_subscriptions('postgres', @schemas)
+       or die "Timed out while waiting for subscriber to synchronize data";

+# Enter unique data for both schemas on the publisher.  This should succeed on
+# the publisher node, and not cause any additional problems on the subscriber
+# side either, though disabled subscription "s1" should not replicate anything.
+#
+for $schema (@schemas)

4) Since subid is used only at one place, no need of subid variable,
you could replace subid with subform->oid in LockSharedObject
+       Datum           values[Natts_pg_subscription];
+       HeapTuple       tup;
+       Oid                     subid;
+       Form_pg_subscription subform;

+       subid = subform->oid;
+       LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

5) "permissions errors" should be "permission errors"
+          Specifies whether the subscription should be automatically disabled
+          if replicating data from the publisher triggers non-transient errors
+          such as referential integrity or permissions errors. The default is
+          <literal>false</literal>.
+         </para>

Regards,
Vignesh



Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> This v7 uses v26 of skip xid patch [1]
>

This patch no longer applies on the latest source.
Also, the patch is missing an update to doc/src/sgml/catalogs.sgml,
for the new "subdisableonerr" column of pg_subscription.


Regards,
Greg Nancarrow
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > This v7 uses v26 of skip xid patch [1]
> This patch no longer applies on the latest source.
> Also, the patch is missing an update to doc/src/sgml/catalogs.sgml, for the
> new "subdisableonerr" column of pg_subscription.
Thanks for your review !

Fixed the documentation accordingly. Further,
this comment invoked some more refactoring of codes
since I wrote some internal codes related to
'disable_on_error' in an inconsistent order.
I fixed this by keeping patch's codes
after that of 'two_phase' subscription option as much as possible.

I also conducted both pgindent and pgperltidy.

Now, I'll share the v8 that uses PG
whose commit id is after 8d74fc9 (pg_stat_subscription_workers).

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, November 29, 2021 2:38 PM vignesh C <vignesh21@gmail.com>
> Thanks for the updated patch, Few comments:
Thank you for your review !

> 1) Since this function is used only from 027_disable_on_error and not used by
> others, this can be moved to 027_disable_on_error:
> +sub wait_for_subscriptions
> +{
> +       my ($self, $dbname, @subscriptions) = @_;
> +
> +       # Unique-ify the subscriptions passed by the caller
> +       my %unique = map { $_ => 1 } @subscriptions;
> +       my @unique = sort keys %unique;
> +       my $unique_count = scalar(@unique);
> +
> +       # Construct a SQL list from the unique subscription names
> +       my @escaped = map { s/'/''/g; s/\\/\\\\/g; $_ } @unique;
> +       my $sublist = join(', ', map { "'$_'" } @escaped);
> +
> +       my $polling_sql = qq(
> +               SELECT COUNT(1) = $unique_count FROM
> +                       (SELECT s.oid
> +                               FROM pg_catalog.pg_subscription s
> +                               LEFT JOIN pg_catalog.pg_subscription_rel
> sr
> +                               ON sr.srsubid = s.oid
> +                               WHERE (sr IS NULL OR sr.srsubstate IN
> ('s', 'r'))
> +                                 AND s.subname IN ($sublist)
> +                                 AND s.subenabled IS TRUE
> +                        UNION
> +                        SELECT s.oid
> +                               FROM pg_catalog.pg_subscription s
> +                               WHERE s.subname IN ($sublist)
> +                                 AND s.subenabled IS FALSE
> +                       ) AS synced_or_disabled
> +               );
> +       return $self->poll_query_until($dbname, $polling_sql); }
Fixed.

> 2) The empty line after comment is not required, it can be removed
> +# Create non-unique data in both schemas on the publisher.
> +#
> +for $schema (@schemas)
> +{
Fixed.

> 3) Similarly it can be changed across the file
> +# Wait for the initial subscription synchronizations to finish or fail.
> +#
> +$node_subscriber->wait_for_subscriptions('postgres', @schemas)
> +       or die "Timed out while waiting for subscriber to synchronize
> +data";
> 
> +# Enter unique data for both schemas on the publisher.  This should
> +succeed on # the publisher node, and not cause any additional problems
> +on the subscriber # side either, though disabled subscription "s1" should not
> replicate anything.
> +#
> +for $schema (@schemas)
Fixed.
 
> 4) Since subid is used only at one place, no need of subid variable, you could
> replace subid with subform->oid in LockSharedObject
> +       Datum           values[Natts_pg_subscription];
> +       HeapTuple       tup;
> +       Oid                     subid;
> +       Form_pg_subscription subform;
> 
> +       subid = subform->oid;
> +       LockSharedObject(SubscriptionRelationId, subid, 0,
> + AccessExclusiveLock);
Fixed.

> 5) "permissions errors" should be "permission errors"
> +          Specifies whether the subscription should be automatically
> disabled
> +          if replicating data from the publisher triggers non-transient errors
> +          such as referential integrity or permissions errors. The default is
> +          <literal>false</literal>.
> +         </para>
Fixed.

The new patch v8 is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB83735AA021E0F614A3AB3221ED679%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi


On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > This v7 uses v26 of skip xid patch [1]
> > This patch no longer applies on the latest source.
> > Also, the patch is missing an update to doc/src/sgml/catalogs.sgml, for the
> > new "subdisableonerr" column of pg_subscription.
> Thanks for your review !
>
> Fixed the documentation accordingly. Further,
> this comment invoked some more refactoring of codes
> since I wrote some internal codes related to
> 'disable_on_error' in an inconsistent order.
> I fixed this by keeping patch's codes
> after that of 'two_phase' subscription option as much as possible.
>
> I also conducted both pgindent and pgperltidy.
>
> Now, I'll share the v8 that uses PG
> whose commit id is after 8d74fc9 (pg_stat_subscription_workers).

Thanks for the updated patch, few small comments:
1) This should be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects an error
+      </para></entry>
+     </row>

to:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects non-transient errors
+      </para></entry>
+     </row>


2) "Disable On Err" can be changed to "Disable On Error"
+                                                         ",
subtwophasestate AS \"%s\"\n"
+                                                         ",
subdisableonerr AS \"%s\"\n",
+
gettext_noop("Two phase commit"),
+
gettext_noop("Disable On Err"));

3) Can add a line in the commit message saying "Bump catalog version."
as the patch involves changing the catalog.

4) This prototype is not required, since the function is called after
the function definition:
 static inline void set_apply_error_context_xact(TransactionId xid,
TimestampTz ts);
 static inline void reset_apply_error_context_info(void);
+static bool IsTransientError(ErrorData *edata);

5) we could use the new style here:
+       ereport(LOG,
+                       (errmsg("logical replication subscription
\"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message)));

change it to:
+       ereport(LOG,
+                       errmsg("logical replication subscription
\"%s\" will be disabled due to error: %s",
+                                       MySubscription->name, edata->message));

Similarly it can be changed in the other ereports added.

Regards,
Vignesh



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, December 1, 2021 3:02 PM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Tuesday, November 30, 2021 1:10 PM Greg Nancarrow
> <gregn4422@gmail.com> wrote:
> > > On Sat, Nov 27, 2021 at 1:36 AM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > This v7 uses v26 of skip xid patch [1]
> > > This patch no longer applies on the latest source.
> > > Also, the patch is missing an update to doc/src/sgml/catalogs.sgml,
> > > for the new "subdisableonerr" column of pg_subscription.
> > Thanks for your review !
> >
> > Fixed the documentation accordingly. Further, this comment invoked
> > some more refactoring of codes since I wrote some internal codes
> > related to 'disable_on_error' in an inconsistent order.
> > I fixed this by keeping patch's codes
> > after that of 'two_phase' subscription option as much as possible.
> >
> > I also conducted both pgindent and pgperltidy.
> >
> > Now, I'll share the v8 that uses PG
> > whose commit id is after 8d74fc9 (pg_stat_subscription_workers).
> 
> Thanks for the updated patch, few small comments:
I appreciate your check.

> 1) This should be changed:
> +       <structfield>subdisableonerr</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       If true, the subscription will be disabled when subscription
> +       worker detects an error
> +      </para></entry>
> +     </row>
> 
> to:
> +       <structfield>subdisableonerr</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       If true, the subscription will be disabled when subscription
> +       worker detects non-transient errors
> +      </para></entry>
> +     </row>
Fixed. Actually, there's no clear definition what "non-transient" means
in the documentation. So, I added some words to your suggestion,
which would give clearer understanding to users.

> 2) "Disable On Err" can be changed to "Disable On Error"
> +                                                         ",
> subtwophasestate AS \"%s\"\n"
> +                                                         ",
> subdisableonerr AS \"%s\"\n",
> +
> gettext_noop("Two phase commit"),
> +
> gettext_noop("Disable On Err"));
Fixed.

> 3) Can add a line in the commit message saying "Bump catalog version."
> as the patch involves changing the catalog.
Hmm, let me postpone this fix till the final version.
The catalog version gets easily updated by other patch commits
and including it in the middle of development can become
cause of conflicts of my patch when applied to the PG,
which is possible to make other reviewers stop reviewing.

> 4) This prototype is not required, since the function is called after the function
> definition:
>  static inline void set_apply_error_context_xact(TransactionId xid,
> TimestampTz ts);  static inline void reset_apply_error_context_info(void);
> +static bool IsTransientError(ErrorData *edata);
Fixed.

> 5) we could use the new style here:
> +       ereport(LOG,
> +                       (errmsg("logical replication subscription
> \"%s\" will be disabled due to error: %s",
> +                                       MySubscription->name,
> + edata->message)));
> 
> change it to:
> +       ereport(LOG,
> +                       errmsg("logical replication subscription
> \"%s\" will be disabled due to error: %s",
> +                                       MySubscription->name,
> + edata->message));
> 
> Similarly it can be changed in the other ereports added.
Removed the unnecessary parentheses.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Wed, Dec 1, 2021 at 5:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, December 1, 2021 3:02 PM vignesh C <vignesh21@gmail.com> wrote:
> > On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
>
> > 3) Can add a line in the commit message saying "Bump catalog version."
> > as the patch involves changing the catalog.
> Hmm, let me postpone this fix till the final version.
> The catalog version gets easily updated by other patch commits
> and including it in the middle of development can become
> cause of conflicts of my patch when applied to the PG,
> which is possible to make other reviewers stop reviewing.
>

Vignesh seems to be suggesting just changing the commit message, not
the actual code. This is sort of a reminder to the committer to change
the catversion before pushing the patch. So that shouldn't cause any
conflicts while applying your patch.

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Dec 1, 2021 at 5:55 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Wednesday, December 1, 2021 3:02 PM vignesh C
> <vignesh21@gmail.com> wrote:
> > > On Tue, Nov 30, 2021 at 5:34 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> >
> > > 3) Can add a line in the commit message saying "Bump catalog version."
> > > as the patch involves changing the catalog.
> > Hmm, let me postpone this fix till the final version.
> > The catalog version gets easily updated by other patch commits and
> > including it in the middle of development can become cause of
> > conflicts of my patch when applied to the PG, which is possible to
> > make other reviewers stop reviewing.
> >
> 
> Vignesh seems to be suggesting just changing the commit message, not the
> actual code. This is sort of a reminder to the committer to change the catversion
> before pushing the patch. So that shouldn't cause any conflicts while applying
> your patch.
Ah, sorry for my misunderstanding.
Updated the patch to include the notification.


Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Thu, Dec 2, 2021 at 12:05 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Updated the patch to include the notification.
>

For the catalogs.sgml update, I was thinking that the following
wording might sound a bit better:

+       If true, the subscription will be disabled when a subscription
+       worker detects non-transient errors (e.g. duplication error)
+       that require user intervention to resolve.

What do you think?

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, December 1, 2021 10:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Updated the patch to include the notification.
>

The patch disables the subscription for non-transient errors. I am not
sure if we can easily make the call to decide whether any particular
error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
might not rectify itself. Why not just allow to disable the
subscription on any error? And then let the user check the error
either in view or logs and decide whether it would like to enable the
subscription or do something before it (like making space in disk, or
fixing the network).

The other problem I see with this transient error stuff is maintaining
the list of error codes that we think are transient. I think we need a
discussion for each of the error_codes we are listing now and whatever
new error_code we add in the future which doesn't seem like a good
idea.

I think the code to deal with apply worker errors and then disable the
subscription has some flaws. Say, while disabling the subscription if
it leads to another error then I think the original error won't be
reported.  Can't we simply emit the error via EmitErrorReport and then
do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
context clean up if required and then disable the subscription after
coming out of catch?

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, December 2, 2021 1:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Wednesday, December 1, 2021 10:16 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > Updated the patch to include the notification.
> >
> The patch disables the subscription for non-transient errors. I am not sure if we
> can easily make the call to decide whether any particular error is transient or
> not. For example, DISK_FULL or OUT_OF_MEMORY might not rectify itself.
> Why not just allow to disable the subscription on any error? And then let the
> user check the error either in view or logs and decide whether it would like to
> enable the subscription or do something before it (like making space in disk, or
> fixing the network).
Agreed. I'll treat any errors as the trigger of the feature
in the next version.

> The other problem I see with this transient error stuff is maintaining the list of
> error codes that we think are transient. I think we need a discussion for each of
> the error_codes we are listing now and whatever new error_code we add in the
> future which doesn't seem like a good idea.
This is also true. The maintenance cost of my current implementation
didn't sound cheap.

> I think the code to deal with apply worker errors and then disable the
> subscription has some flaws. Say, while disabling the subscription if it leads to
> another error then I think the original error won't be reported.  Can't we simply
> emit the error via EmitErrorReport and then do AbortOutOfAnyTransaction,
> FlushErrorState, and any other memory context clean up if required and then
> disable the subscription after coming out of catch?
You are right. I'll fix related parts accordingly.

Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
Thursday, December 2, 2021 4:41 PM I wrote:
> On Thursday, December 2, 2021 1:49 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > On Thu, Dec 2, 2021 at 6:35 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Wednesday, December 1, 2021 10:16 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > Updated the patch to include the notification.
> > >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> might not rectify itself.
> > Why not just allow to disable the subscription on any error? And then
> > let the user check the error either in view or logs and decide whether
> > it would like to enable the subscription or do something before it
> > (like making space in disk, or fixing the network).
> Agreed. I'll treat any errors as the trigger of the feature in the next version.
> 
> > The other problem I see with this transient error stuff is maintaining
> > the list of error codes that we think are transient. I think we need a
> > discussion for each of the error_codes we are listing now and whatever
> > new error_code we add in the future which doesn't seem like a good idea.
> This is also true. The maintenance cost of my current implementation didn't
> sound cheap.
> 
> > I think the code to deal with apply worker errors and then disable the
> > subscription has some flaws. Say, while disabling the subscription if
> > it leads to another error then I think the original error won't be
> > reported.  Can't we simply emit the error via EmitErrorReport and then
> > do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
> > context clean up if required and then disable the subscription after coming
> out of catch?
> You are right. I'll fix related parts accordingly.
Hi, I've made a new patch v11 that incorporated suggestions described above.

There are several notes to share regarding v11 modifications.

1. Modified the commit message a bit.

2. DisableSubscriptionOnError() doesn't return ErrData anymore,
since now to emit error message is done in the error recovery area
and the function purpose has become purely to run a transaction to disable
the subscription.

3. In DisableSubscriptionOnError(), v10 rethrew the error if the disable_on_error
flag became false in the interim, but v11 just closes the transaction and
finishes the function.

4. If table sync worker detects an error during synchronization
and needs to disable the subscription, the worker disables it and just exit by proc_exit.
The processing after disabling the subscription didn't look necessary to me
for disabled subscription.

5. Only when we succeed in the table synchronization, it's necessary to
allocate slot name in long-lived context, after the table synchronization in
ApplyWorkerMain(). Otherwise, we'll see junk value of syncslotname
because it is the return value of LogicalRepSyncTableStart().

6. There are 3 places for error recovery in ApplyWorkerMain().
All of those might look similar but I didn't make an united function for them.
Those are slightly different from each other and I felt
readability is reduced by grouping them into one type of function call.

7. In v11, I covered the case that apply worker failed with
apply_error_callback_arg.command == 0, as one path to disable the subscription
in order to cover all errors.

8. I changed one flag name from 'disable_subscription' to 'did_error'
in ApplyWorkerMain().

9. All chages in this version are C codes and checked by pgindent.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hi, I've made a new patch v11 that incorporated suggestions described above.
>

Some review comments for the v11 patch:

doc/src/sgml/ref/create_subscription.sgml
(1) Possible wording improvement?

BEFORE:
+  Specifies whether the subscription should be automatically disabled
+  if replicating data from the publisher triggers errors. The default
+  is <literal>false</literal>.
AFTER:
+  Specifies whether the subscription should be automatically disabled
+  if any errors are detected by subscription workers during data
+  replication from the publisher. The default is <literal>false</literal>.

src/backend/replication/logical/worker.c
(2) WorkerErrorRecovery comments
Instead of:

+ * As a preparation for disabling the subscription, emit the error,
+ * handle the transaction and clean up the memory context of
+ * error. ErrorContext is reset by FlushErrorState.

why not just say:

+ Worker error recovery processing, in preparation for disabling the
+ subscription.

And then comment the function's code lines:

e.g.

/* Emit the error */
...
/* Abort any active transaction */
...
/* Reset the ErrorContext */
...

(3) DisableSubscriptionOnError return

The "if (!subform->subdisableonerr)" block should probably first:
   heap_freetuple(tup);

(regardless of the fact the only current caller will proc_exit anyway)

(4) did_error flag

I think perhaps the previously-used flag name "disable_subscription"
is better, or maybe "error_recovery_done".
Also, I think it would look better if it was set AFTER
WorkerErrorRecovery() was called.

(5) DisableSubscriptionOnError LOG message

This version of the patch removes the LOG message:

+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due
to error: %s",
+    MySubscription->name, edata->message));

Perhaps a similar error message could be logged prior to EmitErrorReport()?

e.g.
 "logical replication subscription \"%s\" will be disabled due to an error"


Regards,
Greg Nancarrow
Fujitsu Australia



Re: Optionally automatically disable logical replication subscriptions on error

From
Mark Dilger
Date:

> On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The patch disables the subscription for non-transient errors. I am not
> sure if we can easily make the call to decide whether any particular
> error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> might not rectify itself. Why not just allow to disable the
> subscription on any error? And then let the user check the error
> either in view or logs and decide whether it would like to enable the
> subscription or do something before it (like making space in disk, or
> fixing the network).

The original idea of the patch, back when I first wrote and proposed it, was to remove the *absurdity* of retrying a
transactionwhich, in the absence of human intervention, was guaranteed to simply fail again ad infinitum.  Retrying in
theface of resource errors is not *absurd* even though it might fail again ad infinitum.  The reason is that there is
atleast a chance that the situation will clear up without human intervention. 

> The other problem I see with this transient error stuff is maintaining
> the list of error codes that we think are transient. I think we need a
> discussion for each of the error_codes we are listing now and whatever
> new error_code we add in the future which doesn't seem like a good
> idea.

A reasonable rule might be:  "the subscription will be disabled if the server can determine that retries cannot
possiblysucceed without human intervention."  We shouldn't need to categorize all error codes perfectly, as long as
we'reconservative.  What I propose is similar to how we determine whether to mark a function leakproof; we don't have
tomark all leakproof functions as such, we just can't mark one as such if it is not. 

If we're going to debate the error codes, I think we would start with an empty list, and add to the list on sufficient
analysis.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, December 6, 2021 1:38 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> > might not rectify itself. Why not just allow to disable the
> > subscription on any error? And then let the user check the error
> > either in view or logs and decide whether it would like to enable the
> > subscription or do something before it (like making space in disk, or
> > fixing the network).
> 
> The original idea of the patch, back when I first wrote and proposed it, was to
> remove the *absurdity* of retrying a transaction which, in the absence of
> human intervention, was guaranteed to simply fail again ad infinitum.
> Retrying in the face of resource errors is not *absurd* even though it might fail
> again ad infinitum.  The reason is that there is at least a chance that the
> situation will clear up without human intervention.
In my humble opinion, I felt the original purpose of the patch was to partially remedy
the situation that during the failure of apply, the apply process keeps going
into the infinite error loop.

I'd say that in this sense, if we include such resource errors, we fail to achieve
the purpose in some cases, because of some left possibilities of infinite loop.
Disabling the subscription with even one any error excludes this irregular possibility,
since there's no room to continue the infinite loop.

Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, December 6, 2021 1:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Hi, I've made a new patch v11 that incorporated suggestions described
> above.
> >
> 
> Some review comments for the v11 patch:
Thank you for your reviews !
 
> doc/src/sgml/ref/create_subscription.sgml
> (1) Possible wording improvement?
> 
> BEFORE:
> +  Specifies whether the subscription should be automatically disabled
> + if replicating data from the publisher triggers errors. The default
> + is <literal>false</literal>.
> AFTER:
> +  Specifies whether the subscription should be automatically disabled
> + if any errors are detected by subscription workers during data
> + replication from the publisher. The default is <literal>false</literal>.
Fixed.

> src/backend/replication/logical/worker.c
> (2) WorkerErrorRecovery comments
> Instead of:
> 
> + * As a preparation for disabling the subscription, emit the error,
> + * handle the transaction and clean up the memory context of
> + * error. ErrorContext is reset by FlushErrorState.
> 
> why not just say:
> 
> + Worker error recovery processing, in preparation for disabling the
> + subscription.
> 
> And then comment the function's code lines:
> 
> e.g.
> 
> /* Emit the error */
> ...
> /* Abort any active transaction */
> ...
> /* Reset the ErrorContext */
> ...
Agreed. Fixed.
 
> (3) DisableSubscriptionOnError return
> 
> The "if (!subform->subdisableonerr)" block should probably first:
>    heap_freetuple(tup);
> 
> (regardless of the fact the only current caller will proc_exit anyway)
Fixed.
 
> (4) did_error flag
> 
> I think perhaps the previously-used flag name "disable_subscription"
> is better, or maybe "error_recovery_done".
> Also, I think it would look better if it was set AFTER
> WorkerErrorRecovery() was called.
Adopted error_recovery_done
and changed its places accordingly.
 
> (5) DisableSubscriptionOnError LOG message
> 
> This version of the patch removes the LOG message:
> 
> + ereport(LOG,
> + errmsg("logical replication subscription \"%s\" will be disabled due
> to error: %s",
> +    MySubscription->name, edata->message));
> 
> Perhaps a similar error message could be logged prior to EmitErrorReport()?
> 
> e.g.
>  "logical replication subscription \"%s\" will be disabled due to an error"
Added.

I've attached the new version v12.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Mon, Dec 6, 2021 at 10:07 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> > On Dec 1, 2021, at 8:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> > might not rectify itself. Why not just allow to disable the
> > subscription on any error? And then let the user check the error
> > either in view or logs and decide whether it would like to enable the
> > subscription or do something before it (like making space in disk, or
> > fixing the network).
>
> The original idea of the patch, back when I first wrote and proposed it, was to remove the *absurdity* of retrying a
transactionwhich, in the absence of human intervention, was guaranteed to simply fail again ad infinitum.  Retrying in
theface of resource errors is not *absurd* even though it might fail again ad infinitum.  The reason is that there is
atleast a chance that the situation will clear up without human intervention. 
>
> > The other problem I see with this transient error stuff is maintaining
> > the list of error codes that we think are transient. I think we need a
> > discussion for each of the error_codes we are listing now and whatever
> > new error_code we add in the future which doesn't seem like a good
> > idea.
>
> A reasonable rule might be:  "the subscription will be disabled if the server can determine that retries cannot
possiblysucceed without human intervention."  We shouldn't need to categorize all error codes perfectly, as long as
we'reconservative.  What I propose is similar to how we determine whether to mark a function leakproof; we don't have
tomark all leakproof functions as such, we just can't mark one as such if it is not. 
>
> If we're going to debate the error codes, I think we would start with an empty list, and add to the list on
sufficientanalysis. 
>

Yeah, an empty list is a sort of what I thought was a good start
point. I feel we should learn from real-world use cases to see if
people really want to continue retrying even after using this option.


--
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Mark Dilger
Date:

> On Dec 5, 2021, at 10:56 PM, osumi.takamichi@fujitsu.com wrote:
>
> In my humble opinion, I felt the original purpose of the patch was to partially remedy
> the situation that during the failure of apply, the apply process keeps going
> into the infinite error loop.

I agree.

> I'd say that in this sense, if we include such resource errors, we fail to achieve
> the purpose in some cases, because of some left possibilities of infinite loop.
> Disabling the subscription with even one any error excludes this irregular possibility,
> since there's no room to continue the infinite loop.

I don't think there is any right answer here.  It's a question of policy preferences.

My concern about disabling a subscription in response to *any* error is that people may find the feature does more harm
thangood.  Disabling the subscription in response to an occasional deadlock against other database users, or occasional
resourcepressure, might annoy people and lead to the feature simply not being used. 

I am happy to defer to your policy preference.  Thanks for your work on the patch!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Tue, Dec 7, 2021 at 3:06 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> My concern about disabling a subscription in response to *any* error is that people may find the feature does more
harmthan good.  Disabling the subscription in response to an occasional deadlock against other database users, or
occasionalresource pressure, might annoy people and lead to the feature simply not being used. 
>
I can understand this point of view.
It kind of suggests to me the possibility of something like a
configurable timeout (e.g. disable the subscription if the same error
has occurred for more than X minutes) or, similarly, perhaps if some
threshold has been reached (e.g. same error has occurred more than X
times), but I think that this was previously suggested by Peter Smith
and the idea wasn't looked upon all that favorably?

Regards,
Greg Nancarrow
Fujitsu Australia



Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Tue, Dec 7, 2021 at 5:52 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Tue, Dec 7, 2021 at 3:06 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> >
> > My concern about disabling a subscription in response to *any* error is that people may find the feature does more
harmthan good.  Disabling the subscription in response to an occasional deadlock against other database users, or
occasionalresource pressure, might annoy people and lead to the feature simply not being used. 
> >
> I can understand this point of view.
> It kind of suggests to me the possibility of something like a
> configurable timeout (e.g. disable the subscription if the same error
> has occurred for more than X minutes) or, similarly, perhaps if some
> threshold has been reached (e.g. same error has occurred more than X
> times), but I think that this was previously suggested by Peter Smith
> and the idea wasn't looked upon all that favorably?
>

I think if we are really worried about transient errors then probably
the idea "disable only if the same error has occurred more than X
times" seems preferable as compared to taking a decision on which
error_codes fall in the transient error category.

--
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Mark Dilger
Date:

> On Dec 8, 2021, at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think if we are really worried about transient errors then probably
> the idea "disable only if the same error has occurred more than X
> times" seems preferable as compared to taking a decision on which
> error_codes fall in the transient error category.

No need.  We can revisit this design decision in a later release cycle if the current patch's design proves problematic
inthe field. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Wed, Dec 8, 2021 at 9:22 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
> > On Dec 8, 2021, at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think if we are really worried about transient errors then probably
> > the idea "disable only if the same error has occurred more than X
> > times" seems preferable as compared to taking a decision on which
> > error_codes fall in the transient error category.
>
> No need.  We can revisit this design decision in a later release cycle if the current patch's design proves
problematicin the field.
 
>

So, do you agree that we can disable the subscription on any error if
this parameter is set?

-- 
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Mark Dilger
Date:

> On Dec 8, 2021, at 8:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> So, do you agree that we can disable the subscription on any error if
> this parameter is set?

Yes, I think that is fine.  We can commit it that way, and revisit the issue for v16 if it becomes a problem in
practice.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






On Mon, Dec 6, 2021 at 4:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, December 6, 2021 1:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > On Sat, Dec 4, 2021 at 12:20 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Hi, I've made a new patch v11 that incorporated suggestions described
> > above.
> > >
> >
> > Some review comments for the v11 patch:
> Thank you for your reviews !
>
> > doc/src/sgml/ref/create_subscription.sgml
> > (1) Possible wording improvement?
> >
> > BEFORE:
> > +  Specifies whether the subscription should be automatically disabled
> > + if replicating data from the publisher triggers errors. The default
> > + is <literal>false</literal>.
> > AFTER:
> > +  Specifies whether the subscription should be automatically disabled
> > + if any errors are detected by subscription workers during data
> > + replication from the publisher. The default is <literal>false</literal>.
> Fixed.
>
> > src/backend/replication/logical/worker.c
> > (2) WorkerErrorRecovery comments
> > Instead of:
> >
> > + * As a preparation for disabling the subscription, emit the error,
> > + * handle the transaction and clean up the memory context of
> > + * error. ErrorContext is reset by FlushErrorState.
> >
> > why not just say:
> >
> > + Worker error recovery processing, in preparation for disabling the
> > + subscription.
> >
> > And then comment the function's code lines:
> >
> > e.g.
> >
> > /* Emit the error */
> > ...
> > /* Abort any active transaction */
> > ...
> > /* Reset the ErrorContext */
> > ...
> Agreed. Fixed.
>
> > (3) DisableSubscriptionOnError return
> >
> > The "if (!subform->subdisableonerr)" block should probably first:
> >    heap_freetuple(tup);
> >
> > (regardless of the fact the only current caller will proc_exit anyway)
> Fixed.
>
> > (4) did_error flag
> >
> > I think perhaps the previously-used flag name "disable_subscription"
> > is better, or maybe "error_recovery_done".
> > Also, I think it would look better if it was set AFTER
> > WorkerErrorRecovery() was called.
> Adopted error_recovery_done
> and changed its places accordingly.
>
> > (5) DisableSubscriptionOnError LOG message
> >
> > This version of the patch removes the LOG message:
> >
> > + ereport(LOG,
> > + errmsg("logical replication subscription \"%s\" will be disabled due
> > to error: %s",
> > +    MySubscription->name, edata->message));
> >
> > Perhaps a similar error message could be logged prior to EmitErrorReport()?
> >
> > e.g.
> >  "logical replication subscription \"%s\" will be disabled due to an error"
> Added.
>
> I've attached the new version v12.

Thanks for the updated patch, few comments:
1) This is not required as it is not used in the caller.
+++ b/src/backend/replication/logical/launcher.c
@@ -132,6 +132,7 @@ get_subscription_list(void)
                sub->dbid = subform->subdbid;
                sub->owner = subform->subowner;
                sub->enabled = subform->subenabled;
+               sub->disableonerr = subform->subdisableonerr;
                sub->name = pstrdup(NameStr(subform->subname));
                /* We don't fill fields we are not interested in. */

2) Should this be changed:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription
+       worker detects any errors
+      </para></entry>
+     </row>
To:
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled when subscription's
+       worker detects any errors
+      </para></entry>
+     </row>

3) The last line can be slightly adjusted to keep within 80 chars:
+          Specifies whether the subscription should be automatically disabled
+          if any errors are detected by subscription workers during data
+          replication from the publisher. The default is
<literal>false</literal>.

4) Similarly this too can be handled:
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1259,7 +1259,7 @@ REVOKE ALL ON pg_replication_origin_status FROM public;
 -- All columns of pg_subscription except subconninfo are publicly readable.
 REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (oid, subdbid, subname, subowner, subenabled, subbinary,
-              substream, subtwophasestate, subslotname,
subsynccommit, subpublications)
+              substream, subtwophasestate, subdisableonerr,
subslotname, subsynccommit, subpublications)
     ON pg_subscription TO public;

5) Since disabling subscription code is common in and else, can we
move it below:
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                       {
+                               /*
+                                * Some work in error recovery work is
done. Switch to the old
+                                * memory context and rethrow.
+                                */
+                               MemoryContextSwitchTo(ecxt);
+                               PG_RE_THROW();
+                       }
+               }
+               else
+               {
+                       /*
+                        * Don't miss any error, even when it's not
reported to stats
+                        * collector.
+                        */
+                       if (MySubscription->disableonerr)
+                       {
+                               WorkerErrorRecovery();
+                               error_recovery_done = true;
+                       }
+                       else
+                               /* Simply rethrow because of no recovery work */
+                               PG_RE_THROW();
+               }

6) Can we move LockSharedObject below the if condition.
+       subform = (Form_pg_subscription) GETSTRUCT(tup);
+       LockSharedObject(SubscriptionRelationId, subform->oid, 0,
AccessExclusiveLock);
+
+       /*
+        * We would not be here unless this subscription's
disableonerr field was
+        * true when our worker began applying changes, but check whether that
+        * field has changed in the interim.
+        */
+       if (!subform->subdisableonerr)
+       {
+               /*
+                * Disabling the subscription has been done already. No need of
+                * additional work.
+                */
+               heap_freetuple(tup);
+               table_close(rel, RowExclusiveLock);
+               CommitTransactionCommand();
+               return;
+       }
+

Regards,
Vignesh



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, December 13, 2021 6:57 PM vignesh C <vignesh21@gmail.com> wrote:
> On Mon, Dec 6, 2021 at 4:22 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > I've attached the new version v12.
I appreciate your review.


> Thanks for the updated patch, few comments:
> 1) This is not required as it is not used in the caller.
> +++ b/src/backend/replication/logical/launcher.c
> @@ -132,6 +132,7 @@ get_subscription_list(void)
>                 sub->dbid = subform->subdbid;
>                 sub->owner = subform->subowner;
>                 sub->enabled = subform->subenabled;
> +               sub->disableonerr = subform->subdisableonerr;
>                 sub->name = pstrdup(NameStr(subform->subname));
>                 /* We don't fill fields we are not interested in. */
Okay.
The comment of the get_subscription_list() mentions that
we collect and fill only fields related to worker start/stop.
Then, I didn't need it. Fixed.


> 2) Should this be changed:
> +       <structfield>subdisableonerr</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       If true, the subscription will be disabled when subscription
> +       worker detects any errors
> +      </para></entry>
> +     </row>
> To:
> +       <structfield>subdisableonerr</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       If true, the subscription will be disabled when subscription's
> +       worker detects any errors
> +      </para></entry>
> +     </row>
I felt either is fine. So fixed.


> 3) The last line can be slightly adjusted to keep within 80 chars:
> +          Specifies whether the subscription should be automatically disabled
> +          if any errors are detected by subscription workers during data
> +          replication from the publisher. The default is
> <literal>false</literal>.
Fixed.

> 4) Similarly this too can be handled:
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1259,7 +1259,7 @@ REVOKE ALL ON pg_replication_origin_status FROM
> public;
>  -- All columns of pg_subscription except subconninfo are publicly readable.
>  REVOKE ALL ON pg_subscription FROM public;  GRANT SELECT (oid,
> subdbid, subname, subowner, subenabled, subbinary,
> -              substream, subtwophasestate, subslotname,
> subsynccommit, subpublications)
> +              substream, subtwophasestate, subdisableonerr,
> subslotname, subsynccommit, subpublications)
>      ON pg_subscription TO public;
I split the line into two to make each line less than 80 chars.

> 5) Since disabling subscription code is common in and else, can we move it
> below:
> +                       if (MySubscription->disableonerr)
> +                       {
> +                               WorkerErrorRecovery();
> +                               error_recovery_done = true;
> +                       }
> +                       else
> +                       {
> +                               /*
> +                                * Some work in error recovery work is
> done. Switch to the old
> +                                * memory context and rethrow.
> +                                */
> +                               MemoryContextSwitchTo(ecxt);
> +                               PG_RE_THROW();
> +                       }
> +               }
> +               else
> +               {
> +                       /*
> +                        * Don't miss any error, even when it's not
> reported to stats
> +                        * collector.
> +                        */
> +                       if (MySubscription->disableonerr)
> +                       {
> +                               WorkerErrorRecovery();
> +                               error_recovery_done = true;
> +                       }
> +                       else
> +                               /* Simply rethrow because of no recovery
> work */
> +                               PG_RE_THROW();
> +               }
I moved the common code below those condition branches.


> 6) Can we move LockSharedObject below the if condition.
> +       subform = (Form_pg_subscription) GETSTRUCT(tup);
> +       LockSharedObject(SubscriptionRelationId, subform->oid, 0,
> AccessExclusiveLock);
> +
> +       /*
> +        * We would not be here unless this subscription's
> disableonerr field was
> +        * true when our worker began applying changes, but check whether
> that
> +        * field has changed in the interim.
> +        */
> +       if (!subform->subdisableonerr)
> +       {
> +               /*
> +                * Disabling the subscription has been done already. No need
> of
> +                * additional work.
> +                */
> +               heap_freetuple(tup);
> +               table_close(rel, RowExclusiveLock);
> +               CommitTransactionCommand();
> +               return;
> +       }
> +
Fixed.

Besides all of those changes, I've removed the obsolete
comment of DisableSubscriptionOnError in v12.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Greg Nancarrow
Date:
On Tue, Dec 14, 2021 at 4:34 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Besides all of those changes, I've removed the obsolete
> comment of DisableSubscriptionOnError in v12.
>

I have a few minor comments, otherwise the patch LGTM at this point:

doc/src/sgml/catalogs.sgml
(1)
Current comment says:

+       If true, the subscription will be disabled when subscription's
+       worker detects any errors

However, in create_subscription.sgml, it says "disabled if any errors
are detected by subscription workers ..."

For consistency, I think it should be:

+       If true, the subscription will be disabled when subscription
+       workers detect any errors

src/bin/psql/describe.c
(2)
I think that:

+  gettext_noop("Disable On Error"));

should be:

+  gettext_noop("Disable on error"));

for consistency with the uppercase/lowercase usage on other similar entries?
(e.g. "Two phase commit")


src/include/catalog/pg_subscription.h
(3)

+  bool subdisableonerr; /* True if apply errors should disable the
+                                       * subscription upon error */

The comment should just say "True if occurrence of apply errors should
disable the subscription"


Regards,
Greg Nancarrow
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, December 16, 2021 2:32 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> On Tue, Dec 14, 2021 at 4:34 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Besides all of those changes, I've removed the obsolete comment of
> > DisableSubscriptionOnError in v12.
> >
> 
> I have a few minor comments, otherwise the patch LGTM at this point:
Thank you for your review !

> doc/src/sgml/catalogs.sgml
> (1)
> Current comment says:
> 
> +       If true, the subscription will be disabled when subscription's
> +       worker detects any errors
> 
> However, in create_subscription.sgml, it says "disabled if any errors are
> detected by subscription workers ..."
> 
> For consistency, I think it should be:
> 
> +       If true, the subscription will be disabled when subscription
> +       workers detect any errors
Okay. Fixed.
 
> src/bin/psql/describe.c
> (2)
> I think that:
> 
> +  gettext_noop("Disable On Error"));
> 
> should be:
> 
> +  gettext_noop("Disable on error"));
> 
> for consistency with the uppercase/lowercase usage on other similar entries?
> (e.g. "Two phase commit")
Agreed. Fixed.

> src/include/catalog/pg_subscription.h
> (3)
> 
> +  bool subdisableonerr; /* True if apply errors should disable the
> +                                       * subscription upon error */
> 
> The comment should just say "True if occurrence of apply errors should disable
> the subscription"
Fixed.

Attached the updated patch v14.

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"wangw.fnst@fujitsu.com"
Date:
On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> Attached the updated patch v14.

A comment to the timing of printing a log:
After the log[1] was printed, I altered subscription's option
(DISABLE_ON_ERROR) from true to false before invoking DisableSubscriptionOnError
to disable subscription. Subscription was not disabled.
[1] "LOG:  logical replication subscription "sub1" will be disabled due to an error"

I found this log is printed in function WorkerErrorRecovery:
+    ereport(LOG,
+            errmsg("logical replication subscription \"%s\" will be disabled due to an error",
+                   MySubscription->name));
This log is printed here, but in DisableSubscriptionOnError, there is a check to
confirm subscription's disableonerr field. If disableonerr is found changed from
true to false in DisableSubscriptionOnError, subscription will not be disabled.

In this case, "disable subscription" is printed, but subscription will not be
disabled actually.
I think it is a little confused to user, so what about moving this message after
the check which is mentioned above in DisableSubscriptionOnError?


Regards,
Wang wei

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威 <wangw.fnst@fujitsu.com> wrote:
> On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > Attached the updated patch v14.
> 
> A comment to the timing of printing a log:
Thank you for your review !

> After the log[1] was printed, I altered subscription's option
> (DISABLE_ON_ERROR) from true to false before invoking
> DisableSubscriptionOnError to disable subscription. Subscription was not
> disabled.
> [1] "LOG:  logical replication subscription "sub1" will be disabled due to an
> error"
> 
> I found this log is printed in function WorkerErrorRecovery:
> +    ereport(LOG,
> +            errmsg("logical replication subscription \"%s\" will
> be disabled due to an error",
> +                   MySubscription->name));
> This log is printed here, but in DisableSubscriptionOnError, there is a check to
> confirm subscription's disableonerr field. If disableonerr is found changed from
> true to false in DisableSubscriptionOnError, subscription will not be disabled.
> 
> In this case, "disable subscription" is printed, but subscription will not be
> disabled actually.
> I think it is a little confused to user, so what about moving this message after
> the check which is mentioned above in DisableSubscriptionOnError?
Makes sense. I moved the log print after
the check of the necessity to disable the subscription.

Also, I've scrutinized and refined the new TAP test as well for refactoring.
As a result, I fixed wait_for_subscriptions()
so that some extra codes that can be simplified,
such as escaped variable and one part of WHERE clause, are removed.
Other change I did is to replace two calls of wait_for_subscriptions()
with polling_query_until() for the subscriber, in order to
make the tests better and more suitable for the test purposes.
Again, for this refinement, I've conducted a tight loop test
to check no timing issue and found no problem.

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"tanghy.fnst@fujitsu.com"
Date:
On Wednesday, January 5, 2022 8:53 PM osumi.takamichi@fujitsu.com 
<osumi.takamichi@fujitsu.com> wrote:
> 
> On Tuesday, December 28, 2021 11:53 AM Wang, Wei/王 威
> <wangw.fnst@fujitsu.com> wrote:
> > On Thursday, December 16, 2021 8:51 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > > Attached the updated patch v14.
> >
> > A comment to the timing of printing a log:
> Thank you for your review !
> 
> > After the log[1] was printed, I altered subscription's option
> > (DISABLE_ON_ERROR) from true to false before invoking
> > DisableSubscriptionOnError to disable subscription. Subscription was not
> > disabled.
> > [1] "LOG:  logical replication subscription "sub1" will be disabled due to an
> > error"
> >
> > I found this log is printed in function WorkerErrorRecovery:
> > +    ereport(LOG,
> > +            errmsg("logical replication subscription \"%s\" will
> > be disabled due to an error",
> > +                   MySubscription->name));
> > This log is printed here, but in DisableSubscriptionOnError, there is a check to
> > confirm subscription's disableonerr field. If disableonerr is found changed from
> > true to false in DisableSubscriptionOnError, subscription will not be disabled.
> >
> > In this case, "disable subscription" is printed, but subscription will not be
> > disabled actually.
> > I think it is a little confused to user, so what about moving this message after
> > the check which is mentioned above in DisableSubscriptionOnError?
> Makes sense. I moved the log print after
> the check of the necessity to disable the subscription.
> 
> Also, I've scrutinized and refined the new TAP test as well for refactoring.
> As a result, I fixed wait_for_subscriptions()
> so that some extra codes that can be simplified,
> such as escaped variable and one part of WHERE clause, are removed.
> Other change I did is to replace two calls of wait_for_subscriptions()
> with polling_query_until() for the subscriber, in order to
> make the tests better and more suitable for the test purposes.
> Again, for this refinement, I've conducted a tight loop test
> to check no timing issue and found no problem.
> 

Thanks for updating the patch. Here are some comments:

1)
+    /*
+     * We would not be here unless this subscription's disableonerr field was
+     * true when our worker began applying changes, but check whether that
+     * field has changed in the interim.
+     */
+    if (!subform->subdisableonerr)
+    {
+        /*
+         * Disabling the subscription has been done already. No need of
+         * additional work.
+         */
+        heap_freetuple(tup);
+        table_close(rel, RowExclusiveLock);
+        CommitTransactionCommand();
+        return;
+    }

I don't understand what does "Disabling the subscription has been done already"
mean, I think we only run here when subdisableonerr is changed in the interim.
Should we modify this comment? Or remove it because there are already some
explanations before.

2)
+    /* Set the subscription to disabled, and note the reason. */
+    values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(false);
+    replaces[Anum_pg_subscription_subenabled - 1] = true;

I didn't see the code corresponding to "note the reason". Should we modify the
comment?

3)
+    bool        disableonerr;    /* Whether errors automatically disable */

This comment is hard to understand. Maybe it can be changed to:

Indicates if the subscription should be automatically disabled when subscription
workers detect any errors.

Regards,
Tang

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, January 6, 2022 12:17 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> Thanks for updating the patch. Here are some comments:
Thank you for your review !

> 1)
> +    /*
> +     * We would not be here unless this subscription's disableonerr field
> was
> +     * true when our worker began applying changes, but check whether
> that
> +     * field has changed in the interim.
> +     */
> +    if (!subform->subdisableonerr)
> +    {
> +        /*
> +         * Disabling the subscription has been done already. No need
> of
> +         * additional work.
> +         */
> +        heap_freetuple(tup);
> +        table_close(rel, RowExclusiveLock);
> +        CommitTransactionCommand();
> +        return;
> +    }
> 
> I don't understand what does "Disabling the subscription has been done
> already"
> mean, I think we only run here when subdisableonerr is changed in the interim.
> Should we modify this comment? Or remove it because there are already some
> explanations before.
Removed. The description you pointed out was redundant.

> 2)
> +    /* Set the subscription to disabled, and note the reason. */
> +    values[Anum_pg_subscription_subenabled - 1] =
> BoolGetDatum(false);
> +    replaces[Anum_pg_subscription_subenabled - 1] = true;
> 
> I didn't see the code corresponding to "note the reason". Should we modify the
> comment?
Fixed the comment by removing the part.
We come here when an error occurred and the reason is printed as log
so no need to note more reason.

> 3)
> +    bool        disableonerr;    /* Whether errors automatically
> disable */
> 
> This comment is hard to understand. Maybe it can be changed to:
> 
> Indicates if the subscription should be automatically disabled when
> subscription workers detect any errors.
Agreed. Fixed.

Kindly have a look at the attached v16.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Thu, Jan 6, 2022 at 11:23 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Kindly have a look at the attached v16.
>

Few comments:
=============
1.
@@ -3594,13 +3698,29 @@ ApplyWorkerMain(Datum main_arg)
    apply_error_callback_arg.command,
    apply_error_callback_arg.remote_xid,
    errdata->message);
- MemoryContextSwitchTo(ecxt);
+
+ if (!MySubscription->disableonerr)
+ {
+ /*
+ * Some work in error recovery work is done. Switch to the old
+ * memory context and rethrow.
+ */
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
  }
+ else if (!MySubscription->disableonerr)
+ PG_RE_THROW();

- PG_RE_THROW();

Can't we combine these two different checks for
'MySubscription->disableonerr' if you do it as a separate if check
after sending the stats message?

2. Can we move the code related to tablesync worker and its error
handing (the code insider if (am_tablesync_worker())) to a separate
function say LogicalRepHandleTableSync() or something like that.

3. Similarly, we can move apply-loop related code ("Run the main
loop.") to a separate function say LogicalRepHandleApplyMessages().

If we do (2) and (3), I think the code in ApplyWorkerMain will look
better. What do you think?

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, February 14, 2022 8:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 6, 2022 at 11:23 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Kindly have a look at the attached v16.
> >
> 
> Few comments:
Hi, thank you for checking the patch !

> =============
> 1.
> @@ -3594,13 +3698,29 @@ ApplyWorkerMain(Datum main_arg)
>     apply_error_callback_arg.command,
>     apply_error_callback_arg.remote_xid,
>     errdata->message);
> - MemoryContextSwitchTo(ecxt);
> +
> + if (!MySubscription->disableonerr)
> + {
> + /*
> + * Some work in error recovery work is done. Switch to the old
> + * memory context and rethrow.
> + */
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> + }
>   }
> + else if (!MySubscription->disableonerr) PG_RE_THROW();
> 
> - PG_RE_THROW();
> 
> Can't we combine these two different checks for
> 'MySubscription->disableonerr' if you do it as a separate if check after sending
> the stats message?
No, we can't. The second check of MySubscription->disableonerr is for the case
apply_error_callback_arg.command equals 0. We disable the subscription
on any errors. In other words, we need to rethrow the error in the case,
if the flag disableonerr is not set to true.

So, moving it to after sending
the stats message can't be done. At the same time, if we move
the disableonerr flag check outside of the apply_error_callback_arg.command condition
branch, we need to write another call of pgstat_report_subworker_error, with the
same arguments that we have now. This wouldn't be preferrable as well.

> 
> 2. Can we move the code related to tablesync worker and its error handing (the
> code insider if (am_tablesync_worker())) to a separate function say
> LogicalRepHandleTableSync() or something like that.
> 
> 3. Similarly, we can move apply-loop related code ("Run the main
> loop.") to a separate function say LogicalRepHandleApplyMessages().
> 
> If we do (2) and (3), I think the code in ApplyWorkerMain will look better. What
> do you think?
I agree with (2) and (3), since those contribute to better readability.

Attached a new patch v17 that addresses those refactorings.


Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, February 15, 2022 2:19 PM I wrote
> On Monday, February 14, 2022 8:58 PM Amit Kapila
> > 2. Can we move the code related to tablesync worker and its error
> > handing (the code insider if (am_tablesync_worker())) to a separate
> > function say
> > LogicalRepHandleTableSync() or something like that.
> >
> > 3. Similarly, we can move apply-loop related code ("Run the main
> > loop.") to a separate function say LogicalRepHandleApplyMessages().
> >
> > If we do (2) and (3), I think the code in ApplyWorkerMain will look
> > better. What do you think?
> I agree with (2) and (3), since those contribute to better readability.
>
> Attached a new patch v17 that addresses those refactorings.
Hi, I noticed that one new tap test was added in the src/test/subscription/
and needed to increment the number of my test of this patch.

Also, I conducted minor fixes of comments and function name.
Kindly have a look at the attached v18.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Peter Smith
Date:
Hi. Below are my code review comments for v18.

==========

1. Commit Message - wording

BEFORE
To partially remedy the situation, adding a new subscription_parameter
named 'disable_on_error'.

AFTER
To partially remedy the situation, this patch adds a new
subscription_parameter named 'disable_on_error'.

~~~

2. Commit message - wording

BEFORE
Require to bump catalog version.

AFTER
A catalog version bump is required.

~~~

3. doc/src/sgml/ref/alter_subscription.sgml - whitespace

@@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
       information.  The parameters that can be altered
       are <literal>slot_name</literal>,
       <literal>synchronous_commit</literal>,
-      <literal>binary</literal>, and
-      <literal>streaming</literal>.
+      <literal>binary</literal>,<literal>streaming</literal>, and
+      <literal>disable_on_error</literal>.
      </para>

There is a missing space before <literal>streaming</literal>.

~~~

4. src/backend/replication/logical/worker.c - WorkerErrorRecovery

@@ -2802,6 +2803,89 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 }

 /*
+ * Worker error recovery processing, in preparation for disabling the
+ * subscription.
+ */
+static void
+WorkerErrorRecovery(void)

I was wondering about the need for this to be a separate function? It
is only called immediately before calling 'DisableSubscriptionOnError'
so would it maybe be better just to put this code inside
DisableSubscriptionOnError with the appropriate comments?

~~~

5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true when our worker began applying changes, but check whether that
+ * field has changed in the interim.
+ */

Apparently, this function might just do nothing if it detects some
situation where the flag was changed somehow, but I’m not 100% sure
that the callers are properly catering for when nothing happens.

IMO it would be better if this function would return true/false to
mean "did disable subscription happen or not?" because that will give
the calling code the chance to check the function return and do the
right thing - e.g. if the caller first thought it should be disabled
but then it turned out it did NOT disable...

~~~

6. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync name

+/*
+ * Execute the initial sync with error handling. Disable the subscription,
+ * if it's required.
+ */
+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here
because it is a static function.

IMO this function should be renamed as 'SyncTableStartWrapper' because
that describes better what it is doing.

~~~

7. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync Assert

Even though we can know this to be true because of where it is called
from, I think the readability of the function will be improved if you
add an assertion at the top:

Assert(am_tablesync_worker());

And then, because the function is clearly for Tablesync worker only
there is no need to keep mentioning that in the subsequent comments...

e.g.1
/* This is table synchronization worker, call initial sync. */
AFTER:
/* Call initial sync. */

e.g.2
/*
 * Report the table sync error. There is no corresponding message type
 * for table synchronization.
 */
AFTER
/*
 * Report the error. There is no corresponding message type for table
 * synchronization.
 */

~~~

8. src/backend/replication/logical/worker.c -
LogicalRepHandleTableSync unnecessarily complex

+static void
+LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
+   char **myslotname, MemoryContext cctx)
+{
+ char    *syncslotname;
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the
whole thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

~~~

9. src/backend/replication/logical/worker.c - LogicalRepHandleApplyMessages name

+/*
+ * Run the apply loop with error handling. Disable the subscription,
+ * if necessary.
+ */
+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)

I felt that it is a bit overkill to put a "LogicalRep" prefix here
because it is a static function.

IMO this function should be renamed as 'ApplyLoopWrapper' because that
describes better what it is doing.

~~~

10. src/backend/replication/logical/worker.c -
LogicalRepHandleApplyMessages unnecessarily complex

+static void
+LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
+   MemoryContext cctx)
+{
+ bool error_recovery_done = false;

IMO this logic is way more complex than it needed to be. IIUC that
'error_recovery_done' and various conditions can be removed, and the
whole thing be simplified quite a lot.

I re-wrote this function as a POC. Please see the attached file [2].
All the tests are still passing OK.

(Perhaps the scenario for my comment #5 above still needs to be addressed?)

~~~

11. src/bin/pg_dump/pg_dump.c - dumpSubscription

@@ -4441,6 +4451,9 @@ dumpSubscription(Archive *fout, const
SubscriptionInfo *subinfo)
  if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
  appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = on");
+

I felt saying disable_on_err is "true" would look more natural than
saying it is "on".

~~~

12. src/bin/psql/describe.c - describeSubscriptions typo

@@ -6096,11 +6096,13 @@ describeSubscriptions(const char *pattern, bool verbose)
    gettext_noop("Binary"),
    gettext_noop("Streaming"));

- /* Two_phase is only supported in v15 and higher */
+ /* Two_phase and disable_on_error is only supported in v15 and higher */

Typo

"is only" --> "are only"

~~~

13. src/include/catalog/pg_subscription.h - comments

@@ -103,6 +106,9 @@ typedef struct Subscription
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char twophasestate; /* Allow streaming two-phase transactions */
+ bool disableonerr; /* Indicates if the subscription should be
+ * automatically disabled when subscription
+ * workers detect any errors. */

It's not usual to have a full stop here.
Maybe not needed to repeat the word "subscription".
IMO, generally, it all can be simplified a bit.

BEFORE
Indicates if the subscription should be automatically disabled when
subscription workers detect any errors.

AFTER
Indicates if the subscription should be automatically disabled if a
worker error occurs

~~~

14. src/test/regress/sql/subscription.sql - missing test case.

The "conflicting options" error from the below code is not currently
being tested.

@@ -249,6 +253,15 @@ parse_subscription_options(ParseState *pstate,
List *stmt_options,
  opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
  opts->twophase = defGetBoolean(defel);
  }
+ else if (IsSet(supported_opts, SUBOPT_DISABLE_ON_ERR) &&
+ strcmp(defel->defname, "disable_on_error") == 0)
+ {
+ if (IsSet(opts->specified_opts, SUBOPT_DISABLE_ON_ERR))
+ errorConflictingDefElem(defel, pstate);

~~~

15. src/test/subscription/t/028_disable_on_error.pl - 028 clash

Just a heads-up that this 028 is going to clash with the Row-Filter
patch 028 which has been announced to be pushed soon, so be prepared
to change this number again shortly :)

~~~

16. src/test/subscription/t/028_disable_on_error.pl - done_testing

AFAIK is a new style now for the TAP tests where it uses
"done_testing();" instead of saying up-front how many tests there are.
See here [1].

~~~

17. src/test/subscription/t/028_disable_on_error.pl - more comments

+# Create an additional unique index in schema s1 on the subscriber only.  When
+# we create subscriptions, below, this should cause subscription "s1" on the
+# subscriber to fail during initial synchronization and to get automatically
+# disabled.

I felt it could be made a bit more obvious upfront in a comment that 2
pairs of pub/sub will be created, and their names will same as the
schemas:
e.g.
Publisher "s1" --> Subscriber "s1"
Publisher "s2" --> Subscriber "s2"

~~~

18. src/test/subscription/t/028_disable_on_error.pl - ALTER tests?

The tests here are only using the hardwired 'disable_on_error' options
set at CREATE SUBSCRIPTION time. There are no TAP tests for changing
the disable_on_error using ALTER SUBSCRIPTION.

Should there be?

------
[1] https://github.com/postgres/postgres/commit/549ec201d6132b7c7ee11ee90a4e02119259ba5b
[2] worker.c.peter.txt is same as your v18 worker.c but I re-wrote
functions LogicalRepHandleTableSync and LogicalRepHandleApplyMessages
as POC

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Hi. Below are my code review comments for v18.
Thank you for your review !

> ==========
> 
> 1. Commit Message - wording
> 
> BEFORE
> To partially remedy the situation, adding a new subscription_parameter named
> 'disable_on_error'.
> 
> AFTER
> To partially remedy the situation, this patch adds a new
> subscription_parameter named 'disable_on_error'.
Fixed.

> ~~~
> 
> 2. Commit message - wording
> 
> BEFORE
> Require to bump catalog version.
> 
> AFTER
> A catalog version bump is required.
Fixed.

> ~~~
> 
> 3. doc/src/sgml/ref/alter_subscription.sgml - whitespace
> 
> @@ -201,8 +201,8 @@ ALTER SUBSCRIPTION <replaceable
> class="parameter">name</replaceable> RENAME TO <
>        information.  The parameters that can be altered
>        are <literal>slot_name</literal>,
>        <literal>synchronous_commit</literal>,
> -      <literal>binary</literal>, and
> -      <literal>streaming</literal>.
> +      <literal>binary</literal>,<literal>streaming</literal>, and
> +      <literal>disable_on_error</literal>.
>       </para>
> 
> There is a missing space before <literal>streaming</literal>.
Fixed. 



> ~~~
> 
> 4. src/backend/replication/logical/worker.c - WorkerErrorRecovery
> 
> @@ -2802,6 +2803,89 @@ LogicalRepApplyLoop(XLogRecPtr
> last_received)  }
> 
>  /*
> + * Worker error recovery processing, in preparation for disabling the
> + * subscription.
> + */
> +static void
> +WorkerErrorRecovery(void)
> 
> I was wondering about the need for this to be a separate function? It is only
> called immediately before calling 'DisableSubscriptionOnError'
> so would it maybe be better just to put this code inside
> DisableSubscriptionOnError with the appropriate comments?
I preferred to have one specific for error handling,
because from caller sides, when we catch error, it's apparent
that error recovery is done. But, the function name "DisableSubscriptionOnError"
by itself should have the nuance that we do something on error.
So, we can think that it's okay to have error recovery processing
in this function.

So, I removed the function and fixed some related comments.


> ~~~
> 
> 5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError
> 
> + /*
> + * We would not be here unless this subscription's disableonerr field
> + was
> + * true when our worker began applying changes, but check whether that
> + * field has changed in the interim.
> + */
> 
> Apparently, this function might just do nothing if it detects some situation
> where the flag was changed somehow, but I'm not 100% sure that the callers
> are properly catering for when nothing happens.
> 
> IMO it would be better if this function would return true/false to mean "did
> disable subscription happen or not?" because that will give the calling code the
> chance to check the function return and do the right thing - e.g. if the caller first
> thought it should be disabled but then it turned out it did NOT disable...
I don't think we need to do something more.
After this function, table sync worker and the apply worker
just exit. IMO, we don't need to do additional work for
already-disabled subscription on the caller sides.
It should be sufficient to fulfill the purpose of
DisableSubscriptionOnError or confirm it has been fulfilled.


> ~~~
> 
> 6. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
> name
> 
> +/*
> + * Execute the initial sync with error handling. Disable the
> +subscription,
> + * if it's required.
> + */
> +static void
> +LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
> +   char **myslotname, MemoryContext cctx)
> 
> I felt that it is a bit overkill to put a "LogicalRep" prefix here because it is a static
> function.
> 
> IMO this function should be renamed as 'SyncTableStartWrapper' because that
> describes better what it is doing.
Makes sense. Fixed.


> ~~~
> 
> 7. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
> Assert
> 
> Even though we can know this to be true because of where it is called from, I
> think the readability of the function will be improved if you add an assertion at
> the top:
> 
> Assert(am_tablesync_worker());
Fixed.

> And then, because the function is clearly for Tablesync worker only there is no
> need to keep mentioning that in the subsequent comments...
> 
> e.g.1
> /* This is table synchronization worker, call initial sync. */
> AFTER:
> /* Call initial sync. */
Fixed.

> e.g.2
> /*
>  * Report the table sync error. There is no corresponding message type
>  * for table synchronization.
>  */
> AFTER
> /*
>  * Report the error. There is no corresponding message type for table
>  * synchronization.
>  */
Agreed. Fixed


> ~~~
> 
> 8. src/backend/replication/logical/worker.c - LogicalRepHandleTableSync
> unnecessarily complex
> 
> +static void
> +LogicalRepHandleTableSync(XLogRecPtr *origin_startpos,
> +   char **myslotname, MemoryContext cctx) {
> + char    *syncslotname;
> + bool error_recovery_done = false;
> 
> IMO this logic is way more complex than it needed to be. IIUC that
> 'error_recovery_done' and various conditions can be removed, and the whole
> thing be simplified quite a lot.
> 
> I re-wrote this function as a POC. Please see the attached file [2].
> All the tests are still passing OK.
> 
> (Perhaps the scenario for my comment #5 above still needs to be addressed?)
Removed the 'error_recovery_done' flag and fixed.



 
> ~~~
> 
> 9. src/backend/replication/logical/worker.c -
> LogicalRepHandleApplyMessages name
> 
> +/*
> + * Run the apply loop with error handling. Disable the subscription,
> + * if necessary.
> + */
> +static void
> +LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
> +   MemoryContext cctx)
> 
> I felt that it is a bit overkill to put a "LogicalRep" prefix here because it is a static
> function.
> 
> IMO this function should be renamed as 'ApplyLoopWrapper' because that
> describes better what it is doing.
Fixed.


> ~~~
> 
> 10. src/backend/replication/logical/worker.c -
> LogicalRepHandleApplyMessages unnecessarily complex
> 
> +static void
> +LogicalRepHandleApplyMessages(XLogRecPtr origin_startpos,
> +   MemoryContext cctx)
> +{
> + bool error_recovery_done = false;
> 
> IMO this logic is way more complex than it needed to be. IIUC that
> 'error_recovery_done' and various conditions can be removed, and the whole
> thing be simplified quite a lot.
> 
> I re-wrote this function as a POC. Please see the attached file [2].
> All the tests are still passing OK.
> 
> (Perhaps the scenario for my comment #5 above still needs to be addressed?)
Fixed.


> ~~~
> 
> 11. src/bin/pg_dump/pg_dump.c - dumpSubscription
> 
> @@ -4441,6 +4451,9 @@ dumpSubscription(Archive *fout, const
> SubscriptionInfo *subinfo)
>   if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
>   appendPQExpBufferStr(query, ", two_phase = on");
> 
> + if (strcmp(subinfo->subdisableonerr, "f") != 0)
> + appendPQExpBufferStr(query, ", disable_on_error = on");
> +
> 
> I felt saying disable_on_err is "true" would look more natural than saying it is
> "on".
Fixed.


> ~~~
> 
> 12. src/bin/psql/describe.c - describeSubscriptions typo
> 
> @@ -6096,11 +6096,13 @@ describeSubscriptions(const char *pattern, bool
> verbose)
>     gettext_noop("Binary"),
>     gettext_noop("Streaming"));
> 
> - /* Two_phase is only supported in v15 and higher */
> + /* Two_phase and disable_on_error is only supported in v15 and higher
> + */
> 
> Typo
> 
> "is only" --> "are only"
Fixed.


> ~~~
> 
> 13. src/include/catalog/pg_subscription.h - comments
> 
> @@ -103,6 +106,9 @@ typedef struct Subscription
>   * binary format */
>   bool stream; /* Allow streaming in-progress transactions. */
>   char twophasestate; /* Allow streaming two-phase transactions */
> + bool disableonerr; /* Indicates if the subscription should be
> + * automatically disabled when subscription
> + * workers detect any errors. */
> 
> It's not usual to have a full stop here.
> Maybe not needed to repeat the word "subscription".
> IMO, generally, it all can be simplified a bit.
> 
> BEFORE
> Indicates if the subscription should be automatically disabled when
> subscription workers detect any errors.
> 
> AFTER
> Indicates if the subscription should be automatically disabled if a worker error
> occurs
Fixed.


> ~~~
> 
> 14. src/test/regress/sql/subscription.sql - missing test case.
> 
> The "conflicting options" error from the below code is not currently being
> tested.
> 
> @@ -249,6 +253,15 @@ parse_subscription_options(ParseState *pstate, List
> *stmt_options,
>   opts->specified_opts |= SUBOPT_TWOPHASE_COMMIT;
>   opts->twophase = defGetBoolean(defel);
>   }
> + else if (IsSet(supported_opts, SUBOPT_DISABLE_ON_ERR) &&
> + strcmp(defel->defname, "disable_on_error") == 0) { if
> + (IsSet(opts->specified_opts, SUBOPT_DISABLE_ON_ERR))
> + errorConflictingDefElem(defel, pstate);
We don't have this test in other options as well.
So, this should be aligned.


> ~~~
> 
> 15. src/test/subscription/t/028_disable_on_error.pl - 028 clash
> 
> Just a heads-up that this 028 is going to clash with the Row-Filter patch 028
> which has been announced to be pushed soon, so be prepared to change this
> number again shortly :)
Thank you for letting me know.


> ~~~
> 
> 16. src/test/subscription/t/028_disable_on_error.pl - done_testing
> 
> AFAIK is a new style now for the TAP tests where it uses "done_testing();"
> instead of saying up-front how many tests there are.
> See here [1].
Fixed.

> ~~~
> 
> 17. src/test/subscription/t/028_disable_on_error.pl - more comments
> 
> +# Create an additional unique index in schema s1 on the subscriber
> +only.  When # we create subscriptions, below, this should cause
> +subscription "s1" on the # subscriber to fail during initial
> +synchronization and to get automatically # disabled.
> 
> I felt it could be made a bit more obvious upfront in a comment that 2 pairs of
> pub/sub will be created, and their names will same as the
> schemas:
> e.g.
> Publisher "s1" --> Subscriber "s1"
> Publisher "s2" --> Subscriber "s2"
Comments are fixed.


> ~~~
> 
> 18. src/test/subscription/t/028_disable_on_error.pl - ALTER tests?
> 
> The tests here are only using the hardwired 'disable_on_error' options set at
> CREATE SUBSCRIPTION time. There are no TAP tests for changing the
> disable_on_error using ALTER SUBSCRIPTION.
> 
> Should there be?
I don't think so. Toggling the flag 'disable_on_error' is already tested
in the subscription.sql file. Both new paths for table sync and apply
worker to disable on error are already covered.


FYI : I skipped one change of worker.c.peter.txt
about "enabled" flag, which is independent from
disable_on_error option.

Kindly have a look at the attached v19.


Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Peter Smith
Date:
Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > Hi. Below are my code review comments for v18.
> Thank you for your review !
...
> > 5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError
> >
> > + /*
> > + * We would not be here unless this subscription's disableonerr field
> > + was
> > + * true when our worker began applying changes, but check whether that
> > + * field has changed in the interim.
> > + */
> >
> > Apparently, this function might just do nothing if it detects some situation
> > where the flag was changed somehow, but I'm not 100% sure that the callers
> > are properly catering for when nothing happens.
> >
> > IMO it would be better if this function would return true/false to mean "did
> > disable subscription happen or not?" because that will give the calling code the
> > chance to check the function return and do the right thing - e.g. if the caller first
> > thought it should be disabled but then it turned out it did NOT disable...
> I don't think we need to do something more.
> After this function, table sync worker and the apply worker
> just exit. IMO, we don't need to do additional work for
> already-disabled subscription on the caller sides.
> It should be sufficient to fulfill the purpose of
> DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm -  Yeah, it may be the workers might just exit soon after anyhow
as you say so everything comes out in the wash, but still, I felt for
this case when DisableSubscriptionOnError turned out to do nothing it
would be better to exit via the existing logic. And that is easy to do
if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
return;

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError, Refactor error handling

(this comment assumes the above gets changed too)

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that
actually the 'disableonerr' flag is false (i.e. it's NOT going to
disable the subscription after all) then IMO it make more sense that
the error logging for that case should just do whatever it is doing
now by the normal error processing mechanism.

In other words, I thought perhaps the code to
EmitErrorReport/FlushError state etc be moved to be BELOW the if
(!subform->subdisableonerr) bail-out code?

Please see what you think in my attached POC [1]. It seems neater to
me, and tests are all OK. Maybe I am mistaken...

~~~

2. Commit message - wording

Logical replication apply workers for a subscription can easily get
stuck in an infinite loop of attempting to apply a change,
triggering an error (such as a constraint violation), exiting with
an error written to the subscription worker log, and restarting.

SUGGESTION
"exiting with an error written" --> "exiting with the error written"

------
[1] peter-v19-poc.diff - POC just to try some of my suggestions above
to make sure all tests still pass ok.

Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Thanks for addressing my previous comments. Now I have looked at v19.
> 
> On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Friday, February 18, 2022 3:27 PM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > > Hi. Below are my code review comments for v18.
> > Thank you for your review !
> ...
> > > 5. src/backend/replication/logical/worker.c -
> > > DisableSubscriptionOnError
> > >
> > > + /*
> > > + * We would not be here unless this subscription's disableonerr
> > > + field was
> > > + * true when our worker began applying changes, but check whether
> > > + that
> > > + * field has changed in the interim.
> > > + */
> > >
> > > Apparently, this function might just do nothing if it detects some
> > > situation where the flag was changed somehow, but I'm not 100% sure
> > > that the callers are properly catering for when nothing happens.
> > >
> > > IMO it would be better if this function would return true/false to
> > > mean "did disable subscription happen or not?" because that will
> > > give the calling code the chance to check the function return and do
> > > the right thing - e.g. if the caller first thought it should be disabled but then
> it turned out it did NOT disable...
> > I don't think we need to do something more.
> > After this function, table sync worker and the apply worker just exit.
> > IMO, we don't need to do additional work for already-disabled
> > subscription on the caller sides.
> > It should be sufficient to fulfill the purpose of
> > DisableSubscriptionOnError or confirm it has been fulfilled.
> 
> Hmmm -  Yeah, it may be the workers might just exit soon after anyhow as you
> say so everything comes out in the wash, but still, I felt for this case when
> DisableSubscriptionOnError turned out to do nothing it would be better to exit
> via the existing logic. And that is easy to do if the function returns true/false.
> 
> For example, changes like below seemed neater code to me. YMMV.
> 
> BEFORE (SyncTableStartWrapper):
> if (MySubscription->disableonerr)
> {
> DisableSubscriptionOnError();
> proc_exit(0);
> }
> AFTER
> if (MySubscription->disableonerr && DisableSubscriptionOnError())
> proc_exit(0);
> 
> BEFORE (ApplyLoopWrapper)
> if (MySubscription->disableonerr)
> {
> /* Disable the subscription */
> DisableSubscriptionOnError();
> return;
> }
> AFTER
> if (MySubscription->disableonerr && DisableSubscriptionOnError()) return;
Okay, so this return value works for better readability.
Fixed.

 
> ~~~
> 
> Here are a couple more comments:
> 
> 1. src/backend/replication/logical/worker.c - DisableSubscriptionOnError,
> Refactor error handling
> 
> (this comment assumes the above gets changed too)
I think those are independent.


> +static void
> +DisableSubscriptionOnError(void)
> +{
> + Relation rel;
> + bool nulls[Natts_pg_subscription];
> + bool replaces[Natts_pg_subscription];
> + Datum values[Natts_pg_subscription];
> + HeapTuple tup;
> + Form_pg_subscription subform;
> +
> + /* Emit the error */
> + EmitErrorReport();
> + /* Abort any active transaction */
> + AbortOutOfAnyTransaction();
> + /* Reset the ErrorContext */
> + FlushErrorState();
> +
> + /* Disable the subscription in a fresh transaction */
> + StartTransactionCommand();
> 
> If this DisableSubscriptionOnError function decides later that actually the
> 'disableonerr' flag is false (i.e. it's NOT going to disable the subscription after
> all) then IMO it make more sense that the error logging for that case should just
> do whatever it is doing now by the normal error processing mechanism.
> 
> In other words, I thought perhaps the code to EmitErrorReport/FlushError state
> etc be moved to be BELOW the if
> (!subform->subdisableonerr) bail-out code?
> 
> Please see what you think in my attached POC [1]. It seems neater to me, and
> tests are all OK. Maybe I am mistaken...
I had a concern that this order change of codes would have a negative
impact when we have another new error during the call of DisableSubscriptionOnError.

With the debugger, I raised an error in this function before emitting the original error.
As a result, the original error that makes the apply worker go into the path of
DisableSubscriptionOnError (in my test, duplication error) has vanished.
In this sense, v19 looks safer, and the current order to handle error recovery first
looks better to me.

FYI, after the 2nd debugger error,
the next new apply worker created quickly met the same type of error,
went into the same path, and disabled the subscription with the log.
But, it won't be advisable to let the possibility left.

> ~~~
> 
> 2. Commit message - wording
> 
> Logical replication apply workers for a subscription can easily get stuck in an
> infinite loop of attempting to apply a change, triggering an error (such as a
> constraint violation), exiting with an error written to the subscription worker log,
> and restarting.
> 
> SUGGESTION
> "exiting with an error written" --> "exiting with the error written"
Fixed.

 
> ------
> [1] peter-v19-poc.diff - POC just to try some of my suggestions above to make
> sure all tests still pass ok.
Thanks ! I included you as co-author, because
you shared meaningful patches for me.


Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Peter Smith
Date:
On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > Thanks for addressing my previous comments. Now I have looked at v19.
> >
> > On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Friday, February 18, 2022 3:27 PM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > > Hi. Below are my code review comments for v18.
> > > Thank you for your review !
> > ...
> > > > 5. src/backend/replication/logical/worker.c -
> > > > DisableSubscriptionOnError
> > > >
> > > > + /*
> > > > + * We would not be here unless this subscription's disableonerr
> > > > + field was
> > > > + * true when our worker began applying changes, but check whether
> > > > + that
> > > > + * field has changed in the interim.
> > > > + */
> > > >
> > > > Apparently, this function might just do nothing if it detects some
> > > > situation where the flag was changed somehow, but I'm not 100% sure
> > > > that the callers are properly catering for when nothing happens.
> > > >
> > > > IMO it would be better if this function would return true/false to
> > > > mean "did disable subscription happen or not?" because that will
> > > > give the calling code the chance to check the function return and do
> > > > the right thing - e.g. if the caller first thought it should be disabled but then
> > it turned out it did NOT disable...
> > > I don't think we need to do something more.
> > > After this function, table sync worker and the apply worker just exit.
> > > IMO, we don't need to do additional work for already-disabled
> > > subscription on the caller sides.
> > > It should be sufficient to fulfill the purpose of
> > > DisableSubscriptionOnError or confirm it has been fulfilled.
> >
> > Hmmm -  Yeah, it may be the workers might just exit soon after anyhow as you
> > say so everything comes out in the wash, but still, I felt for this case when
> > DisableSubscriptionOnError turned out to do nothing it would be better to exit
> > via the existing logic. And that is easy to do if the function returns true/false.
> >
> > For example, changes like below seemed neater code to me. YMMV.
> >
> > BEFORE (SyncTableStartWrapper):
> > if (MySubscription->disableonerr)
> > {
> > DisableSubscriptionOnError();
> > proc_exit(0);
> > }
> > AFTER
> > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > proc_exit(0);
> >
> > BEFORE (ApplyLoopWrapper)
> > if (MySubscription->disableonerr)
> > {
> > /* Disable the subscription */
> > DisableSubscriptionOnError();
> > return;
> > }
> > AFTER
> > if (MySubscription->disableonerr && DisableSubscriptionOnError()) return;
> Okay, so this return value works for better readability.
> Fixed.
>
>
> > ~~~
> >
> > Here are a couple more comments:
> >
> > 1. src/backend/replication/logical/worker.c - DisableSubscriptionOnError,
> > Refactor error handling
> >
> > (this comment assumes the above gets changed too)
> I think those are independent.

OK. I was only curious if the change #5 above might cause the error to
be logged 2x, if the DisableSubscriptionOnError returns false.
- firstly, when it logs errors within the function
- secondly, by normal error mechanism when the caller re-throws it.

But, if you are sure that won't happen then it is good news.

>
>
> > +static void
> > +DisableSubscriptionOnError(void)
> > +{
> > + Relation rel;
> > + bool nulls[Natts_pg_subscription];
> > + bool replaces[Natts_pg_subscription];
> > + Datum values[Natts_pg_subscription];
> > + HeapTuple tup;
> > + Form_pg_subscription subform;
> > +
> > + /* Emit the error */
> > + EmitErrorReport();
> > + /* Abort any active transaction */
> > + AbortOutOfAnyTransaction();
> > + /* Reset the ErrorContext */
> > + FlushErrorState();
> > +
> > + /* Disable the subscription in a fresh transaction */
> > + StartTransactionCommand();
> >
> > If this DisableSubscriptionOnError function decides later that actually the
> > 'disableonerr' flag is false (i.e. it's NOT going to disable the subscription after
> > all) then IMO it make more sense that the error logging for that case should just
> > do whatever it is doing now by the normal error processing mechanism.
> >
> > In other words, I thought perhaps the code to EmitErrorReport/FlushError state
> > etc be moved to be BELOW the if
> > (!subform->subdisableonerr) bail-out code?
> >
> > Please see what you think in my attached POC [1]. It seems neater to me, and
> > tests are all OK. Maybe I am mistaken...
> I had a concern that this order change of codes would have a negative
> impact when we have another new error during the call of DisableSubscriptionOnError.
>
> With the debugger, I raised an error in this function before emitting the original error.
> As a result, the original error that makes the apply worker go into the path of
> DisableSubscriptionOnError (in my test, duplication error) has vanished.
> In this sense, v19 looks safer, and the current order to handle error recovery first
> looks better to me.
>
> FYI, after the 2nd debugger error,
> the next new apply worker created quickly met the same type of error,
> went into the same path, and disabled the subscription with the log.
> But, it won't be advisable to let the possibility left.

OK - thanks for checking it.

Will it be better to put some comments about that? Something like --

BEFORE
/* Emit the error */
EmitErrorReport();
/* Abort any active transaction */
AbortOutOfAnyTransaction();
/* Reset the ErrorContext */
FlushErrorState();

/* Disable the subscription in a fresh transaction */
StartTransactionCommand();

AFTER
/* Disable the subscription in a fresh transaction */
AbortOutOfAnyTransaction();
StartTransactionCommand();

/*
* Log the error that caused DisableSubscriptionOnError to be called.
* We do this immediately so that it won't be lost if some other internal
* error occurs in the following code,
*/
EmitErrorReport();
FlushErrorState();

...
>
> > ------
> > [1] peter-v19-poc.diff - POC just to try some of my suggestions above to make
> > sure all tests still pass ok.
> Thanks ! I included you as co-author, because
> you shared meaningful patches for me.
>

Thanks!

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Monday, February 21, 2022 2:56 PM Peter Smith
> <smithpb2250@gmail.com> wrote:
> > > Thanks for addressing my previous comments. Now I have looked at v19.
> > >
> > > On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > On Friday, February 18, 2022 3:27 PM Peter Smith
> > > <smithpb2250@gmail.com> wrote:
> > > > > Hi. Below are my code review comments for v18.
> > > > Thank you for your review !
> > > ...
> > > > > 5. src/backend/replication/logical/worker.c -
> > > > > DisableSubscriptionOnError
> > > > >
> > > > > + /*
> > > > > + * We would not be here unless this subscription's disableonerr
> > > > > + field was
> > > > > + * true when our worker began applying changes, but check
> > > > > + whether that
> > > > > + * field has changed in the interim.
> > > > > + */
> > > > >
> > > > > Apparently, this function might just do nothing if it detects
> > > > > some situation where the flag was changed somehow, but I'm not
> > > > > 100% sure that the callers are properly catering for when nothing
> happens.
> > > > >
> > > > > IMO it would be better if this function would return true/false
> > > > > to mean "did disable subscription happen or not?" because that
> > > > > will give the calling code the chance to check the function
> > > > > return and do the right thing - e.g. if the caller first thought
> > > > > it should be disabled but then
> > > it turned out it did NOT disable...
> > > > I don't think we need to do something more.
> > > > After this function, table sync worker and the apply worker just exit.
> > > > IMO, we don't need to do additional work for already-disabled
> > > > subscription on the caller sides.
> > > > It should be sufficient to fulfill the purpose of
> > > > DisableSubscriptionOnError or confirm it has been fulfilled.
> > >
> > > Hmmm -  Yeah, it may be the workers might just exit soon after
> > > anyhow as you say so everything comes out in the wash, but still, I
> > > felt for this case when DisableSubscriptionOnError turned out to do
> > > nothing it would be better to exit via the existing logic. And that is easy to do
> if the function returns true/false.
> > >
> > > For example, changes like below seemed neater code to me. YMMV.
> > >
> > > BEFORE (SyncTableStartWrapper):
> > > if (MySubscription->disableonerr)
> > > {
> > > DisableSubscriptionOnError();
> > > proc_exit(0);
> > > }
> > > AFTER
> > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > proc_exit(0);
> > >
> > > BEFORE (ApplyLoopWrapper)
> > > if (MySubscription->disableonerr)
> > > {
> > > /* Disable the subscription */
> > > DisableSubscriptionOnError();
> > > return;
> > > }
> > > AFTER
> > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > return;
> > Okay, so this return value works for better readability.
> > Fixed.
> >
> >
> > > ~~~
> > >
> > > Here are a couple more comments:
> > >
> > > 1. src/backend/replication/logical/worker.c -
> > > DisableSubscriptionOnError, Refactor error handling
> > >
> > > (this comment assumes the above gets changed too)
> > I think those are independent.
> 
> OK. I was only curious if the change #5 above might cause the error to be logged
> 2x, if the DisableSubscriptionOnError returns false.
> - firstly, when it logs errors within the function
> - secondly, by normal error mechanism when the caller re-throws it.
> 
> But, if you are sure that won't happen then it is good news.
I didn't feel this would become a substantial issue.

When we alter subscription with disable_on_error = false
after we go into the DisableSubscriptionOnError,
we don't disable the subscription in the same function.
That means we launch new apply workers repeatedly after that
until we solve the error cause or we set the disable_on_error = true again.

So, if we confirm that the disable_on_error = false in the DisableSubscriptionOnError,
it's highly possible that we'll get more same original errors from new apply workers.

This leads to another question, we should suppress the 2nd error(if there is),
even when it's highly possible that we'll get more same errors by new apply workers
created repeatedly or not. I wasn't sure if the implementation complexity for this wins the log print.

So, kindly let me keep the current code as is.
If someone wants share his/her opinion on this, please let me know.
 
> >
> >
> > > +static void
> > > +DisableSubscriptionOnError(void)
> > > +{
> > > + Relation rel;
> > > + bool nulls[Natts_pg_subscription];  bool
> > > +replaces[Natts_pg_subscription];  Datum
> > > +values[Natts_pg_subscription];  HeapTuple tup;
> > > +Form_pg_subscription subform;
> > > +
> > > + /* Emit the error */
> > > + EmitErrorReport();
> > > + /* Abort any active transaction */ AbortOutOfAnyTransaction();
> > > + /* Reset the ErrorContext */
> > > + FlushErrorState();
> > > +
> > > + /* Disable the subscription in a fresh transaction */
> > > + StartTransactionCommand();
> > >
> > > If this DisableSubscriptionOnError function decides later that
> > > actually the 'disableonerr' flag is false (i.e. it's NOT going to
> > > disable the subscription after
> > > all) then IMO it make more sense that the error logging for that
> > > case should just do whatever it is doing now by the normal error processing
> mechanism.
> > >
> > > In other words, I thought perhaps the code to
> > > EmitErrorReport/FlushError state etc be moved to be BELOW the if
> > > (!subform->subdisableonerr) bail-out code?
> > >
> > > Please see what you think in my attached POC [1]. It seems neater to
> > > me, and tests are all OK. Maybe I am mistaken...
> > I had a concern that this order change of codes would have a negative
> > impact when we have another new error during the call of
> DisableSubscriptionOnError.
> >
> > With the debugger, I raised an error in this function before emitting the
> original error.
> > As a result, the original error that makes the apply worker go into
> > the path of DisableSubscriptionOnError (in my test, duplication error) has
> vanished.
> > In this sense, v19 looks safer, and the current order to handle error
> > recovery first looks better to me.
> >
> > FYI, after the 2nd debugger error,
> > the next new apply worker created quickly met the same type of error,
> > went into the same path, and disabled the subscription with the log.
> > But, it won't be advisable to let the possibility left.
> 
> OK - thanks for checking it.
> 
> Will it be better to put some comments about that? Something like --
> 
> BEFORE
> /* Emit the error */
> EmitErrorReport();
> /* Abort any active transaction */
> AbortOutOfAnyTransaction();
> /* Reset the ErrorContext */
> FlushErrorState();
> 
> /* Disable the subscription in a fresh transaction */
> StartTransactionCommand();
> 
> AFTER
> /* Disable the subscription in a fresh transaction */
> AbortOutOfAnyTransaction(); StartTransactionCommand();
> 
> /*
> * Log the error that caused DisableSubscriptionOnError to be called.
> * We do this immediately so that it won't be lost if some other internal
> * error occurs in the following code,
> */
> EmitErrorReport();
> FlushErrorState();
I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
The FlushErrorState's comment mentions we are not out of the error subsystem until
we call this and starting a new transaction before it didn't sound a good idea.
But, I've fixed the comments around this. The indentation for new comments
are checked by pgindent.


Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Peter Smith
Date:
On Tue, Feb 22, 2022 at 3:11 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > On Mon, Feb 21, 2022 at 11:44 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Monday, February 21, 2022 2:56 PM Peter Smith
> > <smithpb2250@gmail.com> wrote:
> > > > Thanks for addressing my previous comments. Now I have looked at v19.
> > > >
> > > > On Mon, Feb 21, 2022 at 11:25 AM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > >
> > > > > On Friday, February 18, 2022 3:27 PM Peter Smith
> > > > <smithpb2250@gmail.com> wrote:
> > > > > > Hi. Below are my code review comments for v18.
> > > > > Thank you for your review !
> > > > ...
> > > > > > 5. src/backend/replication/logical/worker.c -
> > > > > > DisableSubscriptionOnError
> > > > > >
> > > > > > + /*
> > > > > > + * We would not be here unless this subscription's disableonerr
> > > > > > + field was
> > > > > > + * true when our worker began applying changes, but check
> > > > > > + whether that
> > > > > > + * field has changed in the interim.
> > > > > > + */
> > > > > >
> > > > > > Apparently, this function might just do nothing if it detects
> > > > > > some situation where the flag was changed somehow, but I'm not
> > > > > > 100% sure that the callers are properly catering for when nothing
> > happens.
> > > > > >
> > > > > > IMO it would be better if this function would return true/false
> > > > > > to mean "did disable subscription happen or not?" because that
> > > > > > will give the calling code the chance to check the function
> > > > > > return and do the right thing - e.g. if the caller first thought
> > > > > > it should be disabled but then
> > > > it turned out it did NOT disable...
> > > > > I don't think we need to do something more.
> > > > > After this function, table sync worker and the apply worker just exit.
> > > > > IMO, we don't need to do additional work for already-disabled
> > > > > subscription on the caller sides.
> > > > > It should be sufficient to fulfill the purpose of
> > > > > DisableSubscriptionOnError or confirm it has been fulfilled.
> > > >
> > > > Hmmm -  Yeah, it may be the workers might just exit soon after
> > > > anyhow as you say so everything comes out in the wash, but still, I
> > > > felt for this case when DisableSubscriptionOnError turned out to do
> > > > nothing it would be better to exit via the existing logic. And that is easy to do
> > if the function returns true/false.
> > > >
> > > > For example, changes like below seemed neater code to me. YMMV.
> > > >
> > > > BEFORE (SyncTableStartWrapper):
> > > > if (MySubscription->disableonerr)
> > > > {
> > > > DisableSubscriptionOnError();
> > > > proc_exit(0);
> > > > }
> > > > AFTER
> > > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > > proc_exit(0);
> > > >
> > > > BEFORE (ApplyLoopWrapper)
> > > > if (MySubscription->disableonerr)
> > > > {
> > > > /* Disable the subscription */
> > > > DisableSubscriptionOnError();
> > > > return;
> > > > }
> > > > AFTER
> > > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > > return;
> > > Okay, so this return value works for better readability.
> > > Fixed.
> > >
> > >
> > > > ~~~
> > > >
> > > > Here are a couple more comments:
> > > >
> > > > 1. src/backend/replication/logical/worker.c -
> > > > DisableSubscriptionOnError, Refactor error handling
> > > >
> > > > (this comment assumes the above gets changed too)
> > > I think those are independent.
> >
> > OK. I was only curious if the change #5 above might cause the error to be logged
> > 2x, if the DisableSubscriptionOnError returns false.
> > - firstly, when it logs errors within the function
> > - secondly, by normal error mechanism when the caller re-throws it.
> >
> > But, if you are sure that won't happen then it is good news.
> I didn't feel this would become a substantial issue.
>
> When we alter subscription with disable_on_error = false
> after we go into the DisableSubscriptionOnError,
> we don't disable the subscription in the same function.
> That means we launch new apply workers repeatedly after that
> until we solve the error cause or we set the disable_on_error = true again.
>
> So, if we confirm that the disable_on_error = false in the DisableSubscriptionOnError,
> it's highly possible that we'll get more same original errors from new apply workers.
>
> This leads to another question, we should suppress the 2nd error(if there is),
> even when it's highly possible that we'll get more same errors by new apply workers
> created repeatedly or not. I wasn't sure if the implementation complexity for this wins the log print.
>
> So, kindly let me keep the current code as is.
> If someone wants share his/her opinion on this, please let me know.

OK, but is it really correct that this scenario can happen "When we
alter subscription with disable_on_error = false after we go into the
DisableSubscriptionOnError". Actually, I thought this window may be
much bigger than that - e.g. maybe we changed the option to false at
*any* time after the worker was originally started and the original
option values were got by GetSubscription function (and that might be
hours/days/weeks ago since it started).

>
> > >
> > >
> > > > +static void
> > > > +DisableSubscriptionOnError(void)
> > > > +{
> > > > + Relation rel;
> > > > + bool nulls[Natts_pg_subscription];  bool
> > > > +replaces[Natts_pg_subscription];  Datum
> > > > +values[Natts_pg_subscription];  HeapTuple tup;
> > > > +Form_pg_subscription subform;
> > > > +
> > > > + /* Emit the error */
> > > > + EmitErrorReport();
> > > > + /* Abort any active transaction */ AbortOutOfAnyTransaction();
> > > > + /* Reset the ErrorContext */
> > > > + FlushErrorState();
> > > > +
> > > > + /* Disable the subscription in a fresh transaction */
> > > > + StartTransactionCommand();
> > > >
> > > > If this DisableSubscriptionOnError function decides later that
> > > > actually the 'disableonerr' flag is false (i.e. it's NOT going to
> > > > disable the subscription after
> > > > all) then IMO it make more sense that the error logging for that
> > > > case should just do whatever it is doing now by the normal error processing
> > mechanism.
> > > >
> > > > In other words, I thought perhaps the code to
> > > > EmitErrorReport/FlushError state etc be moved to be BELOW the if
> > > > (!subform->subdisableonerr) bail-out code?
> > > >
> > > > Please see what you think in my attached POC [1]. It seems neater to
> > > > me, and tests are all OK. Maybe I am mistaken...
> > > I had a concern that this order change of codes would have a negative
> > > impact when we have another new error during the call of
> > DisableSubscriptionOnError.
> > >
> > > With the debugger, I raised an error in this function before emitting the
> > original error.
> > > As a result, the original error that makes the apply worker go into
> > > the path of DisableSubscriptionOnError (in my test, duplication error) has
> > vanished.
> > > In this sense, v19 looks safer, and the current order to handle error
> > > recovery first looks better to me.
> > >
> > > FYI, after the 2nd debugger error,
> > > the next new apply worker created quickly met the same type of error,
> > > went into the same path, and disabled the subscription with the log.
> > > But, it won't be advisable to let the possibility left.
> >
> > OK - thanks for checking it.
> >
> > Will it be better to put some comments about that? Something like --
> >
> > BEFORE
> > /* Emit the error */
> > EmitErrorReport();
> > /* Abort any active transaction */
> > AbortOutOfAnyTransaction();
> > /* Reset the ErrorContext */
> > FlushErrorState();
> >
> > /* Disable the subscription in a fresh transaction */
> > StartTransactionCommand();
> >
> > AFTER
> > /* Disable the subscription in a fresh transaction */
> > AbortOutOfAnyTransaction(); StartTransactionCommand();
> >
> > /*
> > * Log the error that caused DisableSubscriptionOnError to be called.
> > * We do this immediately so that it won't be lost if some other internal
> > * error occurs in the following code,
> > */
> > EmitErrorReport();
> > FlushErrorState();
> I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
> The FlushErrorState's comment mentions we are not out of the error subsystem until
> we call this and starting a new transaction before it didn't sound a good idea.
> But, I've fixed the comments around this. The indentation for new comments
> are checked by pgindent.

OK.

======

Here are a couple more review comments for v21.

~~~

1. worker.c - comment

+ subform = (Form_pg_subscription) GETSTRUCT(tup);
+
+ /*
+ * We would not be here unless this subscription's disableonerr field was
+ * true, but check whether that field has changed in the interim.
+ */
+ if (!subform->subdisableonerr)
+ {
+ heap_freetuple(tup);
+ table_close(rel, RowExclusiveLock);
+ CommitTransactionCommand();
+ return false;
+ }

I felt that comment belongs above the subform assignment because that
is the only reason we are getting the subform again.

~~

2. worker.c - subform->oid

+ /* Notify the subscription will be no longer valid */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" will be disabled due
to an error",
+    MySubscription->name));
+
+ LockSharedObject(SubscriptionRelationId, subform->oid, 0,
AccessExclusiveLock);

Can't we just use MySubscription->oid here? We really only needed that
subform to get new option values.

~~

3. worker.c - whitespace

Your pg_indent has also changed some whitespace for parts of worker.c
that are completely unrelated to this patch. You might want to revert
those changes.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"tanghy.fnst@fujitsu.com"
Date:
Hi Osumi-san,

I have a comment on v21 patch.

I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think
for subscription s2, we only tested some normal cases(which could be tested with s1), 
and didn't test any error case, which means it wouldn't be automatically disabled. 
Is there any reason for creating subscription s2?

Regards,
Tang

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, February 23, 2022 6:52 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> I have a comment on v21 patch.
> 
> I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think for
> subscription s2, we only tested some normal cases(which could be tested with
> s1), and didn't test any error case, which means it wouldn't be automatically
> disabled.
> Is there any reason for creating subscription s2?
Hi, thank you for your review !

It's for checking there's no impact/influence when disabling one subscription
on the other subscription if any.

*But*, when I have a look at the past tests to add options (e.g. streaming,
two_phase), we don't have this kind of test that I have for disable_on_error patch.
Therefore, I'd like to fix the test as you suggested in my next version.


Best Regards,
    Takamichi Osumi


Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Tue, Feb 22, 2022 at 3:03 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> ~~~
>
> 1. worker.c - comment
>
> + subform = (Form_pg_subscription) GETSTRUCT(tup);
> +
> + /*
> + * We would not be here unless this subscription's disableonerr field was
> + * true, but check whether that field has changed in the interim.
> + */
> + if (!subform->subdisableonerr)
> + {
> + heap_freetuple(tup);
> + table_close(rel, RowExclusiveLock);
> + CommitTransactionCommand();
> + return false;
> + }
>
> I felt that comment belongs above the subform assignment because that
> is the only reason we are getting the subform again.

IIUC if we return false here, the same error will be emitted twice.
And I'm not sure this check is really necessary. It would work only
when the subdisableonerr is set to false concurrently, but doesn't
work for the opposite caces. I think we can check
MySubscription->disableonerr and then just update the tuple.

Here are some comments:

Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?

---
+       /*
+        * Log the error that caused DisableSubscriptionOnError to be called. We
+        * do this immediately so that it won't be lost if some other internal
+        * error occurs in the following code.
+        */
+       EmitErrorReport();
+       AbortOutOfAnyTransaction();
+       FlushErrorState();

Do we need to hold interrupts during cleanup here?

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Here are some comments:
>
> Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?
>

I have given this comment to move the related code to separate
functions to slightly simplify ApplyWorkerMain() code but if you don't
like we can move it back. I am not sure I like the new function names
in the patch though.

> ---
> +       /*
> +        * Log the error that caused DisableSubscriptionOnError to be called. We
> +        * do this immediately so that it won't be lost if some other internal
> +        * error occurs in the following code.
> +        */
> +       EmitErrorReport();
> +       AbortOutOfAnyTransaction();
> +       FlushErrorState();
>
> Do we need to hold interrupts during cleanup here?
>

I think so. We do prevent interrupts via
HOLD_INTERRUPTS/RESUME_INTERRUPTS during cleanup.


-- 
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Here are some comments:
> >
> > Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?
> >
>
> I have given this comment to move the related code to separate
> functions to slightly simplify ApplyWorkerMain() code but if you don't
> like we can move it back. I am not sure I like the new function names
> in the patch though.

Okay, I'm fine with moving this code but perhaps we can find a better
function name as "Wrapper" seems slightly odd to me. For example,
start_table_sync_start() and start_apply_changes() or something (it
seems we use the snake case for static functions in worker.c).

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Optionally automatically disable logical replication subscriptions on error

From
Amit Kapila
Date:
On Thu, Feb 24, 2022 at 6:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > Here are some comments:
> > >
> > > Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?
> > >
> >
> > I have given this comment to move the related code to separate
> > functions to slightly simplify ApplyWorkerMain() code but if you don't
> > like we can move it back. I am not sure I like the new function names
> > in the patch though.
>
> Okay, I'm fine with moving this code but perhaps we can find a better
> function name as "Wrapper" seems slightly odd to me.
>

Agreed.

> For example,
> start_table_sync_start() and start_apply_changes() or something (it
> seems we use the snake case for static functions in worker.c).
>

I am fine with something on these lines, how about start_table_sync()
and start_apply() respectively?

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, February 25, 2022 12:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Feb 24, 2022 at 6:30 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Feb 24, 2022 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Here are some comments:
> > > >
> > > > Why do we need SyncTableStartWrapper() and ApplyLoopWrapper()?
> > > >
> > >
> > > I have given this comment to move the related code to separate
> > > functions to slightly simplify ApplyWorkerMain() code but if you
> > > don't like we can move it back. I am not sure I like the new
> > > function names in the patch though.
> >
> > Okay, I'm fine with moving this code but perhaps we can find a better
> > function name as "Wrapper" seems slightly odd to me.
> >
> 
> Agreed.
> 
> > For example,
> > start_table_sync_start() and start_apply_changes() or something (it
> > seems we use the snake case for static functions in worker.c).
> >
> 
> I am fine with something on these lines, how about start_table_sync() and
> start_apply() respectively?
Adopted. (If we come up with better names, we can change those then)

Kindly have a look at attached the v22.
It has incorporated other improvements of TAP test,
refinement of the DisableSubscriptionOnError function and so on.

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, February 24, 2022 8:09 PM Amit Kapila <amit.kapila16@gmail.com>
> On Thu, Feb 24, 2022 at 1:20 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > +       /*
> > +        * Log the error that caused DisableSubscriptionOnError to be
> called. We
> > +        * do this immediately so that it won't be lost if some other internal
> > +        * error occurs in the following code.
> > +        */
> > +       EmitErrorReport();
> > +       AbortOutOfAnyTransaction();
> > +       FlushErrorState();
> >
> > Do we need to hold interrupts during cleanup here?
> >
> 
> I think so. We do prevent interrupts via
> HOLD_INTERRUPTS/RESUME_INTERRUPTS during cleanup.
Fixed.

Kindly have a look at v22 shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Thursday, February 24, 2022 4:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Feb 22, 2022 at 3:03 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > ~~~
> >
> > 1. worker.c - comment
> >
> > + subform = (Form_pg_subscription) GETSTRUCT(tup);
> > +
> > + /*
> > + * We would not be here unless this subscription's disableonerr field
> > + was
> > + * true, but check whether that field has changed in the interim.
> > + */
> > + if (!subform->subdisableonerr)
> > + {
> > + heap_freetuple(tup);
> > + table_close(rel, RowExclusiveLock);
> > + CommitTransactionCommand();
> > + return false;
> > + }
> >
> > I felt that comment belongs above the subform assignment because that
> > is the only reason we are getting the subform again.
> 
> IIUC if we return false here, the same error will be emitted twice.
> And I'm not sure this check is really necessary. It would work only when the
> subdisableonerr is set to false concurrently, but doesn't work for the opposite
> caces. I think we can check
> MySubscription->disableonerr and then just update the tuple.
Addressed. I followed your advice and deleted the check.


Kindly have a look at v22 shared in [1].


[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, February 23, 2022 6:52 PM Tang, Haiying/唐 海英 <tanghy.fnst@fujitsu.com> wrote:
> I have a comment on v21 patch.
> 
> I wonder if we really need subscription s2 in 028_disable_on_error.pl. I think for
> subscription s2, we only tested some normal cases(which could be tested with
> s1), and didn't test any error case, which means it wouldn't be automatically
> disabled.
> Is there any reason for creating subscription s2?
Removed the subscription s2.

This has reduced the code amount of TAP tests.
Kindly have a look at the v22 shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, February 22, 2022 3:03 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Here are a couple more review comments for v21.
> 
> ~~~
> 
> 1. worker.c - comment
> 
> + subform = (Form_pg_subscription) GETSTRUCT(tup);
> +
> + /*
> + * We would not be here unless this subscription's disableonerr field
> + was
> + * true, but check whether that field has changed in the interim.
> + */
> + if (!subform->subdisableonerr)
> + {
> + heap_freetuple(tup);
> + table_close(rel, RowExclusiveLock);
> + CommitTransactionCommand();
> + return false;
> + }
> 
> I felt that comment belongs above the subform assignment because that is the
> only reason we are getting the subform again.
This part has been removed along with the modification
that we just disable the subscription in the main processing
when we get an error.

 
> ~~
> 
> 2. worker.c - subform->oid
> 
> + /* Notify the subscription will be no longer valid */ ereport(LOG,
> + errmsg("logical replication subscription \"%s\" will be disabled due
> to an error",
> +    MySubscription->name));
> +
> + LockSharedObject(SubscriptionRelationId, subform->oid, 0,
> AccessExclusiveLock);
> 
> Can't we just use MySubscription->oid here? We really only needed that
> subform to get new option values.
Fixed.


> ~~
> 
> 3. worker.c - whitespace
> 
> Your pg_indent has also changed some whitespace for parts of worker.c that
> are completely unrelated to this patch. You might want to revert those changes.
Fixed.

Kindly have a look at v22 that took in all your comments.
It's shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373D9B26F988307B0D3FE20ED3E9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


Please see below my review comments for v22.

======

1. Commit message

"table sync worker" -> "tablesync worker"

~~~

2. doc/src/sgml/catalogs.sgml

+      <para>
+       If true, the subscription will be disabled when subscription
+       workers detect any errors
+      </para></entry>

It felt a bit strange to say "subscription" 2x in the sentence, but I
am not sure how to improve it. Maybe like below?

BEFORE
If true, the subscription will be disabled when subscription workers
detect any errors

SUGGESTED
If true, the subscription will be disabled if one of its workers
detects an error

~~~

3. src/backend/replication/logical/worker.c - DisableSubscriptionOnError

@@ -2802,6 +2803,69 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 }

 /*
+ * Disable the current subscription, after error recovery processing.
+ */
+static void
+DisableSubscriptionOnError(void)

I thought the "after error recovery processing" part was a bit generic
and did not really say what it was doing.

BEFORE
Disable the current subscription, after error recovery processing.
SUGGESTED
Disable the current subscription, after logging the error that caused
this function to be called.

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ if (MySubscription->disableonerr)
+ {
+ DisableSubscriptionOnError();
+ return;
+ }
+
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ PG_END_TRY();

The current code looks correct, but I felt it is a bit tricky to
easily see the execution path after the return.

Since it will effectively just exit anyhow I think it will be simpler
just to do that explicitly right here instead of the 'return'. This
will also make the code consistent with the same 'disableonerr' logic
of the start_start_sync.

SUGGESTION
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}

~~~

5. src/bin/pg_dump/pg_dump.c

@@ -4463,6 +4473,9 @@ dumpSubscription(Archive *fout, const
SubscriptionInfo *subinfo)
  if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
  appendPQExpBufferStr(query, ", two_phase = on");

+ if (strcmp(subinfo->subdisableonerr, "f") != 0)
+ appendPQExpBufferStr(query, ", disable_on_error = true");
+

Although the code is correct, I think it would be more natural to set
this option as true when the user wants it true. e.g. check for "t"
same as 'subbinary' is doing. This way, even if there was some
unknown/corrupted value the code would do nothing, which is the
behaviour you want...

SUGGESTION
if (strcmp(subinfo->subdisableonerr, "t") == 0)

~~~

6. src/include/catalog/pg_subscription.h

@@ -67,6 +67,9 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
BKI_SHARED_RELATION BKI_ROW

  char subtwophasestate; /* Stream two-phase transactions */

+ bool subdisableonerr; /* True if occurrence of apply errors
+ * should disable the subscription */

The comment seems not quite right because it's not just about apply
errors. E.g. I think any error in tablesync will cause disablement
too.

BEFORE
True if occurrence of apply errors should disable the subscription
SUGGESTED
True if a worker error should cause the subscription to be disabled

~~~

7. src/test/regress/sql/subscription.sql - whitespace

+-- now it works
+CREATE SUBSCRIPTION regress_testsub CONNECTION
'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
false, disable_on_error = false);
+
+\dRs+
+
+ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
+
+\dRs+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;
+

I think should be a blank line after that last \dRs+ just like the
other one, because it belongs logically with the code above it, not
with the ALTER slot_name.

~~~

8. src/test/subscription/t/028_disable_on_error.pl - filename

The 028 number needs to be bumped because there is already a TAP test
called 028 now

~~~

9. src/test/subscription/t/028_disable_on_error.pl - missing test

There was no test case for the last combination where the user correct
the apply worker problem: E.g. After a previous error/disable of the
subscriber, remove the index, publish the inserts again, and check
they get applied properly.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, February 25, 2022 9:45 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> Kindly have a look at attached the v22.
> It has incorporated other improvements of TAP test, refinement of the
> DisableSubscriptionOnError function and so on.
The recent commit(7a85073) has changed the subscription workers
error handling. So, I rebased my disable_on_error patch first
for anyone who are interested in the review.

I'll incorporate incoming comments for v22 in my next version.

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, March 1, 2022 9:49 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Please see below my review comments for v22.
> 
> ======
> 
> 1. Commit message
> 
> "table sync worker" -> "tablesync worker"
Fixed.

> ~~~
> 
> 2. doc/src/sgml/catalogs.sgml
> 
> +      <para>
> +       If true, the subscription will be disabled when subscription
> +       workers detect any errors
> +      </para></entry>
> 
> It felt a bit strange to say "subscription" 2x in the sentence, but I am not sure
> how to improve it. Maybe like below?
> 
> BEFORE
> If true, the subscription will be disabled when subscription workers detect any
> errors
> 
> SUGGESTED
> If true, the subscription will be disabled if one of its workers detects an error
Fixed.


> ~~~
> 
> 3. src/backend/replication/logical/worker.c - DisableSubscriptionOnError
> 
> @@ -2802,6 +2803,69 @@ LogicalRepApplyLoop(XLogRecPtr
> last_received)  }
> 
>  /*
> + * Disable the current subscription, after error recovery processing.
> + */
> +static void
> +DisableSubscriptionOnError(void)
> 
> I thought the "after error recovery processing" part was a bit generic and did not
> really say what it was doing.
> 
> BEFORE
> Disable the current subscription, after error recovery processing.
> SUGGESTED
> Disable the current subscription, after logging the error that caused this
> function to be called.
Fixed.

> ~~~
> 
> 4. src/backend/replication/logical/worker.c - start_apply
> 
> + if (MySubscription->disableonerr)
> + {
> + DisableSubscriptionOnError();
> + return;
> + }
> +
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> + }
> + PG_END_TRY();
> 
> The current code looks correct, but I felt it is a bit tricky to easily see the
> execution path after the return.
> 
> Since it will effectively just exit anyhow I think it will be simpler just to do that
> explicitly right here instead of the 'return'. This will also make the code
> consistent with the same 'disableonerr' logic of the start_start_sync.
> 
> SUGGESTION
> if (MySubscription->disableonerr)
> {
> DisableSubscriptionOnError();
> proc_exit(0);
> }
Fixed.

> ~~~
> 
> 5. src/bin/pg_dump/pg_dump.c
> 
> @@ -4463,6 +4473,9 @@ dumpSubscription(Archive *fout, const
> SubscriptionInfo *subinfo)
>   if (strcmp(subinfo->subtwophasestate, two_phase_disabled) != 0)
>   appendPQExpBufferStr(query, ", two_phase = on");
> 
> + if (strcmp(subinfo->subdisableonerr, "f") != 0)
> + appendPQExpBufferStr(query, ", disable_on_error = true");
> +
> 
> Although the code is correct, I think it would be more natural to set this option
> as true when the user wants it true. e.g. check for "t"
> same as 'subbinary' is doing. This way, even if there was some
> unknown/corrupted value the code would do nothing, which is the behaviour
> you want...
> 
> SUGGESTION
> if (strcmp(subinfo->subdisableonerr, "t") == 0)
Fixed.


> ~~~
> 
> 6. src/include/catalog/pg_subscription.h
> 
> @@ -67,6 +67,9 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
> 
>   char subtwophasestate; /* Stream two-phase transactions */
> 
> + bool subdisableonerr; /* True if occurrence of apply errors
> + * should disable the subscription */
> 
> The comment seems not quite right because it's not just about apply errors. E.g.
> I think any error in tablesync will cause disablement too.
> 
> BEFORE
> True if occurrence of apply errors should disable the subscription SUGGESTED
> True if a worker error should cause the subscription to be disabled
Fixed.


> ~~~
> 
> 7. src/test/regress/sql/subscription.sql - whitespace
> 
> +-- now it works
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
> disable_on_error = false);
> +
> +\dRs+
> +
> +ALTER SUBSCRIPTION regress_testsub SET (disable_on_error = true);
> +
> +\dRs+
> +ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE); DROP
> +SUBSCRIPTION regress_testsub;
> +
> 
> I think should be a blank line after that last \dRs+ just like the other one,
> because it belongs logically with the code above it, not with the ALTER
> slot_name.
Fixed.


> ~~~
> 
> 8. src/test/subscription/t/028_disable_on_error.pl - filename
> 
> The 028 number needs to be bumped because there is already a TAP test
> called 028 now
This is already done in v22, so I've skipped this.

> ~~~
> 
> 9. src/test/subscription/t/028_disable_on_error.pl - missing test
> 
> There was no test case for the last combination where the user correct the
> apply worker problem: E.g. After a previous error/disable of the subscriber,
> remove the index, publish the inserts again, and check they get applied
> properly.
Fixed.

Attached the updated version v24.


Best Regards,
    Takamichi Osumi


Attachment
Please see below my review comments for v24.

======

1. src/backend/replication/logical/worker.c - start_table_sync

+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

(This review comment is just FYI in case you did not do this deliberately)

FYI, you didn't really need to call am_tablesync_worker() here because
it is already asserted for the sync phase that it MUST be a tablesync
worker.

OTOH, IMO it documents the purpose of the parm so if it was deliberate
then that is OK too.

~~~

2. src/backend/replication/logical/worker.c - start_table_sync

+ PG_CATCH();
+ {
+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+

[Maybe you will say that this review comment is unrelated to
disable_on_err, but since this is a totally new/refactored function
then I think maybe there is no problem to make this change at the same
time. Anyway there is no function change, it is just rearranging some
comments.]

I felt the separation of those 2 statements and comments makes that
code less clean than it could/should be. IMO they should be grouped
together.

SUGGESTED
/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

~~~

3. src/backend/replication/logical/worker.c - start_apply

+ /*
+ * Abort the current transaction so that we send the stats message in
+ * an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during the application of the change */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

Same comment as #2 above, but this code fragment is in start_apply function.

~~~

4. src/test/subscription/t/029_disable_on_error.pl - comment

+# Drop the unique index on the sub and re-enabled the subscription.
+# Then, confirm that we have finished the apply.

SUGGESTED (tweak the comment wording)
# Drop the unique index on the sub and re-enable the subscription.
# Then, confirm that the previously failing insert was applied OK.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 2, 2022 9:34 AM Peter Smith <smithpb2250@gmail.com> wrote:
> Please see below my review comments for v24.
Thank you for checking my patch !

 
> ======
> 
> 1. src/backend/replication/logical/worker.c - start_table_sync
> 
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> 
> (This review comment is just FYI in case you did not do this deliberately)
> 
> FYI, you didn't really need to call am_tablesync_worker() here because it is
> already asserted for the sync phase that it MUST be a tablesync worker.
> 
> OTOH, IMO it documents the purpose of the parm so if it was deliberate then
> that is OK too.
Fixed.


> ~~~
> 
> 2. src/backend/replication/logical/worker.c - start_table_sync
> 
> + PG_CATCH();
> + {
> + /*
> + * Abort the current transaction so that we send the stats message in
> + * an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> +
> 
> [Maybe you will say that this review comment is unrelated to disable_on_err,
> but since this is a totally new/refactored function then I think maybe there is no
> problem to make this change at the same time. Anyway there is no function
> change, it is just rearranging some comments.]
> 
> I felt the separation of those 2 statements and comments makes that code less
> clean than it could/should be. IMO they should be grouped together.
> 
> SUGGESTED
> /*
> * Report the worker failed during table synchronization. Abort the
> * current transaction so that the stats message is sent in an idle
> * state.
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> er());
I think this is OK. Thank you for suggestion. Fixed.



> ~~~
> 
> 3. src/backend/replication/logical/worker.c - start_apply
> 
> + /*
> + * Abort the current transaction so that we send the stats message in
> + * an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed during the application of the change */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> 
> Same comment as #2 above, but this code fragment is in start_apply function.
Fixed.


> ~~~
> 
> 4. src/test/subscription/t/029_disable_on_error.pl - comment
> 
> +# Drop the unique index on the sub and re-enabled the subscription.
> +# Then, confirm that we have finished the apply.
> 
> SUGGESTED (tweak the comment wording)
> # Drop the unique index on the sub and re-enable the subscription.
> # Then, confirm that the previously failing insert was applied OK.
Fixed.


Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Wed, Mar 2, 2022 at 9:34 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please see below my review comments for v24.
>
> ======
>
> 1. src/backend/replication/logical/worker.c - start_table_sync
>
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
>
> (This review comment is just FYI in case you did not do this deliberately)
>
> FYI, you didn't really need to call am_tablesync_worker() here because
> it is already asserted for the sync phase that it MUST be a tablesync
> worker.
>
> OTOH, IMO it documents the purpose of the parm so if it was deliberate
> then that is OK too.
>
> ~~~
>
> 2. src/backend/replication/logical/worker.c - start_table_sync
>
> + PG_CATCH();
> + {
> + /*
> + * Abort the current transaction so that we send the stats message in
> + * an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
> +
>
> [Maybe you will say that this review comment is unrelated to
> disable_on_err, but since this is a totally new/refactored function
> then I think maybe there is no problem to make this change at the same
> time. Anyway there is no function change, it is just rearranging some
> comments.]
>
> I felt the separation of those 2 statements and comments makes that
> code less clean than it could/should be. IMO they should be grouped
> together.
>
> SUGGESTED
> /*
> * Report the worker failed during table synchronization. Abort the
> * current transaction so that the stats message is sent in an idle
> * state.
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

After more thoughts, should we do both AbortOutOfAnyTransaction() and
error message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAny Transaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do clean up first and then do other works
such as sending the message to the stats collector and updating the
catalog.

Here are some comments on v24 patch:

+        /* Look up our subscription in the catalogs */
+        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+
CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

---
+        if (!HeapTupleIsValid(tup))
+                ereport(ERROR,
+                                errcode(ERRCODE_UNDEFINED_OBJECT),
+                                errmsg("subscription \"%s\" does not exist",
+                                           MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

---
+        /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to like "Notify the subscription will be
disabled". (the subscription is still valid actually, but just
disabled).

---
+        /* Notify the subscription will be no longer valid */
+        ereport(LOG,
+                        errmsg("logical replication subscription
\"%s\" will be disabled due to an error",
+                                   MySubscription->name));
+

I think we can report the log at the end of this function rather than
during the transaction.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL to this table since
there is notupdate/delete.

Do we need tbl_idx?

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s
+WHERE s.subname = 'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's
subenabled. For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+  or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check
the query result instead of using poll_query_until() so that we can
check the query result in case where the test fails.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to
$cmd and executed even when the query is used only once. It seems a
different style from the one in other tap tests. Is there any reason
why we do this style for all queries in the tap tests?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> message handling while holding interrupts? That is,
> 
> HOLD_INTERRUPTS();
> EmitErrorReport();
> FlushErrorState();
> AbortOutOfAny Transaction();
> RESUME_INTERRUPTS();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> er());
> 
> I think it's better that we do clean up first and then do other works such as
> sending the message to the stats collector and updating the catalog.
I agree. Fixed. Along with this change, I corrected the header comment of
DisableSubscriptionOnError, too.


> Here are some comments on v24 patch:
> 
> +        /* Look up our subscription in the catalogs */
> +        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> +
> CStringGetDatum(MySubscription->name));
> 
> s/catalogs/catalog/
> 
> Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
Changed.


> ---
> +        if (!HeapTupleIsValid(tup))
> +                ereport(ERROR,
> +                                errcode(ERRCODE_UNDEFINED_OBJECT),
> +                                errmsg("subscription \"%s\" does not
> exist",
> +                                           MySubscription->name));
> 
> I think we should use elog() here rather than ereport() since it's a
> should-not-happen error.
Fixed


> ---
> +        /* Notify the subscription will be no longer valid */
> 
> I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> subscription is still valid actually, but just disabled).
Fixed. Also, I've made this sentence past one, because of the code place
change below.

 
> ---
> +        /* Notify the subscription will be no longer valid */
> +        ereport(LOG,
> +                        errmsg("logical replication subscription
> \"%s\" will be disabled due to an error",
> +                                   MySubscription->name));
> +
> 
> I think we can report the log at the end of this function rather than during the
> transaction.
Fixed. In this case, I needed to adjust the comment to indicate the processing
to disable the sub has *completed* as well.

> ---
> +my $cmd = qq(
> +CREATE TABLE tbl (i INT);
> +ALTER TABLE tbl REPLICA IDENTITY FULL;
> +CREATE INDEX tbl_idx ON tbl(i));
> 
> I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> notupdate/delete.
> 
> Do we need tbl_idx?
Removed both the replica identity and tbl_idx;


> ---
> +$cmd = qq(
> +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> +sr.srsubstate IN ('s', 'r'));
> +$node_subscriber->poll_query_until('postgres', $cmd);
> 
> It seems better to add a condition of srrelid just in case.
Makes sense. Fixed.


> ---
> +$cmd = qq(
> +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> s.subname =
> +'sub' AND s.subenabled IS FALSE);
> +$node_subscriber->poll_query_until('postgres', $cmd)
> +  or die "Timed out while waiting for subscriber to be disabled";
> 
> I think that it's more natural to directly check the subscription's subenabled.
> For example:
> 
> SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
Fixed. I modified another code similar to this for tablesync error as well.


> ---
> +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> COUNT(1)
> += 3 FROM tbl WHERE i = 3);
> +$node_subscriber->poll_query_until('postgres', $cmd)
> +  or die "Timed out while waiting for applying";
> 
> I think it's better to wait for the subscriber to catch up and check the query
> result instead of using poll_query_until() so that we can check the query result
> in case where the test fails.
I modified the code to wait for the subscriber and deleted poll_query_until.
Also, when I consider the test failure for this test as you mentioned,
expecting and checking the number of return value that equals 3
would be better. So, I expressed this point in my test as well,
according to your advice.


> ---
> +$cmd = qq(DROP INDEX tbl_unique);
> +$node_subscriber->safe_psql('postgres', $cmd);
> 
> In the newly added tap tests, all queries are consistently assigned to $cmd and
> executed even when the query is used only once. It seems a different style from
> the one in other tap tests. Is there any reason why we do this style for all queries
> in the tap tests?
Fixed. I removed the 'cmd' variable itself.


Attached an updated patch v26.

Best Regards,
    Takamichi Osumi


Attachment

Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Wed, Mar 2, 2022 at 6:38 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > message handling while holding interrupts? That is,
> >
> > HOLD_INTERRUPTS();
> > EmitErrorReport();
> > FlushErrorState();
> > AbortOutOfAny Transaction();
> > RESUME_INTERRUPTS();
> > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > er());
> >
> > I think it's better that we do clean up first and then do other works such as
> > sending the message to the stats collector and updating the catalog.
> I agree. Fixed. Along with this change, I corrected the header comment of
> DisableSubscriptionOnError, too.
>
>
> > Here are some comments on v24 patch:
> >
> > +        /* Look up our subscription in the catalogs */
> > +        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > +
> > CStringGetDatum(MySubscription->name));
> >
> > s/catalogs/catalog/
> >
> > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> Changed.
>
>
> > ---
> > +        if (!HeapTupleIsValid(tup))
> > +                ereport(ERROR,
> > +                                errcode(ERRCODE_UNDEFINED_OBJECT),
> > +                                errmsg("subscription \"%s\" does not
> > exist",
> > +                                           MySubscription->name));
> >
> > I think we should use elog() here rather than ereport() since it's a
> > should-not-happen error.
> Fixed
>
>
> > ---
> > +        /* Notify the subscription will be no longer valid */
> >
> > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > subscription is still valid actually, but just disabled).
> Fixed. Also, I've made this sentence past one, because of the code place
> change below.
>
>
> > ---
> > +        /* Notify the subscription will be no longer valid */
> > +        ereport(LOG,
> > +                        errmsg("logical replication subscription
> > \"%s\" will be disabled due to an error",
> > +                                   MySubscription->name));
> > +
> >
> > I think we can report the log at the end of this function rather than during the
> > transaction.
> Fixed. In this case, I needed to adjust the comment to indicate the processing
> to disable the sub has *completed* as well.
>
> > ---
> > +my $cmd = qq(
> > +CREATE TABLE tbl (i INT);
> > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > +CREATE INDEX tbl_idx ON tbl(i));
> >
> > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > notupdate/delete.
> >
> > Do we need tbl_idx?
> Removed both the replica identity and tbl_idx;
>
>
> > ---
> > +$cmd = qq(
> > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > +sr.srsubstate IN ('s', 'r'));
> > +$node_subscriber->poll_query_until('postgres', $cmd);
> >
> > It seems better to add a condition of srrelid just in case.
> Makes sense. Fixed.
>
>
> > ---
> > +$cmd = qq(
> > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > s.subname =
> > +'sub' AND s.subenabled IS FALSE);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > +  or die "Timed out while waiting for subscriber to be disabled";
> >
> > I think that it's more natural to directly check the subscription's subenabled.
> > For example:
> >
> > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> Fixed. I modified another code similar to this for tablesync error as well.
>
>
> > ---
> > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > COUNT(1)
> > += 3 FROM tbl WHERE i = 3);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > +  or die "Timed out while waiting for applying";
> >
> > I think it's better to wait for the subscriber to catch up and check the query
> > result instead of using poll_query_until() so that we can check the query result
> > in case where the test fails.
> I modified the code to wait for the subscriber and deleted poll_query_until.
> Also, when I consider the test failure for this test as you mentioned,
> expecting and checking the number of return value that equals 3
> would be better. So, I expressed this point in my test as well,
> according to your advice.
>
>
> > ---
> > +$cmd = qq(DROP INDEX tbl_unique);
> > +$node_subscriber->safe_psql('postgres', $cmd);
> >
> > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > executed even when the query is used only once. It seems a different style from
> > the one in other tap tests. Is there any reason why we do this style for all queries
> > in the tap tests?
> Fixed. I removed the 'cmd' variable itself.
>
>
> Attached an updated patch v26.

Thank you for updating the patch.

Here are some comments on v26 patch:

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we
move it to pg_subscritpion.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.

---
+                /*
+                 * First, ensure that we log the error message so
that it won't be
+                 * lost if some other internal error occurs in the
following code.
+                 * Then, abort the current transaction and send the
stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+                'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+        "1|3",
+        "subscription sub replicated data");
+

Can we store the result to a local variable to check? I think it's
more consistent with other tap tests.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment
On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 6:38 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > > message handling while holding interrupts? That is,
> > >
> > > HOLD_INTERRUPTS();
> > > EmitErrorReport();
> > > FlushErrorState();
> > > AbortOutOfAny Transaction();
> > > RESUME_INTERRUPTS();
> > > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > > er());
> > >
> > > I think it's better that we do clean up first and then do other works such as
> > > sending the message to the stats collector and updating the catalog.
> > I agree. Fixed. Along with this change, I corrected the header comment of
> > DisableSubscriptionOnError, too.
> >
> >
> > > Here are some comments on v24 patch:
> > >
> > > +        /* Look up our subscription in the catalogs */
> > > +        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > > +
> > > CStringGetDatum(MySubscription->name));
> > >
> > > s/catalogs/catalog/
> > >
> > > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> > Changed.
> >
> >
> > > ---
> > > +        if (!HeapTupleIsValid(tup))
> > > +                ereport(ERROR,
> > > +                                errcode(ERRCODE_UNDEFINED_OBJECT),
> > > +                                errmsg("subscription \"%s\" does not
> > > exist",
> > > +                                           MySubscription->name));
> > >
> > > I think we should use elog() here rather than ereport() since it's a
> > > should-not-happen error.
> > Fixed
> >
> >
> > > ---
> > > +        /* Notify the subscription will be no longer valid */
> > >
> > > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > > subscription is still valid actually, but just disabled).
> > Fixed. Also, I've made this sentence past one, because of the code place
> > change below.
> >
> >
> > > ---
> > > +        /* Notify the subscription will be no longer valid */
> > > +        ereport(LOG,
> > > +                        errmsg("logical replication subscription
> > > \"%s\" will be disabled due to an error",
> > > +                                   MySubscription->name));
> > > +
> > >
> > > I think we can report the log at the end of this function rather than during the
> > > transaction.
> > Fixed. In this case, I needed to adjust the comment to indicate the processing
> > to disable the sub has *completed* as well.
> >
> > > ---
> > > +my $cmd = qq(
> > > +CREATE TABLE tbl (i INT);
> > > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > > +CREATE INDEX tbl_idx ON tbl(i));
> > >
> > > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > > notupdate/delete.
> > >
> > > Do we need tbl_idx?
> > Removed both the replica identity and tbl_idx;
> >
> >
> > > ---
> > > +$cmd = qq(
> > > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > > +sr.srsubstate IN ('s', 'r'));
> > > +$node_subscriber->poll_query_until('postgres', $cmd);
> > >
> > > It seems better to add a condition of srrelid just in case.
> > Makes sense. Fixed.
> >
> >
> > > ---
> > > +$cmd = qq(
> > > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > > s.subname =
> > > +'sub' AND s.subenabled IS FALSE);
> > > +$node_subscriber->poll_query_until('postgres', $cmd)
> > > +  or die "Timed out while waiting for subscriber to be disabled";
> > >
> > > I think that it's more natural to directly check the subscription's subenabled.
> > > For example:
> > >
> > > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> > Fixed. I modified another code similar to this for tablesync error as well.
> >
> >
> > > ---
> > > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > > COUNT(1)
> > > += 3 FROM tbl WHERE i = 3);
> > > +$node_subscriber->poll_query_until('postgres', $cmd)
> > > +  or die "Timed out while waiting for applying";
> > >
> > > I think it's better to wait for the subscriber to catch up and check the query
> > > result instead of using poll_query_until() so that we can check the query result
> > > in case where the test fails.
> > I modified the code to wait for the subscriber and deleted poll_query_until.
> > Also, when I consider the test failure for this test as you mentioned,
> > expecting and checking the number of return value that equals 3
> > would be better. So, I expressed this point in my test as well,
> > according to your advice.
> >
> >
> > > ---
> > > +$cmd = qq(DROP INDEX tbl_unique);
> > > +$node_subscriber->safe_psql('postgres', $cmd);
> > >
> > > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > > executed even when the query is used only once. It seems a different style from
> > > the one in other tap tests. Is there any reason why we do this style for all queries
> > > in the tap tests?
> > Fixed. I removed the 'cmd' variable itself.
> >
> >
> > Attached an updated patch v26.
>
> Thank you for updating the patch.
>
> Here are some comments on v26 patch:
>
> +/*
> + * Disable the current subscription.
> + */
> +static void
> +DisableSubscriptionOnError(void)
>
> This function now just updates the pg_subscription catalog so can we
> move it to pg_subscritpion.c while having this function accept the
> subscription OID to disable? If you agree, the function comment will
> also need to be updated.
>
> ---
> +                /*
> +                 * First, ensure that we log the error message so
> that it won't be
> +                 * lost if some other internal error occurs in the
> following code.
> +                 * Then, abort the current transaction and send the
> stats message of
> +                 * the table synchronization failure in an idle state.
> +                 */
> +                HOLD_INTERRUPTS();
> +                EmitErrorReport();
> +                FlushErrorState();
> +                AbortOutOfAnyTransaction();
> +                RESUME_INTERRUPTS();
> +                pgstat_report_subscription_error(MySubscription->oid, false);
> +
> +                if (MySubscription->disableonerr)
> +                {
> +                        DisableSubscriptionOnError();
> +                        proc_exit(0);
> +                }
> +
> +                PG_RE_THROW();
>
> If the disableonerr is false, the same error is reported twice. Also,
> the code in PG_CATCH() in both start_apply() and start_table_sync()
> are almost the same. Can we create a common function to do post-error
> processing?
>
> The worker should exit with return code 1.
>
> I've attached a fixup patch for changes to worker.c for your
> reference. Feel free to adopt the changes.

The way that common function is implemented required removal of the
existing PG_RE_THROW logic, which in turn was only possible using
special knowledge that this just happens to be the last try/catch
block for the apply worker. Yes, I believe everything will work ok,
but it just seemed like a step too far for me to change the throw
logic. I feel that once you get to the point of having to write
special comments in the code to explain "why we can get away with
doing this..." then that is an indication that perhaps it's not really
the best way...

Is there some alternative way to share common code, but without having
to change the existing throw error logic to do so?

OTOH, maybe others think it is ok?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Optionally automatically disable logical replication subscriptions on error

From
"shiy.fnst@fujitsu.com"
Date:
On Wed, Mar 2, 2022 5:39 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
> 
> Attached an updated patch v26.
> 

Thanks for your patch. A comment on the document.

@@ -7771,6 +7771,16 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subdisableonerr</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the subscription will be disabled if one of its workers
+       detects an error
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
        <structfield>subconninfo</structfield> <type>text</type>
       </para>
       <para>

The document for "subdisableonerr" option is placed after "The following
parameters control what happens during subscription creation: ". I think it
should be placed after "The following parameters control the subscription's
replication behavior after it has been created: ", right?

Regards,
Shi yu

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Friday, March 4, 2022 3:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch.
> 
> Here are some comments on v26 patch:
Thank you for your review !



> +/*
> + * Disable the current subscription.
> + */
> +static void
> +DisableSubscriptionOnError(void)
> 
> This function now just updates the pg_subscription catalog so can we move it
> to pg_subscritpion.c while having this function accept the subscription OID to
> disable? If you agree, the function comment will also need to be updated.
Agreed. Fixed.


> ---
> +                /*
> +                 * First, ensure that we log the error message so
> that it won't be
> +                 * lost if some other internal error occurs in the
> following code.
> +                 * Then, abort the current transaction and send the
> stats message of
> +                 * the table synchronization failure in an idle state.
> +                 */
> +                HOLD_INTERRUPTS();
> +                EmitErrorReport();
> +                FlushErrorState();
> +                AbortOutOfAnyTransaction();
> +                RESUME_INTERRUPTS();
> +                pgstat_report_subscription_error(MySubscription->oid,
> + false);
> +
> +                if (MySubscription->disableonerr)
> +                {
> +                        DisableSubscriptionOnError();
> +                        proc_exit(0);
> +                }
> +
> +                PG_RE_THROW();
> 
> If the disableonerr is false, the same error is reported twice. Also, the code in
> PG_CATCH() in both start_apply() and start_table_sync() are almost the same.
> Can we create a common function to do post-error processing?
Yes. Also, calling PG_RE_THROW wasn't appropriate,
because in the previous v26, for the second error you mentioned,
the patch didn't call errstart when disable_on_error = false.
This was introduced by recent patch refactoring around this code and the rebase
of this patch, but has been fixed by your suggestion.


> The worker should exit with return code 1.
> I've attached a fixup patch for changes to worker.c for your reference. Feel free
> to adopt the changes.
Yes. I adopted almost all of your suggestion.
One thing I fixed was a comment that mentioned table sync
in worker_post_error_processing(), because start_apply()
also uses the function.


> 
> ---
> +
> +# Confirm that we have finished the table sync.
> +is( $node_subscriber->safe_psql(
> +                'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
> +        "1|3",
> +        "subscription sub replicated data");
> +
> 
> Can we store the result to a local variable to check? I think it's more consistent
> with other tap tests.
Agreed. Fixed.


Attached the v27. Kindly review the patch.


Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, March 7, 2022 12:01 PM Shi, Yu/侍 雨 <shiy.fnst@fujitsu.com> wrote:
> On Wed, Mar 2, 2022 5:39 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Attached an updated patch v26.
> >
> 
> Thanks for your patch. A comment on the document.
Hi, thank you for checking my patch !


> @@ -7771,6 +7771,16 @@ SCRAM-SHA-256$<replaceable><iteration
> count></replaceable>:<replaceable>&l
> 
>       <row>
>        <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>subdisableonerr</structfield> <type>bool</type>
> +      </para>
> +      <para>
> +       If true, the subscription will be disabled if one of its workers
> +       detects an error
> +      </para></entry>
> +     </row>
> +
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
>         <structfield>subconninfo</structfield> <type>text</type>
>        </para>
>        <para>
> 
> The document for "subdisableonerr" option is placed after "The following
> parameters control what happens during subscription creation: ". I think it
> should be placed after "The following parameters control the subscription's
> replication behavior after it has been created: ", right?
Addressed your comment for create_subscription.sgml
(not for catalogs.sgml).

Attached an updated patch v28.


Best Regards,
    Takamichi Osumi


Attachment
On Mon, Mar 7, 2022 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > ---
> > +                /*
> > +                 * First, ensure that we log the error message so
> > that it won't be
> > +                 * lost if some other internal error occurs in the
> > following code.
> > +                 * Then, abort the current transaction and send the
> > stats message of
> > +                 * the table synchronization failure in an idle state.
> > +                 */
> > +                HOLD_INTERRUPTS();
> > +                EmitErrorReport();
> > +                FlushErrorState();
> > +                AbortOutOfAnyTransaction();
> > +                RESUME_INTERRUPTS();
> > +                pgstat_report_subscription_error(MySubscription->oid, false);
> > +
> > +                if (MySubscription->disableonerr)
> > +                {
> > +                        DisableSubscriptionOnError();
> > +                        proc_exit(0);
> > +                }
> > +
> > +                PG_RE_THROW();
> >
> > If the disableonerr is false, the same error is reported twice. Also,
> > the code in PG_CATCH() in both start_apply() and start_table_sync()
> > are almost the same. Can we create a common function to do post-error
> > processing?
> >
> > The worker should exit with return code 1.
> >
> > I've attached a fixup patch for changes to worker.c for your
> > reference. Feel free to adopt the changes.
>
> The way that common function is implemented required removal of the
> existing PG_RE_THROW logic, which in turn was only possible using
> special knowledge that this just happens to be the last try/catch
> block for the apply worker.
>

I think we should re_throw the error in case we have not handled it by
disabling the subscription (in which case we can exit with success
code (0)).

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, March 7, 2022 5:45 PM Amit Kaila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 7, 2022 at 4:55 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > ---
> > > +                /*
> > > +                 * First, ensure that we log the error message so
> > > that it won't be
> > > +                 * lost if some other internal error occurs in the
> > > following code.
> > > +                 * Then, abort the current transaction and send the
> > > stats message of
> > > +                 * the table synchronization failure in an idle state.
> > > +                 */
> > > +                HOLD_INTERRUPTS();
> > > +                EmitErrorReport();
> > > +                FlushErrorState();
> > > +                AbortOutOfAnyTransaction();
> > > +                RESUME_INTERRUPTS();
> > > +
> > > + pgstat_report_subscription_error(MySubscription->oid, false);
> > > +
> > > +                if (MySubscription->disableonerr)
> > > +                {
> > > +                        DisableSubscriptionOnError();
> > > +                        proc_exit(0);
> > > +                }
> > > +
> > > +                PG_RE_THROW();
> > >
> > > If the disableonerr is false, the same error is reported twice.
> > > Also, the code in PG_CATCH() in both start_apply() and
> > > start_table_sync() are almost the same. Can we create a common
> > > function to do post-error processing?
> > >
> > > The worker should exit with return code 1.
> > >
> > > I've attached a fixup patch for changes to worker.c for your
> > > reference. Feel free to adopt the changes.
> >
> > The way that common function is implemented required removal of the
> > existing PG_RE_THROW logic, which in turn was only possible using
> > special knowledge that this just happens to be the last try/catch
> > block for the apply worker.
> >
> 
> I think we should re_throw the error in case we have not handled it by disabling
> the subscription (in which case we can exit with success code (0)).
Agreed. Fixed the patch so that it use re_throw.

Another point I changed from v28 is the order
to call AbortOutOfAnyTransaction and FlushErrorState,
which now is more aligned with other places.

Kindly check the attached v29.

Best Regards,
    Takamichi Osumi


Attachment
Please find below some review comments for v29.

======

1. src/backend/replication/logical/worker.c - worker_post_error_processing

+/*
+ * Abort and cleanup the current transaction, then do post-error processing.
+ * This function must be called in a PG_CATCH() block.
+ */
+static void
+worker_post_error_processing(void)

The function comment and function name are too vague/generic. I guess
this is a hang-over from Sawada-san's proposed patch, but now since
this is only called when disabling the subscription so both the
comment and the function name should say that's what it is doing...

e.g. rename to DisableSubscriptionOnError() or something similar.

~~~

2. src/backend/replication/logical/worker.c - worker_post_error_processing

+ /* Notify the subscription has been disabled */
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled
due to an error",
+    MySubscription->name));

  proc_exit(0);
 }

I know this is common code, but IMO it would be better to do the
proc_exit(0); from the caller in the PG_CATCH. Then I think the code
will be much easier to read the throw/exit logic, rather than now
where it is just calling some function that never returns...

Alternatively, if you want the code how it is, then the function name
should give some hint that it is never going to return - e.g.
DisableSubscriptionOnErrorAndExit)

~~~

3. src/backend/replication/logical/worker.c - start_table_sync

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed during table synchronization */
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ PG_RE_THROW();
+ }

(This is a repeat of a previous comment from [1] comment #2)

I felt the separation of those 2 statements and comments makes the
code less clean than it could/should be. IMO they should be grouped
together.

SUGGESTED

/*
* Report the worker failed during table synchronization. Abort the
* current transaction so that the stats message is sent in an idle
* state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

~~~

4. src/backend/replication/logical/worker.c - start_apply

+ {
+ /*
+ * Abort the current transaction so that we send the stats message
+ * in an idle state.
+ */
+ AbortOutOfAnyTransaction();
+
+ /* Report the worker failed while applying changes */
+ pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
+
+ PG_RE_THROW();
+ }

(same as #3 but comment says "while applying changes")

SUGGESTED

/*
* Report the worker failed while applying changing. Abort the current
* transaction so that the stats message is sent in an idle state.
*/
AbortOutOfAnyTransaction();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

------
[1] https://www.postgresql.org/message-id/CAHut%2BPucrizJpqhSyD7dKj1yRkNMskqmiekD_RRqYpdDdusMRQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Tue, Mar 8, 2022 at 9:37 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find below some review comments for v29.
>
> ======
>
> 1. src/backend/replication/logical/worker.c - worker_post_error_processing
>
> +/*
> + * Abort and cleanup the current transaction, then do post-error processing.
> + * This function must be called in a PG_CATCH() block.
> + */
> +static void
> +worker_post_error_processing(void)
>
> The function comment and function name are too vague/generic. I guess
> this is a hang-over from Sawada-san's proposed patch, but now since
> this is only called when disabling the subscription so both the
> comment and the function name should say that's what it is doing...
>
> e.g. rename to DisableSubscriptionOnError() or something similar.
>
> ~~~
>
> 2. src/backend/replication/logical/worker.c - worker_post_error_processing
>
> + /* Notify the subscription has been disabled */
> + ereport(LOG,
> + errmsg("logical replication subscription \"%s\" has been be disabled
> due to an error",
> +    MySubscription->name));
>
>   proc_exit(0);
>  }
>
> I know this is common code, but IMO it would be better to do the
> proc_exit(0); from the caller in the PG_CATCH. Then I think the code
> will be much easier to read the throw/exit logic, rather than now
> where it is just calling some function that never returns...
>
> Alternatively, if you want the code how it is, then the function name
> should give some hint that it is never going to return - e.g.
> DisableSubscriptionOnErrorAndExit)
>

I think we are already in error so maybe it is better to name it as
DisableSubscriptionAndExit.

Few other comments:
=================
1.
DisableSubscription()
{
..
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Why do we need AccessExclusiveLock here? The Alter/Drop Subscription
takes AccessExclusiveLock, so AccessShareLock should be sufficient
unless we have a reason to use AccessExclusiveLock lock. The other
similar usages in this file (pg_subscription.c) also take
AccessShareLock.

2. Shall we mention this feature in conflict handling docs [1]:
Now:
To skip the transaction, the subscription needs to be disabled
temporarily by ALTER SUBSCRIPTION ... DISABLE first.

After:
To skip the transaction, the subscription needs to be disabled
temporarily by ALTER SUBSCRIPTION ... DISABLE first or alternatively,
the subscription can be used with the disable_on_error option.

Feel free to use something on the above lines, if you agree.

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, March 8, 2022 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 8, 2022 at 9:37 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find below some review comments for v29.
> >
> > ======
> >
> > 1. src/backend/replication/logical/worker.c -
> > worker_post_error_processing
> >
> > +/*
> > + * Abort and cleanup the current transaction, then do post-error processing.
> > + * This function must be called in a PG_CATCH() block.
> > + */
> > +static void
> > +worker_post_error_processing(void)
> >
> > The function comment and function name are too vague/generic. I guess
> > this is a hang-over from Sawada-san's proposed patch, but now since
> > this is only called when disabling the subscription so both the
> > comment and the function name should say that's what it is doing...
> >
> > e.g. rename to DisableSubscriptionOnError() or something similar.
> >
> > ~~~
> >
> > 2. src/backend/replication/logical/worker.c -
> > worker_post_error_processing
> >
> > + /* Notify the subscription has been disabled */ ereport(LOG,
> > + errmsg("logical replication subscription \"%s\" has been be disabled
> > due to an error",
> > +    MySubscription->name));
> >
> >   proc_exit(0);
> >  }
> >
> > I know this is common code, but IMO it would be better to do the
> > proc_exit(0); from the caller in the PG_CATCH. Then I think the code
> > will be much easier to read the throw/exit logic, rather than now
> > where it is just calling some function that never returns...
> >
> > Alternatively, if you want the code how it is, then the function name
> > should give some hint that it is never going to return - e.g.
> > DisableSubscriptionOnErrorAndExit)
> >
> 
> I think we are already in error so maybe it is better to name it as
> DisableSubscriptionAndExit.
OK. Renamed.


 
> Few other comments:
> =================
> 1.
> DisableSubscription()
> {
> ..
> +
> + LockSharedObject(SubscriptionRelationId, subid, 0,
> + AccessExclusiveLock);
> 
> Why do we need AccessExclusiveLock here? The Alter/Drop Subscription
> takes AccessExclusiveLock, so AccessShareLock should be sufficient unless
> we have a reason to use AccessExclusiveLock lock. The other similar usages in
> this file (pg_subscription.c) also take AccessShareLock.
Fixed.

 
> 2. Shall we mention this feature in conflict handling docs [1]:
> Now:
> To skip the transaction, the subscription needs to be disabled temporarily by
> ALTER SUBSCRIPTION ... DISABLE first.
> 
> After:
> To skip the transaction, the subscription needs to be disabled temporarily by
> ALTER SUBSCRIPTION ... DISABLE first or alternatively, the subscription can
> be used with the disable_on_error option.
> 
> Feel free to use something on the above lines, if you agree.
Agreed. Fixed.

At the same time, the attached v30 has incorporated
some rebase results of recent commit(d3e8368)
so that start_table_sync allocates the origin names
in long-lived context. Accoring to this, I modified
some comments on this function.

I made some comments for sending stats in
start_table_sync and start_apply united and concise,
which were pointed out by Peter Smith in [1].

[1] - https://www.postgresql.org/message-id/CAHut%2BPs3b8HjsVyo-Aygtnxbw1PiVOC9nvrN6dKxYtS4C8s%2Bgw%40mail.gmail.com

Kindly have a look at v30.

Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, March 8, 2022 1:07 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Please find below some review comments for v29.
Thank you for your comments !


 
> ======
> 
> 1. src/backend/replication/logical/worker.c - worker_post_error_processing
> 
> +/*
> + * Abort and cleanup the current transaction, then do post-error processing.
> + * This function must be called in a PG_CATCH() block.
> + */
> +static void
> +worker_post_error_processing(void)
> 
> The function comment and function name are too vague/generic. I guess this is
> a hang-over from Sawada-san's proposed patch, but now since this is only
> called when disabling the subscription so both the comment and the function
> name should say that's what it is doing...
> 
> e.g. rename to DisableSubscriptionOnError() or something similar.
Fixed the comments and the function name in v30 shared in [1].




> ~~~
> 
> 2. src/backend/replication/logical/worker.c - worker_post_error_processing
> 
> + /* Notify the subscription has been disabled */ ereport(LOG,
> + errmsg("logical replication subscription \"%s\" has been be disabled
> due to an error",
> +    MySubscription->name));
> 
>   proc_exit(0);
>  }
> 
> I know this is common code, but IMO it would be better to do the proc_exit(0);
> from the caller in the PG_CATCH. Then I think the code will be much easier to
> read the throw/exit logic, rather than now where it is just calling some function
> that never returns...
> 
> Alternatively, if you want the code how it is, then the function name should give
> some hint that it is never going to return - e.g.
> DisableSubscriptionOnErrorAndExit)
I renamed it to DisableSubscriptionAndExit in the end
according to the discussion.



> ~~~
> 
> 3. src/backend/replication/logical/worker.c - start_table_sync
> 
> + {
> + /*
> + * Abort the current transaction so that we send the stats message
> + * in an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, false);
> +
> + PG_RE_THROW();
> + }
> 
> (This is a repeat of a previous comment from [1] comment #2)
> 
> I felt the separation of those 2 statements and comments makes the code less
> clean than it could/should be. IMO they should be grouped together.
> 
> SUGGESTED
> 
> /*
> * Report the worker failed during table synchronization. Abort the
> * current transaction so that the stats message is sent in an idle
> * state.
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> er());
Fixed.



> ~~~
> 
> 4. src/backend/replication/logical/worker.c - start_apply
> 
> + {
> + /*
> + * Abort the current transaction so that we send the stats message
> + * in an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed while applying changes */
> + pgstat_report_subscription_error(MySubscription->oid,
> + !am_tablesync_worker());
> +
> + PG_RE_THROW();
> + }
> 
> (same as #3 but comment says "while applying changes")
> 
> SUGGESTED
> 
> /*
> * Report the worker failed while applying changing. Abort the current
> * transaction so that the stats message is sent in an idle state.
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> er());
Fixed. I choose the woring "while applying changes" which you mentioned first
and sounds more natural.


[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373B74627C6BAF2F146D779ED099%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Kindly have a look at v30.
>

Review comments:
===============
1.
+ ereport(LOG,
+ errmsg("logical replication subscription \"%s\" has been be disabled
due to an error",

Typo.
/been be/been

2. Is there a reason the patch doesn't allow workers to restart via
maybe_reread_subscription() when this new option is changed, if so,
then let's add a comment for the same? We currently seem to be
restarting the worker on any change via Alter Subscription. If we
decide to change it for this option as well then I think we need to
accordingly update the current comment: "Exit if any parameter that
affects the remote connection was changed." to something like "Exit if
any parameter that affects the remote connection or a subscription
option was changed..."

3.
  if (fout->remoteVersion >= 150000)
- appendPQExpBufferStr(query, " s.subtwophasestate\n");
+ appendPQExpBufferStr(query, " s.subtwophasestate,\n");
  else
  appendPQExpBuffer(query,
-   " '%c' AS subtwophasestate\n",
+   " '%c' AS subtwophasestate,\n",
    LOGICALREP_TWOPHASE_STATE_DISABLED);

+ if (fout->remoteVersion >= 150000)
+ appendPQExpBuffer(query, " s.subdisableonerr\n");
+ else
+ appendPQExpBuffer(query,
+   " false AS subdisableonerr\n");

It is better to combine these parameters. I see there is a similar
coding pattern for 14 but I think that is not required.

4.
+$node_subscriber->safe_psql('postgres', qq(ALTER SUBSCRIPTION sub ENABLE));
+
+# Wait for the data to replicate.
+$node_subscriber->poll_query_until(
+ 'postgres', qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r') AND sr.srrelid = 'tbl'::regclass));

See other scripts like t/015_stream.pl and wait for data replication
in the same way. I think there are two things to change: (a) In the
above query, we use NOT IN at other places (b) use
$node_publisher->wait_for_catchup before this query.

-- 
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Tue, Mar 8, 2022 at 5:07 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Kindly have a look at v30.

Thank you for updating the patch. Here are some comments:

+   /*
+    * Allocate the origin name in long-lived context for error context
+    * message.
+    */
+   ReplicationOriginNameForTablesync(MySubscription->oid,
+                                     MyLogicalRepWorker->relid,
+                                     originname,
+                                     sizeof(originname));
+   apply_error_callback_arg.origin_name = MemoryContextStrdup(ApplyContext,
+                                                              originname);

I think it's better to set apply_error_callback_arg.origin_name in the
caller rather than in start_table_sync(). Apply workers set
apply_error_callback_arg.origin_name there and it's not necessarily
necessary to do that in this function.

Even if we want to do that, I think it's not necessary to pass
originname to start_table_sync(). It's a local variable and used only
to temporarily store the tablesync worker's origin name.

---
It might have already been discussed but the worker disables the
subscription on an error but doesn't work for a fatal. Is that
expected or should we handle that too?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> ---
> It might have already been discussed but the worker disables the
> subscription on an error but doesn't work for a fatal. Is that
> expected or should we handle that too?
>

I am not too sure about handling FATALs with this feature because this
is mainly to aid in resolving conflicts due to various constraints. It
might be okay to retry in case of FATAL which is possibly due to some
system resource error. OTOH, if we see that it will be good to disable
for a FATAL error as well then I think we can use
PG_ENSURE_ERROR_CLEANUP construct. What do you think?

-- 
With Regards,
Amit Kapila.



On Tue, Mar 8, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Kindly have a look at v30.
> >
>
> Review comments:
> ===============
>
Few comments on test script:
=======================
1.
+# This tests the uniqueness violation will cause the subscription
+# to fail during initial synchronization and make it disabled.

/This tests the/This tests that the

2.
+$node_publisher->safe_psql('postgres',
+ qq(CREATE PUBLICATION pub FOR TABLE tbl));
+$node_subscriber->safe_psql(
+ 'postgres', qq(
+CREATE SUBSCRIPTION sub
+CONNECTION '$publisher_connstr'
+PUBLICATION pub WITH (disable_on_error = true)));

Please check other test scripts like t/015_stream.pl or
t/028_row_filter.pl and keep the indentation of these commands
similar. It looks odd and inconsistent with other tests. Also, we can
use double-quotes instead of qq so as to be consistent with other
scripts. Please check other similar places and make them consistent
with other test script files.

3.
+# Initial synchronization failure causes the subscription
+# to be disabled.

Here and in other places in test scripts, the comment lines seem too
short to me. Normally, we can keep it at the 80 char limit but this
appears too short.

4.
+# Delete the data from the subscriber and recreate the unique index.
+$node_subscriber->safe_psql(
+ 'postgres', q(
+DELETE FROM tbl;
+CREATE UNIQUE INDEX tbl_unique ON tbl (i)));

In other tests, we are executing single statements via safe_psql. I
don't see a problem with this but also don't see a reason to deviate
from the normal pattern.

-- 
With Regards,
Amit Kapila.



Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > ---
> > It might have already been discussed but the worker disables the
> > subscription on an error but doesn't work for a fatal. Is that
> > expected or should we handle that too?
> >
>
> I am not too sure about handling FATALs with this feature because this
> is mainly to aid in resolving conflicts due to various constraints. It
> might be okay to retry in case of FATAL which is possibly due to some
> system resource error. OTOH, if we see that it will be good to disable
> for a FATAL error as well then I think we can use
> PG_ENSURE_ERROR_CLEANUP construct. What do you think?

I think that since FATAL raised by logical replication workers (e.g.,
terminated by DDL or out of memory etc?) is normally not a repeatable
error, it's reasonable to retry in this case.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, Mar 9, 2022 at 11:22 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > ---
> > > It might have already been discussed but the worker disables the
> > > subscription on an error but doesn't work for a fatal. Is that
> > > expected or should we handle that too?
> > >
> >
> > I am not too sure about handling FATALs with this feature because this
> > is mainly to aid in resolving conflicts due to various constraints. It
> > might be okay to retry in case of FATAL which is possibly due to some
> > system resource error. OTOH, if we see that it will be good to disable
> > for a FATAL error as well then I think we can use
> > PG_ENSURE_ERROR_CLEANUP construct. What do you think?
>
> I think that since FATAL raised by logical replication workers (e.g.,
> terminated by DDL or out of memory etc?) is normally not a repeatable
> error, it's reasonable to retry in this case.
>

Yeah, I think we can add a comment in the code for this so that future
readers know that this has been done deliberately.

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 9, 2022 3:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Mar 9, 2022 at 11:22 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 6:29 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > ---
> > > > It might have already been discussed but the worker disables the
> > > > subscription on an error but doesn't work for a fatal. Is that
> > > > expected or should we handle that too?
> > > >
> > >
> > > I am not too sure about handling FATALs with this feature because
> > > this is mainly to aid in resolving conflicts due to various
> > > constraints. It might be okay to retry in case of FATAL which is
> > > possibly due to some system resource error. OTOH, if we see that it
> > > will be good to disable for a FATAL error as well then I think we
> > > can use PG_ENSURE_ERROR_CLEANUP construct. What do you think?
> >
> > I think that since FATAL raised by logical replication workers (e.g.,
> > terminated by DDL or out of memory etc?) is normally not a repeatable
> > error, it's reasonable to retry in this case.
> >
> 
> Yeah, I think we can add a comment in the code for this so that future readers
> know that this has been done deliberately.
OK. I've added some comments in the codes.

The v31 addressed other comments on hackers so far.
(a) brush up the TAP test alignment
(b) fix the place of apply_error_callback_arg.origin_name for table sync worker
(c) modify maybe_reread_subscription to exit, when disable_on_error changes
(d) improve getSubscriptions to combine some branches for v15

Kindly check the attached v31.


Best Regards,
    Takamichi Osumi


Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 9, 2022 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 8, 2022 at 6:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Kindly have a look at v30.
> > >
> >
> > Review comments:
> > ===============
Thank you for reviewing !


> Few comments on test script:
> =======================
> 1.
> +# This tests the uniqueness violation will cause the subscription # to
> +fail during initial synchronization and make it disabled.
> 
> /This tests the/This tests that the
Fixed.


> 2.
> +$node_publisher->safe_psql('postgres',
> + qq(CREATE PUBLICATION pub FOR TABLE tbl));
> +$node_subscriber->safe_psql(  'postgres', qq( CREATE SUBSCRIPTION
> sub
> +CONNECTION '$publisher_connstr'
> +PUBLICATION pub WITH (disable_on_error = true)));
> 
> Please check other test scripts like t/015_stream.pl or t/028_row_filter.pl and
> keep the indentation of these commands similar. It looks odd and inconsistent
> with other tests. Also, we can use double-quotes instead of qq so as to be
> consistent with other scripts. Please check other similar places and make
> them consistent with other test script files.
Fixed the inconsistent indentations within each commands.
Also, replace the qq with double-quotes (except for the is()'s
2nd argument, which is the aligned way to write the tests).



> 3.
> +# Initial synchronization failure causes the subscription # to be
> +disabled.
> 
> Here and in other places in test scripts, the comment lines seem too short to
> me. Normally, we can keep it at the 80 char limit but this appears too short.
Fixed.


> 4.
> +# Delete the data from the subscriber and recreate the unique index.
> +$node_subscriber->safe_psql(
> + 'postgres', q(
> +DELETE FROM tbl;
> +CREATE UNIQUE INDEX tbl_unique ON tbl (i)));
> 
> In other tests, we are executing single statements via safe_psql. I don't see a
> problem with this but also don't see a reason to deviate from the normal
> pattern.
Fixed.


At the same time, I fixed one comment
where I should write "subscriber", not "sub",
since in the entire test file, I express the subscriber
by using the former.


The new patch v31 is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 9, 2022 9:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Mar 8, 2022 at 5:07 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Kindly have a look at v30.
> 
> Thank you for updating the patch. Here are some comments:
Hi, thank you for your review !


> +   /*
> +    * Allocate the origin name in long-lived context for error context
> +    * message.
> +    */
> +   ReplicationOriginNameForTablesync(MySubscription->oid,
> +                                     MyLogicalRepWorker->relid,
> +                                     originname,
> +                                     sizeof(originname));
> +   apply_error_callback_arg.origin_name =
> MemoryContextStrdup(ApplyContext,
> +
> + originname);
> 
> I think it's better to set apply_error_callback_arg.origin_name in the caller
> rather than in start_table_sync(). Apply workers set
> apply_error_callback_arg.origin_name there and it's not necessarily necessary
> to do that in this function.
OK. I made this origin_name logic back to the level of ApplyWorkerMain.


The new patch v31 is shared in [1].


[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regardfs,
    Takamichi Osumi


RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Kindly have a look at v30.
> >
> 
> Review comments:
Thank you for checking !


> ===============
> 1.
> + ereport(LOG,
> + errmsg("logical replication subscription \"%s\" has been be disabled
> due to an error",
> 
> Typo.
> /been be/been
Fixed.

 
> 2. Is there a reason the patch doesn't allow workers to restart via
> maybe_reread_subscription() when this new option is changed, if so, then let's
> add a comment for the same? We currently seem to be restarting the worker on
> any change via Alter Subscription. If we decide to change it for this option as
> well then I think we need to accordingly update the current comment: "Exit if
> any parameter that affects the remote connection was changed." to something
> like "Exit if any parameter that affects the remote connection or a subscription
> option was changed..."
I thought it's ok without the change at the beginning, but I was wrong.
To make this new option aligned with others, I should add one check
for this feature. Fixed.


> 3.
>   if (fout->remoteVersion >= 150000)
> - appendPQExpBufferStr(query, " s.subtwophasestate\n");
> + appendPQExpBufferStr(query, " s.subtwophasestate,\n");
>   else
>   appendPQExpBuffer(query,
> -   " '%c' AS subtwophasestate\n",
> +   " '%c' AS subtwophasestate,\n",
>     LOGICALREP_TWOPHASE_STATE_DISABLED);
> 
> + if (fout->remoteVersion >= 150000)
> + appendPQExpBuffer(query, " s.subdisableonerr\n"); else
> + appendPQExpBuffer(query,
> +   " false AS subdisableonerr\n");
> 
> It is better to combine these parameters. I see there is a similar coding pattern
> for 14 but I think that is not required.
Fixed and combined them together.

 
> 4.
> +$node_subscriber->safe_psql('postgres', qq(ALTER SUBSCRIPTION sub
> +ENABLE));
> +
> +# Wait for the data to replicate.
> +$node_subscriber->poll_query_until(
> + 'postgres', qq(
> +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> +sr.srsubstate IN ('s', 'r') AND sr.srrelid = 'tbl'::regclass));
> 
> See other scripts like t/015_stream.pl and wait for data replication in the same
> way. I think there are two things to change: (a) In the above query, we use NOT
> IN at other places (b) use $node_publisher->wait_for_catchup before this
> query.
Fixed.

The new patch is shared in [1].

[1] -
https://www.postgresql.org/message-id/TYCPR01MB8373824855A6C4D2178027A0ED0A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


Re: Optionally automatically disable logical replication subscriptions on error

From
Masahiko Sawada
Date:
On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
>
>
> > 2. Is there a reason the patch doesn't allow workers to restart via
> > maybe_reread_subscription() when this new option is changed, if so, then let's
> > add a comment for the same? We currently seem to be restarting the worker on
> > any change via Alter Subscription. If we decide to change it for this option as
> > well then I think we need to accordingly update the current comment: "Exit if
> > any parameter that affects the remote connection was changed." to something
> > like "Exit if any parameter that affects the remote connection or a subscription
> > option was changed..."
> I thought it's ok without the change at the beginning, but I was wrong.
> To make this new option aligned with others, I should add one check
> for this feature. Fixed.

Why do we need to restart the apply worker when disable_on_error is
changed? It doesn't affect the remote connection at all. I think it
can be changed without restarting like synchronous_commit option.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



On Wed, Mar 9, 2022 at 2:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Tuesday, March 8, 2022 10:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> >
> >
> > > 2. Is there a reason the patch doesn't allow workers to restart via
> > > maybe_reread_subscription() when this new option is changed, if so, then let's
> > > add a comment for the same? We currently seem to be restarting the worker on
> > > any change via Alter Subscription. If we decide to change it for this option as
> > > well then I think we need to accordingly update the current comment: "Exit if
> > > any parameter that affects the remote connection was changed." to something
> > > like "Exit if any parameter that affects the remote connection or a subscription
> > > option was changed..."
> > I thought it's ok without the change at the beginning, but I was wrong.
> > To make this new option aligned with others, I should add one check
> > for this feature. Fixed.
>
> Why do we need to restart the apply worker when disable_on_error is
> changed? It doesn't affect the remote connection at all. I think it
> can be changed without restarting like synchronous_commit option.
>

oh right, I thought that how will we update its value in
MySubscription after a change but as we re-read the pg_subscription
table for the current subscription and update MySubscription, I feel
we don't need to restart it. I haven't tested it but it should work
without a restart.

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, March 9, 2022 8:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Mar 9, 2022 at 2:21 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 4:33 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > On Tuesday, March 8, 2022 10:23 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > On Tue, Mar 8, 2022 at 1:37 PM osumi.takamichi@fujitsu.com
> > > > <osumi.takamichi@fujitsu.com> wrote:
> > > > >
> > >
> > >
> > > > 2. Is there a reason the patch doesn't allow workers to restart
> > > > via
> > > > maybe_reread_subscription() when this new option is changed, if
> > > > so, then let's add a comment for the same? We currently seem to be
> > > > restarting the worker on any change via Alter Subscription. If we
> > > > decide to change it for this option as well then I think we need
> > > > to accordingly update the current comment: "Exit if any parameter
> > > > that affects the remote connection was changed." to something like
> > > > "Exit if any parameter that affects the remote connection or a
> subscription option was changed..."
> > > I thought it's ok without the change at the beginning, but I was wrong.
> > > To make this new option aligned with others, I should add one check
> > > for this feature. Fixed.
> >
> > Why do we need to restart the apply worker when disable_on_error is
> > changed? It doesn't affect the remote connection at all. I think it
> > can be changed without restarting like synchronous_commit option.
> >
> 
> oh right, I thought that how will we update its value in MySubscription after a
> change but as we re-read the pg_subscription table for the current
> subscription and update MySubscription, I feel we don't need to restart it. I
> haven't tested it but it should work without a restart.
Hi, attached v32 removed my additional code for maybe_reread_subscription.

Also, I judged that we don't need to add a comment for this feature in this patch.
It's because we can interpret this discussion from existing comments and codes.
(1) "Reread subscription info if needed. Most changes will be exit."
    There are some cases we don't exit.
(2) Like "Exit if any parameter that affects the remote connection was changed.",
    readers can understand no exit case matches the disable_on_error option change.

Kindly review the v32.

Best Regards,
    Takamichi Osumi


Attachment
On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hi, attached v32 removed my additional code for maybe_reread_subscription.
>

Thanks, the patch looks good to me. I have made minor edits in the
attached. I am planning to commit this early next week (Monday) unless
there are any other major comments.

-- 
With Regards,
Amit Kapila.

Attachment
On Thu, Mar 10, 2022 at 12:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > Hi, attached v32 removed my additional code for maybe_reread_subscription.
> >
>
> Thanks, the patch looks good to me. I have made minor edits in the
> attached. I am planning to commit this early next week (Monday) unless
> there are any other major comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.



RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, March 14, 2022 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 10, 2022 at 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Wed, Mar 9, 2022 at 7:57 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > Hi, attached v32 removed my additional code for
> maybe_reread_subscription.
> > >
> >
> > Thanks, the patch looks good to me. I have made minor edits in the
> > attached. I am planning to commit this early next week (Monday) unless
> > there are any other major comments.
> >
> 
> Pushed.
Thank you so much !


Best Regards,
    Takamichi Osumi


Re: Optionally automatically disable logical replication subscriptions on error

From
Nathan Bossart
Date:
My compiler is worried that syncslotname may be used uninitialized in
start_table_sync().  The attached patch seems to silence this warning.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Optionally automatically disable logical replication subscriptions on error

From
"osumi.takamichi@fujitsu.com"
Date:
On Tuesday, March 15, 2022 8:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My compiler is worried that syncslotname may be used uninitialized in
> start_table_sync().  The attached patch seems to silence this warning.
Thank you for your reporting !

Your fix looks good to me.


Best Regards,
    Takamichi Osumi