Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: Optionally automatically disable logical replication subscriptions on error
Date
Msg-id 38706E50-9190-4623-9FD7-4E9E3DD3F81A@enterprisedb.com
Whole thread Raw
In response to Re: Optionally automatically disable logical replication subscriptions on error  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers

> On Jun 17, 2021, at 11:34 PM, Peter Smith <smithpb2250@gmail.com> wrote:
>
> I tried your patch.

Thanks for the quick and thorough review!

> (2) New column "disabled_by_error".
>
> I wondered if there was actually any need for this column. Isn't the
> same information conveyed by just having "subenabled" = false, at same
> time as as non-empty "suberrmsg"? This would remove any confusion for
> having 2 booleans which both indicate disabled.

Yeah, I wondered about that before posting v1.  I removed the disabled_by_error field for v2.

> (3) New columns "disabled_by_error", "disabled_on_error".
>
> All other columns of the pg_subscription have a "sub" prefix.

I don't feel strongly about this.  How about "subdisableonerr"?  I used that in v2.

> I did not find any code using that newly added member "errhint".

Thanks for catching that.  I had tried to remove all references to "errhint" before posting v1.  The original idea was
thatboth the message and hint of the error would be kept, but in testing I found the hint field was typically empty, so
Iremoved it.  Sorry that I left one mention of it lying around. 

> (5) dump.c

I didn't bother getting pg_dump working before posting v1, and I still have not done so, as I mainly want to solicit
feedbackon whether the basic direction I am going will work for the community. 

> (6) Patch only handles errors only from the Apply worker.
>
> Tablesync can give similar errors (e.g. PK violation during DATASYNC
> phase) which will trigger re-launch forever regardless of the setting
> of "disabled_on_error".
> (confirmed by observations below)

Yes, this is a good point, and also mentioned by Amit.  I have fixed it in v2 and adjusted the regression test to
triggeran automatic disabling for initial table sync as well as for change replication. 

> 2021-06-18 15:12:45.905 AEST [25904] LOG:  edata is true for
> subscription 'tap_sub': message = "duplicate key value violates unique
> constraint "test_tab_pkey"", hint = "<NONE>"

You didn't call this out, but FYI, I don't intend to leave this particular log message in the patch.  It was for
developmentonly.  I have removed it for v2 and have added a different log message much sooner after catching the error,
toavoid squashing the error in case some other action fails. 

The regression test shows this, if you open tmp_check/log/022_disable_on_error_subscriber.log:

2021-06-18 16:25:20.138 PDT [56926] LOG:  logical replication subscription "s1" will be disabled due to error:
duplicatekey value violates unique constraint "s1_tbl_unique" 
2021-06-18 16:25:20.139 PDT [56926] ERROR:  duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL:  Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT:  COPY tbl, line 2

The first line logs the error prior to attempting to disable the subscription, and the next three lines are due to
rethrowingthe error after committing the successful disabling of the subscription.  If the attempt to disable the
subscriptionitself throws, these additional three lines won't show up, but the first one should.  Amit mentioned this
upthread. Do you think this will be ok, or would you like to also have a suberrdetail field so that the detail doesn't
getlost?  I haven't added such an extra field, and am inclined to think it would be excessive, but maybe others feel
differently?


> ======
>
> Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
> (when disable_on_error = true)
> Observation: This patch changes nothing for this case. The Tablesyn
> re-launchs in a forever loop the same as current functionality.

In v2, tablesync copy errors should also be caught.  The test has been extended to cover this also.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

pgsql-hackers by date:

Previous
From: Greg Sabino Mullane
Date:
Subject: Re: Speed up pg_checksums in cases where checksum already set
Next
From: Michael Paquier
Date:
Subject: Re: pgbench logging broken by time logic changes