Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Optionally automatically disable logical replication subscriptions on error |
Date | |
Msg-id | CAHut+PszB2hpYOMzLXmCjPkd1s49TfmYvFYQD4JmEaWUrzkNnw@mail.gmail.com Whole thread Raw |
In response to | Optionally automatically disable logical replication subscriptions on error (Mark Dilger <mark.dilger@enterprisedb.com>) |
Responses |
Re: Optionally automatically disable logical replication subscriptions on error
|
List | pgsql-hackers |
On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Hackers, > > Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply achange, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log,and restarting. > > As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superuserscreates the potential for even more errors to be triggered, specifically, errors where the apply worker doesnot have permission to make changes to the target table. > > The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", suchthat rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription,breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided,so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the errorafter fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command. > > In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originallyran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker logfile to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to havea test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues. > > I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes. I tried your patch. It applied OK (albeit with whitespace warnings). The code build and TAP tests are all OK. Below are a few comments and observations. COMMENTS ======== (1) PG Docs catalogs.sgml Documented new column "suberrmsg" but did not document the other new columns ("disable_on_error", "disabled_by_error")? ------ (2) New column "disabled_by_error". I wondered if there was actually any need for this column. Isn't the same information conveyed by just having "subenabled" = false, at same time as as non-empty "suberrmsg"? This would remove any confusion for having 2 booleans which both indicate disabled. ------ (3) New columns "disabled_by_error", "disabled_on_error". All other columns of the pg_subscription have a "sub" prefix. ------ (4) errhint member used? @@ -91,12 +100,16 @@ typedef struct Subscription char *name; /* Name of the subscription */ Oid owner; /* Oid of the subscription owner */ bool enabled; /* Indicates if the subscription is enabled */ + bool disable_on_error; /* Whether errors automatically disable */ + bool disabled_by_error; /* Whether an error has disabled */ bool binary; /* Indicates if the subscription wants data in * binary format */ bool stream; /* Allow streaming in-progress transactions. */ char *conninfo; /* Connection string to the publisher */ char *slotname; /* Name of the replication slot */ char *synccommit; /* Synchronous commit setting for worker */ + char *errmsg; /* Message from error which disabled */ + char *errhint; /* Hint from error which disabled */ List *publications; /* List of publication names to subscribe to */ } Subscription; I did not find any code using that newly added member "errhint". ------ (5) dump.c i. No mention of new columns "disabled_on_error" and "disabled_by_error". Is that right? ii. Shouldn't the code for the "suberrmsg" be qualified with some PG version number checks? ------ (6) Patch only handles errors only from the Apply worker. Tablesync can give similar errors (e.g. PK violation during DATASYNC phase) which will trigger re-launch forever regardless of the setting of "disabled_on_error". (confirmed by observations below) ------ (7) TAP test code +$node_subscriber->init(allows_streaming => 'logical'); AFAIK that "logical" configuration is not necessary for the subscriber side. So, $node_subscriber->init(); //////////// Some Experiments/Observations ============================== In general, I found this functionality is useful and it works as advertised by your patch comment. ====== Test: Display pg_subscription with the new columns Observation: As expected. But some new colnames are not prefixed like their peers. test_sub=# \pset x Expanded display is on. test_sub=# select * from pg_subscription; -[ RECORD 1 ]-----+-------------------------------------------------------- oid | 16394 subdbid | 16384 subname | tap_sub subowner | 10 subenabled | t disable_on_error | t disabled_by_error | f subbinary | f substream | f subconninfo | host=localhost dbname=test_pub application_name=tap_sub subslotname | tap_sub subsynccommit | off suberrmsg | subpublications | {tap_pub} ====== Test: Cause a PK violation during normal Apply replication (when "disabled_on_error=true") Observation: Apply worker stops. Subscription is disabled. Error message is in the catalog. 2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for subscription 'tap_sub': message = "duplicate key value violates unique constraint "test_tab_pkey"", hint = "<NONE>" 2021-06-18 15:12:45.905 AEST [25904] LOG: logical replication apply worker for subscription "tap_sub" will stop because the subscription was disabled due to error 2021-06-18 15:12:45.905 AEST [25904] ERROR: duplicate key value violates unique constraint "test_tab_pkey" 2021-06-18 15:12:45.905 AEST [25904] DETAIL: Key (a)=(1) already exists. 2021-06-18 15:12:45.908 AEST [19924] LOG: background worker "logical replication worker" (PID 25904) exited with exit code 1 test_sub=# select * from pg_subscription; -[ RECORD 1 ]-----+--------------------------------------------------------------- oid | 16394 subdbid | 16384 subname | tap_sub subowner | 10 subenabled | f disable_on_error | t disabled_by_error | t subbinary | f substream | f subconninfo | host=localhost dbname=test_pub application_name=tap_sub subslotname | tap_sub subsynccommit | off suberrmsg | duplicate key value violates unique constraint "test_tab_pkey" subpublications | {tap_pub} ====== Test: Try to enable subscription (without fixing the PK violation problem). Observation. OK. It just stops again test_sub=# alter subscription tap_sub enable; ALTER SUBSCRIPTION test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-06-18 15:17:18.078 AEST [10228] LOG: edata is true for subscription 'tap_sub': message = "duplicate key value violates unique constraint "test_tab_pkey"", hint = "<NONE>" 2021-06-18 15:17:18.078 AEST [10228] LOG: logical replication apply worker for subscription "tap_sub" will stop because the subscription was disabled due to error 2021-06-18 15:17:18.078 AEST [10228] ERROR: duplicate key value violates unique constraint "test_tab_pkey" 2021-06-18 15:17:18.078 AEST [10228] DETAIL: Key (a)=(1) already exists. 2021-06-18 15:17:18.079 AEST [19924] LOG: background worker "logical replication worker" (PID 10228) exited with exit code 1 ====== Test: Manually disable the subscription (which had previously already been disabled due to error) Observation: OK. The suberrmsg gets reset to an empty string. alter subscription tap_sub disable; ===== Test: Turn off the disable_on_error Observation: As expected, now the Apply worker goes into re-launch forever loop every time it hits PK violation test_sub=# alter subscription tap_sub set (disable_on_error=false); ALTER SUBSCRIPTION ... ====== Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase. (when disable_on_error = true) Observation: This patch changes nothing for this case. The Tablesyn re-launchs in a forever loop the same as current functionality. test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH (disable_on_error=false); NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-06-18 15:38:19.557 AEST [18207] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-06-18 15:38:19.610 AEST [18207] ERROR: duplicate key value violates unique constraint "test_tab_pkey" 2021-06-18 15:38:19.610 AEST [18207] DETAIL: Key (a)=(1) already exists. 2021-06-18 15:38:19.610 AEST [18207] CONTEXT: COPY test_tab, line 1 2021-06-18 15:38:19.611 AEST [19924] LOG: background worker "logical replication worker" (PID 18207) exited with exit code 1 2021-06-18 15:38:24.634 AEST [18369] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-06-18 15:38:24.689 AEST [18369] ERROR: duplicate key value violates unique constraint "test_tab_pkey" 2021-06-18 15:38:24.689 AEST [18369] DETAIL: Key (a)=(1) already exists. 2021-06-18 15:38:24.689 AEST [18369] CONTEXT: COPY test_tab, line 1 2021-06-18 15:38:24.690 AEST [19924] LOG: background worker "logical replication worker" (PID 18369) exited with exit code 1 2021-06-18 15:38:29.701 AEST [18521] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-06-18 15:38:29.765 AEST [18521] ERROR: duplicate key value violates unique constraint "test_tab_pkey" 2021-06-18 15:38:29.765 AEST [18521] DETAIL: Key (a)=(1) already exists. 2021-06-18 15:38:29.765 AEST [18521] CONTEXT: COPY test_tab, line 1 2021-06-18 15:38:29.766 AEST [19924] LOG: background worker "logical replication worker" (PID 18521) exited with exit code 1 etc... -[ RECORD 1 ]-----+-------------------------------------------------------- oid | 16399 subdbid | 16384 subname | tap_sub subowner | 10 subenabled | t disable_on_error | f disabled_by_error | f subbinary | f substream | f subconninfo | host=localhost dbname=test_pub application_name=tap_sub subslotname | tap_sub subsynccommit | off suberrmsg | subpublications | {tap_pub} ------ Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: