Re: Optionally automatically disable logical replication subscriptions on error - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Optionally automatically disable logical replication subscriptions on error
Date
Msg-id CAD21AoBMxAcbjwrE5Xa9AziXLVMODL4taAGhdEUw+-5AdzDTuA@mail.gmail.com
Whole thread Raw
In response to RE: Optionally automatically disable logical replication subscriptions on error  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
Responses Re: Optionally automatically disable logical replication subscriptions on error  (Peter Smith <smithpb2250@gmail.com>)
RE: Optionally automatically disable logical replication subscriptions on error  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
List pgsql-hackers
On Wed, Mar 2, 2022 at 6:38 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > message handling while holding interrupts? That is,
> >
> > HOLD_INTERRUPTS();
> > EmitErrorReport();
> > FlushErrorState();
> > AbortOutOfAny Transaction();
> > RESUME_INTERRUPTS();
> > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > er());
> >
> > I think it's better that we do clean up first and then do other works such as
> > sending the message to the stats collector and updating the catalog.
> I agree. Fixed. Along with this change, I corrected the header comment of
> DisableSubscriptionOnError, too.
>
>
> > Here are some comments on v24 patch:
> >
> > +        /* Look up our subscription in the catalogs */
> > +        tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > +
> > CStringGetDatum(MySubscription->name));
> >
> > s/catalogs/catalog/
> >
> > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> Changed.
>
>
> > ---
> > +        if (!HeapTupleIsValid(tup))
> > +                ereport(ERROR,
> > +                                errcode(ERRCODE_UNDEFINED_OBJECT),
> > +                                errmsg("subscription \"%s\" does not
> > exist",
> > +                                           MySubscription->name));
> >
> > I think we should use elog() here rather than ereport() since it's a
> > should-not-happen error.
> Fixed
>
>
> > ---
> > +        /* Notify the subscription will be no longer valid */
> >
> > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > subscription is still valid actually, but just disabled).
> Fixed. Also, I've made this sentence past one, because of the code place
> change below.
>
>
> > ---
> > +        /* Notify the subscription will be no longer valid */
> > +        ereport(LOG,
> > +                        errmsg("logical replication subscription
> > \"%s\" will be disabled due to an error",
> > +                                   MySubscription->name));
> > +
> >
> > I think we can report the log at the end of this function rather than during the
> > transaction.
> Fixed. In this case, I needed to adjust the comment to indicate the processing
> to disable the sub has *completed* as well.
>
> > ---
> > +my $cmd = qq(
> > +CREATE TABLE tbl (i INT);
> > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > +CREATE INDEX tbl_idx ON tbl(i));
> >
> > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > notupdate/delete.
> >
> > Do we need tbl_idx?
> Removed both the replica identity and tbl_idx;
>
>
> > ---
> > +$cmd = qq(
> > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > +sr.srsubstate IN ('s', 'r'));
> > +$node_subscriber->poll_query_until('postgres', $cmd);
> >
> > It seems better to add a condition of srrelid just in case.
> Makes sense. Fixed.
>
>
> > ---
> > +$cmd = qq(
> > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > s.subname =
> > +'sub' AND s.subenabled IS FALSE);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > +  or die "Timed out while waiting for subscriber to be disabled";
> >
> > I think that it's more natural to directly check the subscription's subenabled.
> > For example:
> >
> > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> Fixed. I modified another code similar to this for tablesync error as well.
>
>
> > ---
> > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > COUNT(1)
> > += 3 FROM tbl WHERE i = 3);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > +  or die "Timed out while waiting for applying";
> >
> > I think it's better to wait for the subscriber to catch up and check the query
> > result instead of using poll_query_until() so that we can check the query result
> > in case where the test fails.
> I modified the code to wait for the subscriber and deleted poll_query_until.
> Also, when I consider the test failure for this test as you mentioned,
> expecting and checking the number of return value that equals 3
> would be better. So, I expressed this point in my test as well,
> according to your advice.
>
>
> > ---
> > +$cmd = qq(DROP INDEX tbl_unique);
> > +$node_subscriber->safe_psql('postgres', $cmd);
> >
> > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > executed even when the query is used only once. It seems a different style from
> > the one in other tap tests. Is there any reason why we do this style for all queries
> > in the tap tests?
> Fixed. I removed the 'cmd' variable itself.
>
>
> Attached an updated patch v26.

Thank you for updating the patch.

Here are some comments on v26 patch:

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we
move it to pg_subscritpion.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.

---
+                /*
+                 * First, ensure that we log the error message so
that it won't be
+                 * lost if some other internal error occurs in the
following code.
+                 * Then, abort the current transaction and send the
stats message of
+                 * the table synchronization failure in an idle state.
+                 */
+                HOLD_INTERRUPTS();
+                EmitErrorReport();
+                FlushErrorState();
+                AbortOutOfAnyTransaction();
+                RESUME_INTERRUPTS();
+                pgstat_report_subscription_error(MySubscription->oid, false);
+
+                if (MySubscription->disableonerr)
+                {
+                        DisableSubscriptionOnError();
+                        proc_exit(0);
+                }
+
+                PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+                'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+        "1|3",
+        "subscription sub replicated data");
+

Can we store the result to a local variable to check? I think it's
more consistent with other tap tests.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: pg_tablespace_location() failure with allow_in_place_tablespaces
Next
From: Michael Paquier
Date:
Subject: Re: wal_compression=zstd