Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Skipping logical replication transactions on subscriber side
Date
Msg-id CAD21AoCHDWTy-BcLpeFoZE9P9pe1Sjg-E=uY1mC0K=jvdfXb=g@mail.gmail.com
Whole thread Raw
In response to Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Skipping logical replication transactions on subscriber side
List pgsql-hackers
On Wed, Oct 27, 2021 at 7:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 21, 2021 at 10:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I've attached updated patches.

Thank you for the comments!

>
> Few comments:
> ==============
> 1. Is the patch cleaning tablesync error entries except via vacuum? If
> not, can't we send a message to remove tablesync errors once tablesync
> is successful (say when we reset skip_xid or when tablesync is
> finished) or when we drop subscription? I think the same applies to
> apply worker. I think we may want to track it in some way whether an
> error has occurred before sending the message but relying completely
> on a vacuum might be the recipe of bloat. I think in the case of a
> drop subscription we can simply send the message as that is not a
> frequent operation. I might be missing something here because in the
> tests after drop subscription you are expecting the entries from the
> view to get cleared

Yes, I think we can have tablesync worker send a message to drop stats
once tablesync is successful. But if we do that also when dropping a
subscription, I think we need to do that only the transaction is
committed since we can drop a subscription that doesn't have a
replication slot and rollback the transaction. Probably we can send
the message only when the subscritpion does have a replication slot.

In other cases, we can remember the subscriptions being dropped and
send the message to drop the statistics of them after committing the
transaction but I’m not sure it’s worth having it. FWIW, we completely
rely on pg_stat_vacuum_stats() for cleaning up the dead tables and
functions. And we don't expect there are many subscriptions on the
database.

What do you think?

>
> 2.
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>count</structfield> <type>uint8</type>
> +      </para>
> +      <para>
> +       Number of consecutive times the error occurred
> +      </para></entry>
>
> Shall we name this field as error_count as there will be other fields
> in this view in the future that may not be directly related to the
> error?

Agreed.

>
> 3.
> +
> +CREATE VIEW pg_stat_subscription_workers AS
> +    SELECT
> + e.subid,
> + s.subname,
> + e.subrelid,
> + e.relid,
> + e.command,
> + e.xid,
> + e.count,
> + e.error_message,
> + e.last_error_time,
> + e.stats_reset
> +    FROM (SELECT
> +              oid as subid,
> +              NULL as relid
> +          FROM pg_subscription
> +          UNION ALL
> +          SELECT
> +              srsubid as subid,
> +              srrelid as relid
> +          FROM pg_subscription_rel
> +          WHERE srsubstate <> 'r') sr,
> +          LATERAL pg_stat_get_subscription_worker(sr.subid, sr.relid) e
>
> It might be better to use 'w' as an alias instead of 'e' as the
> information is now not restricted to only errors.

Agreed.

>
> 4. +# Test if the error reported on pg_subscription_workers view is expected.
>
> The view name is wrong in the above comment

Fixed.

>
> 5.
> +# Check if the view doesn't show any entries after dropping the subscriptions.
> +$node_subscriber->safe_psql(
> +    'postgres',
> +    q[
> +DROP SUBSCRIPTION tap_sub;
> +DROP SUBSCRIPTION tap_sub_streaming;
> +]);
> +$result = $node_subscriber->safe_psql('postgres',
> +       "SELECT count(1) FROM pg_stat_subscription_workers");
> +is($result, q(0), 'no error after dropping subscription');
>
> Don't we need to wait after dropping the subscription and before
> checking the view as there might be a slight delay in messages to get
> cleared?

I think the test always passes without waiting for the statistics to
be updated since we fetch the subscription worker statistics from the
stats collector based on the entries of pg_subscription catalog. So
this test checks if statistics of already-dropped subscription doesn’t
show up in the view after DROP SUBSCRIPTION, but does not check if the
subscription worker statistics entry in the stats collector gets
removed. The primary reason is that as I mentioned above, the patch
relies on pgstat_vacuum_stat() for cleaning up the dead subscriptions.

>
> 7.
> +# Create subscriptions. The table sync for test_tab2 on tap_sub will enter to
> +# infinite error due to violating the unique constraint.
> +my $appname = 'tap_sub';
> +$node_subscriber->safe_psql(
> +    'postgres',
> +    "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> application_name=$appname' PUBLICATION tap_pub WITH (streaming = off,
> two_phase = on);");
> +my $appname_streaming = 'tap_sub_streaming';
> +$node_subscriber->safe_psql(
> +    'postgres',
> +    "CREATE SUBSCRIPTION tap_sub_streaming CONNECTION
> '$publisher_connstr application_name=$appname_streaming' PUBLICATION
> tap_pub_streaming WITH (streaming = on, two_phase = on);");
> +
> +$node_publisher->wait_for_catchup($appname);
> +$node_publisher->wait_for_catchup($appname_streaming);
>
> How can we ensure that subscriber would have caught up when one of the
> tablesync workers is constantly in the error loop? Isn't it possible
> that the subscriber didn't send the latest lsn feedback till the table
> sync worker is finished?
>

I thought that even if tablesync for a table is still ongoing, the
apply worker can apply commit records, update write LSN and flush LSN,
and send the feedback to the wal sender. No?

> 8.
> +# Create subscriptions. The table sync for test_tab2 on tap_sub will enter to
> +# infinite error due to violating the unique constraint.
>
> The second sentence of the comment can be written as: "The table sync
> for test_tab2 on tap_sub will enter into infinite error loop due to
> violating the unique constraint."

Fixed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side