Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id CAD21AoCCaQfGzDCAJgYH=yo72GXozpoMF4fau=5gViXK+uquWw@mail.gmail.com
Whole thread Raw
In response to Re: Design of pg_stat_subscription_workers vs pgstats  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats
RE: Design of pg_stat_subscription_workers vs pgstats
Re: Design of pg_stat_subscription_workers vs pgstats
List pgsql-hackers
 Hi,

On Wed, Feb 23, 2022 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi. Below are my review comments for the v1 patch.

Thank you for the comments! I've attached the latest version patch
that incorporated all comments I got so far. The primary change from
the previous version is that the subscription statistics live globally
rather than per-database.

>
> ======
>
> 1. Commit message - wording
>
> As the result of the discussion, we've concluded that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.
>
> SUGGESTION
> It was decided (refer to the Discussion link below) that the stats
> collector is not an appropriate place to store the error information of
> subscription workers.

Fixed.

>
> ~~~
>
> 2. Commit message - wording
>
> This commits changes the view so that it stores only statistics
> counters: apply_error_count and sync_error_count.
>
> SUGGESTION
> "This commits changes the view" --> "This patch changes the
> pg_stat_subscription_workers view"

Fixed.

>
> ~~~
>
> 3. Commit message - wording
>
> Removing these error details, since we don't need to record the
> error information for apply workers and tablesync workers separately,
> the view now has one entry per subscription.
>
> DID THIS MEAN...
> After removing these error details, there is no longer separate
> information for apply workers and tablesync workers, so the view now
> has only one entry per subscription.
>
> --
>
> But anyway, that is not entirely true either because those counters
> are separate information for the different workers, right?

Right. Since I also made the subscription statistics a cluster-wide
statistics, I've changed this part accordingly.

>
> ~~
>
> 4. Commit message - wording
>
> Also, it changes the view name to pg_stat_subscription_activity
> since the word "worker" is an implementation detail that we use one
> worker for one tablesync.
>
> SUGGESTION
> "Also, it changes" --> "The patch also changes" ...

Fixed.

>
> ~~~
>
> 5. doc/src/sgml/monitoring.sgml - wording
>
>    <para>
> -   The <structname>pg_stat_subscription_workers</structname> view will contain
> -   one row per subscription worker on which errors have occurred, for workers
> -   applying logical replication changes and workers handling the initial data
> -   copy of the subscribed tables.  The statistics entry is removed when the
> +   The <structname>pg_stat_subscription_activity</structname> view will contain
> +   one row per subscription.  The statistics entry is removed when the
>     corresponding subscription is dropped.
>    </para>
>
> SUGGESTION
> "The statistics entry is removed" --> "This row is removed"

On second thoughts, this sentence is not necessary since it's obvious
and descriptions of other stats view don't mention it.

>
> ~~~
>
> 6.  doc/src/sgml/monitoring.sgml - why two counters?
>
> Please forgive this noob question...
>
> I see there are 2 error_count columns (one for each kind of worker)
> but I was wondering why it is useful for users to be able to
> distinguish if the error came from the tablesync workers or from the
> apply workers? Do you have any example?
>
> Also, IIRC sometimes the tablesync might actually do a few "apply"
> changes itself... so the distinction may become a bit fuzzy...

I think that the tablesync phase and the apply phase can fail for
different reasons. So these values would be a good indicator for users
to check if each phase works fine.

After more thoughts, I think it's better to increment sync_error_count
also when a tablesync worker fails while applying the changes. These
counters will correspond to the error information entries that will be
stored in a system catalog.

>
> ~~~
>
> 7. src/backend/postmaster/pgstat.c - comment
>
> @@ -1313,13 +1312,13 @@ pgstat_vacuum_stat(void)
>   }
>
>   /*
> - * Repeat for subscription workers.  Similarly, we needn't bother in the
> - * common case where no subscription workers' stats are being collected.
> + * Repeat for subscription.  Similarly, we needn't bother in the common
> + * case where no subscription stats are being collected.
>   */
>
> typo?
>
> "Repeat for subscription." --> "Repeat for subscriptions."

Fixed.

>
> ~~~
>
> 8. src/backend/postmaster/pgstat.c
>
> @@ -3000,32 +2968,29 @@ pgstat_fetch_stat_funcentry(Oid func_id)
>
>  /*
>   * ---------
> - * pgstat_fetch_stat_subworker_entry() -
> + * pgstat_fetch_stat_subentry() -
>   *
>   * Support function for the SQL-callable pgstat* functions. Returns
> - * the collected statistics for subscription worker or NULL.
> + * the collected statistics for subscription or NULL.
>   * ---------
>   */
> -PgStat_StatSubWorkerEntry *
> -pgstat_fetch_stat_subworker_entry(Oid subid, Oid subrelid)
> +PgStat_StatSubEntry *
> +pgstat_fetch_stat_subentry(Oid subid)
>
> There seems some kind of inconsistency because the member name is
> called "subscriptions" but sometimes it seems singular.
>
> Some places (e.g. pgstat_vacuum_stat) will iterate multiple results,
> but then other places (like this function) just return to a single
> "subscription" (or "entry").
>
> I suspect all the code may be fine; probably it is just some
> inconsistent (singular/plural) comments that have confused things a
> bit.

Fixed.

>
> ~~~
>
> 9. src/backend/replication/logical/worker.c - subscription id
>
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false);
>
> and
>
> + /* Report the worker failed during the application of the change */
> + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true);
>
>
> Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid?

It's just because we used to use MyLogicalRepWorker->subid, is there
any particular reason why we should use MySubscription->oid here?

>
> ~~~
>
> 10. src/include/pgstat.h - enum order
>
> @@ -84,8 +84,8 @@ typedef enum StatMsgType
>   PGSTAT_MTYPE_REPLSLOT,
>   PGSTAT_MTYPE_CONNECT,
>   PGSTAT_MTYPE_DISCONNECT,
> + PGSTAT_MTYPE_SUBSCRIPTIONERROR,
>   PGSTAT_MTYPE_SUBSCRIPTIONPURGE,
> - PGSTAT_MTYPE_SUBWORKERERROR,
>  } StatMsgType;
>
> This change rearranges the enum order. Maybe it is safer not to do this?
>

I assume you're concerned about binary compatibility or something. I
think it should not be a problem since both
PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are
introduced to PG15.

> ~~~
>
> 11. src/include/pgstat.h
>
> @@ -767,8 +747,8 @@ typedef union PgStat_Msg
>   PgStat_MsgReplSlot msg_replslot;
>   PgStat_MsgConnect msg_connect;
>   PgStat_MsgDisconnect msg_disconnect;
> + PgStat_MsgSubscriptionError msg_subscriptionerror;
>   PgStat_MsgSubscriptionPurge msg_subscriptionpurge;
> - PgStat_MsgSubWorkerError msg_subworkererror;
>  } PgStat_Msg;
>
> This change also rearranges the order. Maybe there was no good reason
> to do that?

It's for keeping the alphabetical order within subscription-related messages.

>
> ~~~
>
> 12. src/include/pgstat.h - PgStat_StatDBEntry
>
> @@ -823,16 +803,12 @@ typedef struct PgStat_StatDBEntry
>   TimestampTz stats_timestamp; /* time of db stats file update */
>
>   /*
> - * tables, functions, and subscription workers must be last in the struct,
> - * because we don't write the pointers out to the stats file.
> - *
> - * subworkers is the hash table of PgStat_StatSubWorkerEntry which stores
> - * statistics of logical replication workers: apply worker and table sync
> - * worker.
> + * tables, functions, and subscription must be last in the struct, because
> + * we don't write the pointers out to the stats file.
>   */
>
> Should that say "tables, functions, and subscriptions" (plural)

This part is removed in the latest patch.

On Wed, Feb 23, 2022 at 12:00 PM tanghy.fnst@fujitsu.com
<tanghy.fnst@fujitsu.com> wrote:
>
>
> Thanks for your patch.
>
> Few comments:
>
> 1.
> +       <structfield>apply_error_count</structfield> <type>uint8</type>
> ...
> +       <structfield>sync_error_count</structfield> <type>uint8</type>
>
> It seems that Postgres has no data type named uint8, should we change it to
> bigint?

Right, fixed.

>
> 2.
> +# Wait for the table sync error to be reported.
> +$node_subscriber->poll_query_until(
> +       'postgres',
> +       qq[
> +SELECT apply_error_count = 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for table sync error";
>
> We want to check table sync error here, but do we need to check
> "apply_error_count = 0"? I am not sure if it is possible that the apply worker has
> an unexpected error, which would cause this test to fail.

Yeah, it seems better not to have this condition, fixed.

On Wed, Feb 23, 2022 at 3:21 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Below are my review comments just for the v1 patch test code.
>
> ======
> 1. "table sync error" versus "tablesync error"
>
> +# Wait for the table sync error to be reported.
> +$node_subscriber->poll_query_until(
> + 'postgres',
> + qq[
> +SELECT apply_error_count = 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for table sync error";
> +
> +# Truncate test_tab1 so that table sync can continue.
> +$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;");
> +
>  # Wait for initial table sync for test_tab1 to finish.
>
> IMO all these "table sync" should be changed to "tablesync", because a
> table "sync error" sounds like something completely different to a
> "tablesync error".
>
> SUGGESTIONS
> - "Wait for the table sync error to be reported." --> "Wait for the
> tablesync error to be reported."
> - "Timed out while waiting for table sync error" --> "Timed out while
> waiting for tablesync error"
> - "Truncate test_tab1 so that table sync can continue." -->  "Truncate
> test_tab1 so that tablesync worker can fun to completion."
> - "Wait for initial table sync for test_tab1 to finish." --> "Wait for
> the tablesync worker of test_tab1 to finish."

Fixed.

>
> ~~~
>
> 3. Wait for the apply worker error
>
> +# Wait for the apply error to be reported.
> +$node_subscriber->poll_query_until(
> + 'postgres',
> + qq[
> +SELECT apply_error_count > 0 AND sync_error_count > 0
> +FROM pg_stat_subscription_activity
> +WHERE subname = 'tap_sub'
> +]) or die "Timed out while waiting for apply error";
>
> This test is only for apply worker errors. So why is the test SQL
> checking "AND sync_error_count > 0"?
>
> (This is similar to what [Tang] #2 reported, but I think she was
> referring to the other tablesync test)

Fixed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

pgsql-hackers by date:

Previous
From: Joseph Koshakow
Date:
Subject: Extract epoch from Interval weird behavior
Next
From: Tom Lane
Date:
Subject: Re: convert libpq uri-regress tests to tap test