Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Design of pg_stat_subscription_workers vs pgstats |
Date | |
Msg-id | CAD21AoCCaQfGzDCAJgYH=yo72GXozpoMF4fau=5gViXK+uquWw@mail.gmail.com Whole thread Raw |
In response to | Re: Design of pg_stat_subscription_workers vs pgstats (Peter Smith <smithpb2250@gmail.com>) |
Responses |
Re: Design of pg_stat_subscription_workers vs pgstats
RE: Design of pg_stat_subscription_workers vs pgstats Re: Design of pg_stat_subscription_workers vs pgstats |
List | pgsql-hackers |
Hi, On Wed, Feb 23, 2022 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi. Below are my review comments for the v1 patch. Thank you for the comments! I've attached the latest version patch that incorporated all comments I got so far. The primary change from the previous version is that the subscription statistics live globally rather than per-database. > > ====== > > 1. Commit message - wording > > As the result of the discussion, we've concluded that the stats > collector is not an appropriate place to store the error information of > subscription workers. > > SUGGESTION > It was decided (refer to the Discussion link below) that the stats > collector is not an appropriate place to store the error information of > subscription workers. Fixed. > > ~~~ > > 2. Commit message - wording > > This commits changes the view so that it stores only statistics > counters: apply_error_count and sync_error_count. > > SUGGESTION > "This commits changes the view" --> "This patch changes the > pg_stat_subscription_workers view" Fixed. > > ~~~ > > 3. Commit message - wording > > Removing these error details, since we don't need to record the > error information for apply workers and tablesync workers separately, > the view now has one entry per subscription. > > DID THIS MEAN... > After removing these error details, there is no longer separate > information for apply workers and tablesync workers, so the view now > has only one entry per subscription. > > -- > > But anyway, that is not entirely true either because those counters > are separate information for the different workers, right? Right. Since I also made the subscription statistics a cluster-wide statistics, I've changed this part accordingly. > > ~~ > > 4. Commit message - wording > > Also, it changes the view name to pg_stat_subscription_activity > since the word "worker" is an implementation detail that we use one > worker for one tablesync. > > SUGGESTION > "Also, it changes" --> "The patch also changes" ... Fixed. > > ~~~ > > 5. doc/src/sgml/monitoring.sgml - wording > > <para> > - The <structname>pg_stat_subscription_workers</structname> view will contain > - one row per subscription worker on which errors have occurred, for workers > - applying logical replication changes and workers handling the initial data > - copy of the subscribed tables. The statistics entry is removed when the > + The <structname>pg_stat_subscription_activity</structname> view will contain > + one row per subscription. The statistics entry is removed when the > corresponding subscription is dropped. > </para> > > SUGGESTION > "The statistics entry is removed" --> "This row is removed" On second thoughts, this sentence is not necessary since it's obvious and descriptions of other stats view don't mention it. > > ~~~ > > 6. doc/src/sgml/monitoring.sgml - why two counters? > > Please forgive this noob question... > > I see there are 2 error_count columns (one for each kind of worker) > but I was wondering why it is useful for users to be able to > distinguish if the error came from the tablesync workers or from the > apply workers? Do you have any example? > > Also, IIRC sometimes the tablesync might actually do a few "apply" > changes itself... so the distinction may become a bit fuzzy... I think that the tablesync phase and the apply phase can fail for different reasons. So these values would be a good indicator for users to check if each phase works fine. After more thoughts, I think it's better to increment sync_error_count also when a tablesync worker fails while applying the changes. These counters will correspond to the error information entries that will be stored in a system catalog. > > ~~~ > > 7. src/backend/postmaster/pgstat.c - comment > > @@ -1313,13 +1312,13 @@ pgstat_vacuum_stat(void) > } > > /* > - * Repeat for subscription workers. Similarly, we needn't bother in the > - * common case where no subscription workers' stats are being collected. > + * Repeat for subscription. Similarly, we needn't bother in the common > + * case where no subscription stats are being collected. > */ > > typo? > > "Repeat for subscription." --> "Repeat for subscriptions." Fixed. > > ~~~ > > 8. src/backend/postmaster/pgstat.c > > @@ -3000,32 +2968,29 @@ pgstat_fetch_stat_funcentry(Oid func_id) > > /* > * --------- > - * pgstat_fetch_stat_subworker_entry() - > + * pgstat_fetch_stat_subentry() - > * > * Support function for the SQL-callable pgstat* functions. Returns > - * the collected statistics for subscription worker or NULL. > + * the collected statistics for subscription or NULL. > * --------- > */ > -PgStat_StatSubWorkerEntry * > -pgstat_fetch_stat_subworker_entry(Oid subid, Oid subrelid) > +PgStat_StatSubEntry * > +pgstat_fetch_stat_subentry(Oid subid) > > There seems some kind of inconsistency because the member name is > called "subscriptions" but sometimes it seems singular. > > Some places (e.g. pgstat_vacuum_stat) will iterate multiple results, > but then other places (like this function) just return to a single > "subscription" (or "entry"). > > I suspect all the code may be fine; probably it is just some > inconsistent (singular/plural) comments that have confused things a > bit. Fixed. > > ~~~ > > 9. src/backend/replication/logical/worker.c - subscription id > > + /* Report the worker failed during table synchronization */ > + pgstat_report_subscription_error(MyLogicalRepWorker->subid, false); > > and > > + /* Report the worker failed during the application of the change */ > + pgstat_report_subscription_error(MyLogicalRepWorker->subid, true); > > > Why don't these use MySubscription->oid instead of MyLogicalRepWorker->subid? It's just because we used to use MyLogicalRepWorker->subid, is there any particular reason why we should use MySubscription->oid here? > > ~~~ > > 10. src/include/pgstat.h - enum order > > @@ -84,8 +84,8 @@ typedef enum StatMsgType > PGSTAT_MTYPE_REPLSLOT, > PGSTAT_MTYPE_CONNECT, > PGSTAT_MTYPE_DISCONNECT, > + PGSTAT_MTYPE_SUBSCRIPTIONERROR, > PGSTAT_MTYPE_SUBSCRIPTIONPURGE, > - PGSTAT_MTYPE_SUBWORKERERROR, > } StatMsgType; > > This change rearranges the enum order. Maybe it is safer not to do this? > I assume you're concerned about binary compatibility or something. I think it should not be a problem since both PGSTAT_MTYPE_SUBWORKERERROR and PGSTAT_MTYPE_SUBSCRIPTIONPURGE are introduced to PG15. > ~~~ > > 11. src/include/pgstat.h > > @@ -767,8 +747,8 @@ typedef union PgStat_Msg > PgStat_MsgReplSlot msg_replslot; > PgStat_MsgConnect msg_connect; > PgStat_MsgDisconnect msg_disconnect; > + PgStat_MsgSubscriptionError msg_subscriptionerror; > PgStat_MsgSubscriptionPurge msg_subscriptionpurge; > - PgStat_MsgSubWorkerError msg_subworkererror; > } PgStat_Msg; > > This change also rearranges the order. Maybe there was no good reason > to do that? It's for keeping the alphabetical order within subscription-related messages. > > ~~~ > > 12. src/include/pgstat.h - PgStat_StatDBEntry > > @@ -823,16 +803,12 @@ typedef struct PgStat_StatDBEntry > TimestampTz stats_timestamp; /* time of db stats file update */ > > /* > - * tables, functions, and subscription workers must be last in the struct, > - * because we don't write the pointers out to the stats file. > - * > - * subworkers is the hash table of PgStat_StatSubWorkerEntry which stores > - * statistics of logical replication workers: apply worker and table sync > - * worker. > + * tables, functions, and subscription must be last in the struct, because > + * we don't write the pointers out to the stats file. > */ > > Should that say "tables, functions, and subscriptions" (plural) This part is removed in the latest patch. On Wed, Feb 23, 2022 at 12:00 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > > Thanks for your patch. > > Few comments: > > 1. > + <structfield>apply_error_count</structfield> <type>uint8</type> > ... > + <structfield>sync_error_count</structfield> <type>uint8</type> > > It seems that Postgres has no data type named uint8, should we change it to > bigint? Right, fixed. > > 2. > +# Wait for the table sync error to be reported. > +$node_subscriber->poll_query_until( > + 'postgres', > + qq[ > +SELECT apply_error_count = 0 AND sync_error_count > 0 > +FROM pg_stat_subscription_activity > +WHERE subname = 'tap_sub' > +]) or die "Timed out while waiting for table sync error"; > > We want to check table sync error here, but do we need to check > "apply_error_count = 0"? I am not sure if it is possible that the apply worker has > an unexpected error, which would cause this test to fail. Yeah, it seems better not to have this condition, fixed. On Wed, Feb 23, 2022 at 3:21 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Below are my review comments just for the v1 patch test code. > > ====== > 1. "table sync error" versus "tablesync error" > > +# Wait for the table sync error to be reported. > +$node_subscriber->poll_query_until( > + 'postgres', > + qq[ > +SELECT apply_error_count = 0 AND sync_error_count > 0 > +FROM pg_stat_subscription_activity > +WHERE subname = 'tap_sub' > +]) or die "Timed out while waiting for table sync error"; > + > +# Truncate test_tab1 so that table sync can continue. > +$node_subscriber->safe_psql('postgres', "TRUNCATE test_tab1;"); > + > # Wait for initial table sync for test_tab1 to finish. > > IMO all these "table sync" should be changed to "tablesync", because a > table "sync error" sounds like something completely different to a > "tablesync error". > > SUGGESTIONS > - "Wait for the table sync error to be reported." --> "Wait for the > tablesync error to be reported." > - "Timed out while waiting for table sync error" --> "Timed out while > waiting for tablesync error" > - "Truncate test_tab1 so that table sync can continue." --> "Truncate > test_tab1 so that tablesync worker can fun to completion." > - "Wait for initial table sync for test_tab1 to finish." --> "Wait for > the tablesync worker of test_tab1 to finish." Fixed. > > ~~~ > > 3. Wait for the apply worker error > > +# Wait for the apply error to be reported. > +$node_subscriber->poll_query_until( > + 'postgres', > + qq[ > +SELECT apply_error_count > 0 AND sync_error_count > 0 > +FROM pg_stat_subscription_activity > +WHERE subname = 'tap_sub' > +]) or die "Timed out while waiting for apply error"; > > This test is only for apply worker errors. So why is the test SQL > checking "AND sync_error_count > 0"? > > (This is similar to what [Tang] #2 reported, but I think she was > referring to the other tablesync test) Fixed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
pgsql-hackers by date: