RE: Failed transaction statistics to measure the logical replication progress - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Failed transaction statistics to measure the logical replication progress
Date
Msg-id TYWPR01MB8362B30A904274A911C0B1CCED039@TYWPR01MB8362.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Failed transaction statistics to measure the logical replication progress  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: Failed transaction statistics to measure the logical replication progress
RE: Failed transaction statistics to measure the logical replication progress
List pgsql-hackers
On Tuesday, March 1, 2022 4:12 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Please see below my review comments for v25.
> 
> ======
> 
> 1. Commit message
> 
> Introduce cumulative columns of transactions of logical replication subscriber
> to the pg_stat_subscription_stats view.
> 
> "cumulative columns of transactions" sounds a bit strange to me.
> 
> SUGGESTED
> Introduce 2 new subscription statistics columns (apply_commit_count, and
> apply_rollback_count) to the pg_stat_subscription_stats view for counting
> cumulative transaction commits/rollbacks.
Fixed.



> ~~~
> 
> 2. doc/src/sgml/monitoring.sgml - bug
> 
> The new SGML <row>s have been added in the wrong place!
> 
> I don't think this renders like you expect it does. Please regenerate the help to
> see for yourself.
Fixed.


> ~~~
> 
> 3. doc/src/sgml/monitoring.sgml - wording
> 
> +      <para>
> +       Number of transactions rollbacked in this subscription. Both
> +       <command>ROLLBACK</command> of transaction streamed as
> in-progress
> +       transaction and <command>ROLLBACK PREPARED</command>
> increment this
> +       counter.
> +      </para></entry>
> 
> BEFORE
> Number of transactions rollbacked in this subscription.
> 
> SUGGESTED
> Number of transaction rollbacks in this subscription.
Fixed.


> ~~~
> 
> 4. doc/src/sgml/monitoring.sgml - wording
> 
> +      <para>
> +       Number of transactions rollbacked in this subscription. Both
> +       <command>ROLLBACK</command> of transaction streamed as
> in-progress
> +       transaction and <command>ROLLBACK PREPARED</command>
> increment this
> +       counter.
> +      </para></entry>
> 
> Trying to distinguish between the ROLLBACK of a transaction and of a
> streamed in-progress transaction seems to have made this description too
> complicated. I don't think the user even cares/knows about this
> (in-progress) distinction. So, I think this should just be written more simply
> (like the COMMIT part was)
> 
> BEFORE
> Both  <command>ROLLBACK</command> of transaction streamed as
> in-progress  transaction and <command>ROLLBACK
> PREPARED</command> increment this counter.
> 
> SUGGESTED
> Both  <command>ROLLBACK</command> and <command>ROLLBACK
> PREPARED</command> increment this counter.
Fixed.


> ~~~
> 
> 5. Question - column names.
> 
> Just curious why the columns are called "apply_commit_count" and
> "apply_rollback_count"? Specifically, what extra meaning do those names have
> versus just calling them "commit_count" and "rollback_count"?
I think there's possibility that we'll have counters
for tablesync commit for example. So, the name prefix avoids
the overlap between the possible names.



> ~~~
> 
> 6. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> @@ -3421,6 +3425,60 @@ pgstat_send_slru(void)  }
> 
>  /* ----------
> + * pgstat_report_subscription_xact() -
> + *
> + * Send a subscription transaction stats to the collector.
> + * The statistics are cleared upon sending.
> + *
> + * 'force' is true only when the subscription worker process exits.
> + * ----------
> + */
> +void
> +pgstat_report_subscription_xact(bool force)
> 
> 6a.
> I think this comment should be worded more like the other
> pgstat_report_subscption_XXX comments
> 
> BEFORE
> Send a subscription transaction stats to the collector.
> 
> SUGGESTED
> Tell the collector about subscriptions transaction stats.
Fixed.



> 6b.
> + * 'force' is true only when the subscription worker process exits.
> 
> I thought this comment should just describe what the 'force' param actually
> does in this function; not the scenario about who calls it...
Fixed.



> ~~~
> 
> 7. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> I think the entire function maybe should be relocated to be nearby the other
> pgstat_report_subscription_XXX functions in the source.
I placed the pgstat_report_subscription_xact below pgstat_report_subscription_drop.
Meanwhile, pgstat_recv_subscription_xact, another new function in pgstat.c,
is already placed below pgstat_recv_subscription_error, so I kept it as it is.



> ~~~
> 
> 8. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * This function can be called even if nothing at all has happened. In
> + * this case, there's no need to go forward.
> + */
> 
> Too much information. Clearly, it is possible for this function to be called for this
> case otherwise this code would not exist in the first place :) IMO the comment
> can be much simpler but still say all it needs to.
> 
> BEFORE
> This function can be called even if nothing at all has happened. In this case,
> there's no need to go forward.
> SUGGESTED
> Bailout early if nothing to do.
Fixed.


> ~~~
> 
> 9. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + if (subStats.subid == InvalidOid ||
> + (subStats.apply_commit_count == 0 && subStats.apply_rollback_count ==
> + 0)) return;
> 
> Maybe using !OisIsValid(subStats.subid) is better?
Fixed.


> ~~~
> 
> 10. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + /*
> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> + * msec since we last sent one to avoid overloading the stats
> + * collector.
> + */
> 
> SUGGESTED (2 sentences instead of 1)
> Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> msec since we last sent one. This is to avoid overloading the stats collector.
Fixed.



> ~~~
> 
> 11. src/backend/postmaster/pgstat.c - pgstat_report_subscription_xact
> 
> + if (!force)
> + {
> + TimestampTz now = GetCurrentTimestamp();
> +
> + /*
> + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> + * msec since we last sent one to avoid overloading the stats
> + * collector.
> + */
> + if (!TimestampDifferenceExceeds(last_report, now,
> + PGSTAT_STAT_INTERVAL)) return; last_report = now; }
> 
> (Yeah, I know there is similar code in this module but 2 wrongs do not make a
> right)
> 
> I think logically it is better to put the 'now' and the 'last_report'
> outside this if (!force) block. Otherwise, the forced report is not setting the
> 'last_report' time and that just seems strange.
> 
> Rearranging this code is better IMO. e.g.
> - the conditions are expressed positive instead of negative (!)
> - only one return point instead of multiple
> - the 'last_report' is always set so that strangeness is eliminated
> 
> SUGGESTED (it's the same code but rearranged)
> 
> TimestampTz now = GetCurrentTimestamp();
> 
> if (force || TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL)) {
> /*
> * Prepare and send the message.
> */
> pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_SUBSCRIPTIONXACT);
> msg.m_databaseid = MyDatabaseId; msg.m_subid = subStats.subid;
> msg.apply_commit_count = subStats.apply_commit_count;
> msg.apply_rollback_count = subStats.apply_rollback_count;
> pgstat_send(&msg, sizeof(PgStat_MsgSubscriptionXact));
> last_report = now;
> 
> /*
> * Clear out the statistics.
> */
> subStats.apply_commit_count = 0;
> subStats.apply_rollback_count = 0;
> }
Yeah, your suggestion looks tidy.
Yet, I wasn't sure if I should set the 'last_report' for exit case,
since we don't use it after the worker exit.
In addition, we need to calculate GetCurrentTimestamp()
even in the case 'force' is set to true.
I'm not sure if that is correct.

So, I'd like to keep it as it as at this stage.


> ~~~
> 
> 12. src/backend/replication/logical/worker.c - LogicalRepSubscriptionStats
> 
> @@ -238,6 +238,8 @@ static ApplyErrorCallbackArg apply_error_callback_arg
> =
>   .ts = 0,
>  };
> 
> +LogicalRepSubscriptionStats subStats = {InvalidOid, 0, 0};
> 
> Maybe better to show explicit the member assignments (like the struct above
> this one) to make it more clear.
True. Fixed.


> ~~~
> 
> 13. src/backend/replication/logical/worker.c - subscription_stats_update
> 
> @@ -3372,6 +3386,22 @@ TwoPhaseTransactionGid(Oid subid, TransactionId
> xid, char *gid, int szgid)
>   snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);  }
> 
> +/*
> + * Update the statistics of subscription.
> + */
> +static void
> +subscription_stats_update(bool is_commit) {
> +Assert(OidIsValid(subStats.subid));
> +
> + if (is_commit)
> + subStats.apply_commit_count++;
> + else
> + subStats.apply_rollback_count++;
> +
> + pgstat_report_subscription_xact(false);
> +}
> +
> 
> I felt maybe this would be look better split into 2 functions: e.g.
> subscription_stats_incr_commit() and
> subscription_stats_incr_rollback(). Then it would be more readable from all
> the callers instead of the vague looking subscription_stats_update(true/false).
Okay. Fixed.

Probably, I suppose the ideal solution here would be probably to come up
with a good name for one unified function that explains the internal processing by itself.
I spent some time to try to create a new good name (e.g. "subscription_committed_stats_update")
but all ideas weren't good. Then, I decided to make them separate.


> ~~~
> 
> 14. src/backend/replication/logical/worker.c - subscription_stats_update
> 
> +/*
> + * Update the statistics of subscription.
> + */
> +static void
> +subscription_stats_update(bool is_commit) {
> +Assert(OidIsValid(subStats.subid));
> +
> + if (is_commit)
> + subStats.apply_commit_count++;
> + else
> + subStats.apply_rollback_count++;
> +
> + pgstat_report_subscription_xact(false);
> +}
> 
> Is it really necessary to be calling
> pgstat_report_subscription_xactfrom here? That is already being called in the
> LogicalRepApplyLoop. Isn't that enough?
Deleted.


> ~~~
> 
> 15. src/backend/replication/logical/worker.c - LogicalRepApplyLoop
> 
> @@ -2717,6 +2729,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
>   if (endofstream)
>   break;
> 
> + pgstat_report_subscription_xact(false);
> +
> 
> Wondering if this call is better to be done a couple of lines up (above the
> 'if/break'). Especially if you remove the call from the subscription_stats_update
> as suggested in my review comment #14.
I feel this should be after the condition.

I checked with my debugger that after the break of LogicalRepApplyLoop
(when endofstream equals true), we'd call proc_exit. This leads to 
the call pgstat_report_subscription_xact in the logicalrep_worker_exit.

In this function, we use it with 'true' argument.
So, if we call the pgstat_report_subscription_xact before the endofstream check,
we have to call the function twice in the end.

Therefore, the current position makes sense to me.


> ~~~
> 
> 16. src/backend/utils/adt/pgstatfuncs.c - pg_stat_get_subscription_stats
> 
> @@ -2424,7 +2424,11 @@
> pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
>      INT8OID, -1, 0);
>   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "sync_error_count",
>      INT8OID, -1, 0);
> - TupleDescInitEntry(tupdesc, (AttrNumber) 4, "stats_reset",
> + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "apply_commit_count",
> +    INT8OID, -1, 0);
> + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "sync_rollback_count",
> +    INT8OID, -1, 0);
> + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "stats_reset",
>      TIMESTAMPTZOID, -1, 0);
> 
> Bug? What is "sync_rollback_count"? Looks like a cut/paste error.
Fixed.


> ~~~
> 
> 17. src/include/pgstat.h
> 
> +typedef struct PgStat_MsgSubscriptionXact {  PgStat_MsgHdr m_hdr;
> +
> + /* determine the worker entry */
> + Oid m_databaseid;
> + Oid m_subid;
> +
> + PgStat_Counter apply_commit_count;
> + PgStat_Counter apply_rollback_count;
> +} PgStat_MsgSubscriptionXact;
> 
> Is that m_databaseid even needed? I did not notice it getting used (e.g.
> pgstat_recv_subscription_xact does not use it). Also, wasn't similar removed
> from the other subscription error stats?
Fixed.



> ~~~
> 
> 18. src/include/pgstat.h
> 
> + PgStat_Counter apply_rollback_count;
> +} PgStat_MsgSubscriptionXact;
> +
> +
> 
> The extra blank line can be removed.
Fixed.


> ~~~
> 
> 19. src/include/pgstat.h
> 
> @@ -1177,6 +1201,8 @@ extern void pgstat_send_archiver(const char *xlog,
> bool failed);  extern void pgstat_send_bgwriter(void);  extern void
> pgstat_send_checkpointer(void);  extern void pgstat_send_wal(bool force);
> +extern void pgstat_report_subscription_xact(bool force);
> +
> 
>  /* ----------
> 
> The extra blank line can be removed.
Fixed.


> ~~~
> 
> 20. Test for the column names.
> 
> The patch added a couple of new columns to statistics so I was surprised there
> were no regression test updates needed for these? How can that be?
> Shouldn't there be just one regression test that validates the view column
> names are what they are expected to be?
In my earlier versions, I had some tests
that covered major types of transactions (e.g. stream commit,
stream abort, rollback prepared...) and waited for accurate update of new counters.
But, I dropped those after an advice the result of such kind of tests
wouldn't become stable fundamentally.

Also, I quickly checked other similar views(pg_stat_slru, pg_stat_wal_receiver)
commit logs, especially when they introduce columns.
But, I couldn't find column name validations.
So, I feel this is aligned.


Best Regards,
    Takamichi Osumi


Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Next
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: logical replication restrictions