Re: Report bytes and transactions actually sent downtream - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Report bytes and transactions actually sent downtream |
Date | |
Msg-id | CAExHW5v8yBQXDLP2vPCE7rxQAP0hFCrG=ou=G8kVtBM5squYUg@mail.gmail.com Whole thread Raw |
In response to | Re: Report bytes and transactions actually sent downtream (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Report bytes and transactions actually sent downtream
|
List | pgsql-hackers |
On Sun, Jul 13, 2025 at 4:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 1, 2025 at 7:35 PM Ashutosh Bapat > <ashutosh.bapat.oss@gmail.com> wrote: > > > > On Tue, Jul 1, 2025 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jun 30, 2025 at 3:24 PM Ashutosh Bapat > > > <ashutosh.bapat.oss@gmail.com> wrote: > > > > > > > > Hi All, > > > > In a recent logical replication issue, there were multiple replication > > > > slots involved, each using a different publication. Thus the amount of > > > > data that was replicated through each slot was expected to be > > > > different. However, total_bytes and total_txns were reported the same > > > > for all the replication slots as expected. One of the slots started > > > > lagging and we were trying to figure out whether its the WAL sender > > > > slowing down or the consumer (in this case Debezium). The lagging > > > > slot then showed total_txns and total_bytes lesser than other slots > > > > giving an impression that the WAL sender is processing the data > > > > slowly. Had pg_stat_replication_slot reported the amount of data > > > > actually sent downstream, it would have been easier to compare it with > > > > the amount of data received by the consumer and thus pinpoint the > > > > bottleneck. > > > > > > > > Here's a patch to do the same. It adds two columns > > > > - sent_txns: The total number of transactions sent downstream. > > > > - sent_bytes: The total number of bytes sent downstream in data messages > > > > to pg_stat_replication_slots. sent_bytes includes only the bytes sent > > > > as part of 'd' messages and does not include keep alive messages or > > > > CopyDone messages for example. But those are very few and can be > > > > ignored. If others feel that those are important to be included, we > > > > can make that change. > > > > > > > > Plugins may choose not to send an empty transaction downstream. It's > > > > better to increment sent_txns counter in the plugin code when it > > > > actually sends a BEGIN message, for example in pgoutput_send_begin() > > > > and pg_output_begin(). This means that every plugin will need to be > > > > modified to increment the counter for it to reported correctly. > > > > > > > > > > What if some plugin didn't implemented it or does it incorrectly? > > > Users will then complain that PG view is showing incorrect value. > > > > That is right. > > > > To fix the problem of plugins not implementing the counter increment > > logic we could use logic similar to how we track whether > > OutputPluginPrepareWrite() has been called or not. In > > ReorderBufferTxn, we add a new member sent_status which would be an > > enum with 3 values UNKNOWN, SENT, NOT_SENT. Initially the sent_status > > = UNKNOWN. We provide a function called > > plugin_sent_txn(ReorderBufferTxn txn, sent bool) which will set > > sent_status = SENT when sent = true and sent_status = NOT_SENT when > > sent = false. In all the end transaction callback wrappers like > > commit_cb_wrapper(), prepare_cb_wrapper(), stream_abort_cb_wrapper(), > > stream_commit_cb_wrapper() and stream_prepare_cb_wrapper(), if > > tsent_status = UNKNOWN, we throw an error. > > > > I think we don't want to make it mandatory for plugins to implement > these stats, so instead of throwing ERROR, the view should show that > the plugin doesn't provide stats. How about having OutputPluginStats > similar to OutputPluginCallbacks and OutputPluginOptions members in > LogicalDecodingContext? It will have members like stats_available, > txns_sent or txns_skipped, txns_filtered, etc. Not making mandatory looks useful. I can try your suggestion. Rather than having stats_available as a member of OutputPluginStats, it's better to have a NULL value for the corresponding member in LogicalDecodingContext. We don't want an output plugin to reset stats_available once set. Will that work? > I am thinking it will > be better to provide this information in a separate view like > pg_stat_plugin_stats or something like that, here we can report > slot_name, plugin_name, then the other stats we want to implement part > of OutputPluginStats. As you have previously pointed out, the view should make it explicit that the new stats are maintained by the plugin and not core. I agree with that intention. However, already have three views pg_replication_slots (which has slot name and plugin name), then pg_replication_stats which is about stats maintained by a WAL sender or running replication and then pg_stat_replication_slots, which is about accumulated statistics for a replication through a given replication slot. It's already a bit hard to keep track of who's who when debugging an issue. Adding one more view will add to confusion. Instead of adding a new view how about a. name the columns as plugin_sent_txns, plugin_sent_bytes, plugin_filtered_change_bytes to make it clear that these columns are maintained by plugin b. report these NULL if stats_available = false OR OutputPluginStats is not set in LogicalDecodingContext c. Document that NULL value for these columns indicates that the plugin is not maintaining/reporting these stats d. adding plugin name to pg_stat_replication_slots, that will make it easy for users to know which plugin they should look at in case of dubious or unavailable stats -- Best Wishes, Ashutosh Bapat
pgsql-hackers by date: