Re: Failed transaction statistics to measure the logical replication progress - Mailing list pgsql-hackers

From vignesh C
Subject Re: Failed transaction statistics to measure the logical replication progress
Date
Msg-id CALDaNm3t3WbqZMfFeUfFhp7Q0Oc-0g-FXzRjYEYeq3K_3OLpkQ@mail.gmail.com
Whole thread Raw
In response to Re: Failed transaction statistics to measure the logical replication progress  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 2:52 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On Thursday, July 29, 2021 10:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > On Thu, Jul 8, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > > When the current HEAD fails during logical decoding, the failure
> > > > increments txns count in pg_stat_replication_slots - [1] and adds
> > > > the transaction size to the sum of bytes in the same repeatedly on
> > > > the publisher, until the problem is solved.
> > > > One of the good examples is duplication error on the subscriber side
> > > > and this applies to both streaming and spill cases as well.
> > > >
> > > > This update prevents users from grasping the exact number and size
> > > > of successful and unsuccessful transactions. Accordingly, we need to
> > > > have new columns of failed transactions that will work to
> > > > differentiate both of them for all types, which means spill,
> > > > streaming and normal transactions. This will help users to measure
> > > > the exact status of logical replication.
> > >
> > > Could you please elaborate on use cases of the proposed statistics?
> > > For example, the current statistics on pg_replication_slots can be
> > > used for tuning logical_decoding_work_mem as well as inferring the
> > > total amount of bytes passed to the output plugin. How will the user use those statistics?
> > >
> > > Also, if we want the stats of successful transactions why don't we
> > > show the stats of successful transactions in the view instead of ones
> > > of failed transactions?
> > It works to show the ratio of successful and unsuccessful transactions,
> > which should be helpful in terms of administrator perspective.
> > FYI, the POC patch added the columns where I prefixed 'failed' to those names.
> > But, substantially, it meant the ratio when user compared normal columns and
> > newly introduced columns by this POC in the pg_stat_replication_slots.
>
> What can the administrator use the ratio of successful and
> unsuccessful logical replication transactions for? For example, IIUC
> if a conflict happens on the subscriber as you mentioned, the
> successful transaction ratio calculated by those statistics is getting
> low, perhaps meaning the logical replication stopped. But it can be
> checked also by checking pg_stat_replication view or
> pg_replication_slots view (or error counts of
> pg_stat_subscription_errors view I’m proposing[1]). Do you have other
> use cases?

We could also include failed_data_size, this will help us to identify
the actual bandwidth consumed by the failed transaction. It will help
the DBA's to understand the network consumption in a better way.
Currently only total transaction and total data will be available but
when there is a failure, the failed transaction data will be sent
repeatedly, if the DBA does not solve the actual cause of failure,
there can be significant consumption of the network due to failure
transaction being sent repeatedly. DBA will not be able to understand
why there is so much network bandwidth consumption. If we give the
failed transaction information, the DBA might not get alarmed in this
case and understand that the network consumption is genuine. Also it
will help monitoring tools to project this value.

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Failed transaction statistics to measure the logical replication progress
Next
From: Andrey Borodin
Date:
Subject: Re: Slow standby snapshot