Thread: pgsql: Add a view to show the stats of subscription workers.

pgsql: Add a view to show the stats of subscription workers.

From
Amit Kapila
Date:
Add a view to show the stats of subscription workers.

This commit adds a new system view pg_stat_subscription_workers, that
shows information about any errors which occur during the application of
logical replication changes as well as during performing initial table
synchronization. The subscription statistics entries are removed when the
corresponding subscription is removed.

It also adds an SQL function pg_stat_reset_subscription_worker() to reset
single subscription errors.

The contents of this view can be used by an upcoming patch that skips the
particular transaction that conflicts with the existing data on the
subscriber.

This view can be extended in the future to track other xact related
statistics like the number of xacts committed/aborted for subscription
workers.

Author: Masahiko Sawada
Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip Kumar, Takamichi Osumi, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/8d74fc96db5fd547e077bf9bf4c3b67f821d71cd

Modified Files
--------------
doc/src/sgml/monitoring.sgml                | 157 +++++++++++
src/backend/catalog/system_functions.sql    |   4 +
src/backend/catalog/system_views.sql        |  23 ++
src/backend/commands/subscriptioncmds.c     |  16 +-
src/backend/postmaster/pgstat.c             | 409 +++++++++++++++++++++++++++-
src/backend/replication/logical/worker.c    |  54 +++-
src/backend/utils/adt/pgstatfuncs.c         | 128 ++++++++-
src/include/catalog/catversion.h            |   2 +-
src/include/catalog/pg_proc.dat             |  18 ++
src/include/pgstat.h                        | 109 +++++++-
src/test/regress/expected/rules.out         |  18 ++
src/test/subscription/t/026_worker_stats.pl | 154 +++++++++++
src/tools/pgindent/typedefs.list            |   4 +
13 files changed, 1069 insertions(+), 27 deletions(-)


Re: pgsql: Add a view to show the stats of subscription workers.

From
Amit Kapila
Date:
On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote:
>
> Add a view to show the stats of subscription workers.
>

sidewinder is showing failure in the test added by this patch:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25

I'll investigate and share my findings.

-- 
With Regards,
Amit Kapila.



Re: pgsql: Add a view to show the stats of subscription workers.

From
Masahiko Sawada
Date:
On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote:
> >
> > Add a view to show the stats of subscription workers.
> >
>
> sidewinder is showing failure in the test added by this patch:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25
>
> I'll investigate and share my findings.

I've looked at the issue. Here is my analysis:

Looking at the result the test actually got, we had two error entries
for test_tab1 instead of one:

#   Failed test 'check the error reported by the apply worker'
#   at t/026_worker_stats.pl line 33.
#          got: 'tap_sub|INSERT|test_tab1|t
# tap_sub||test_tab1|t'
#     expected: 'tap_sub|INSERT|test_tab1|t'

The possible scenarios are:

The table sync worker for test_tab1 failed due to an error unrelated
to apply changes:

2021-11-30 06:24:02.137 CET [18990:2] ERROR:  replication origin with
OID 2 is already active for PID 23706

At this time, the view had one error entry for the table sync worker.
After retrying table sync, it succeeded:

2021-11-30 06:24:04.202 CET [28117:2] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab1"
has finished

Then after inserting a row on the publisher, the apply worker inserted
the row but failed due to violating a unique key violation, which is
expected:

2021-11-30 06:24:04.307 CET [4806:2] ERROR:  duplicate key value
violates unique constraint "test_tab1_pkey"
2021-11-30 06:24:04.307 CET [4806:3] DETAIL:  Key (a)=(1) already exists.
2021-11-30 06:24:04.307 CET [4806:4] CONTEXT:  processing remote data
during "INSERT" for replication target relation "public.test_tab1" in
transaction 721 at 2021-11-30 06:24:04.305096+01

As a result, we had two error entries for test_tab1: the table sync
worker error and the apply worker error. I didn't expect that the
table sync worker for test_tab1 failed due to "replication origin with
OID 2 is already active for PID 23706” error.

Looking at test_subscription_error() in 026_worker_stats.pl, we have
two checks; in the first check, we wait for the view to show the error
entry for the given relation name and xid. This check was passed since
we had the second error (i.g., apply worker error). In the second
check, we get error entries from pg_stat_subscription_workers by
specifying only the relation name. Therefore, we ended up getting two
entries and failed the tests.

To fix this issue, I think that in the second check, we can get the
error from pg_stat_subscription_workers by specifying the relation
name *and* xid like the first check does. I've attached the patch.
What do you think?

FWIW, I'm still not investigating why "replication origin with OID 2
is already active for PID 23706" occurred. But regardless of the
reason why this error occurred, the attached fix is necessary to make
the tests robust.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: pgsql: Add a view to show the stats of subscription workers.

From
Amit Kapila
Date:
On Tue, Nov 30, 2021 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> FWIW, I'm still not investigating why "replication origin with OID 2
> is already active for PID 23706" occurred.
>

I think I know why this error occurred. In process 23706 (which is
doing table sync for one of the tables) after allocating replication
origin with OID 2, we have inserted the entry for same in
pg_replication_origin and associated it with that session. Now, an
expected error happened (Copy failure) which rolled back the
transaction where we allocated and inserted the origin with OID 2.
Before session 23706 could exit where we will disassociate the
replication origin state from the current process, a new process
(table sync for another table) allocated the replication origin OID 2
and tried to associate with its session where it gets the error
(replication origin with OID 2
is already active for PID 23706). The next time it tried there was no
such concurrent process to hold the OID and it succeeded. I think this
is all as per our current implementation of table sync worker. We just
need to be careful in the new test.

I'll review your patch next.

-- 
With Regards,
Amit Kapila.



Re: pgsql: Add a view to show the stats of subscription workers.

From
Amit Kapila
Date:
On Tue, Nov 30, 2021 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote:
> > >
> > > Add a view to show the stats of subscription workers.
> > >
> >
> > sidewinder is showing failure in the test added by this patch:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25
> >
> > I'll investigate and share my findings.
>
> I've looked at the issue. Here is my analysis:
>
> Looking at the result the test actually got, we had two error entries
> for test_tab1 instead of one:
>
> #   Failed test 'check the error reported by the apply worker'
> #   at t/026_worker_stats.pl line 33.
> #          got: 'tap_sub|INSERT|test_tab1|t
> # tap_sub||test_tab1|t'
> #     expected: 'tap_sub|INSERT|test_tab1|t'
>
> The possible scenarios are:
>
> The table sync worker for test_tab1 failed due to an error unrelated
> to apply changes:
>
> 2021-11-30 06:24:02.137 CET [18990:2] ERROR:  replication origin with
> OID 2 is already active for PID 23706
>
> At this time, the view had one error entry for the table sync worker.
> After retrying table sync, it succeeded:
>
> 2021-11-30 06:24:04.202 CET [28117:2] LOG:  logical replication table
> synchronization worker for subscription "tap_sub", table "test_tab1"
> has finished
>
> Then after inserting a row on the publisher, the apply worker inserted
> the row but failed due to violating a unique key violation, which is
> expected:
>
> 2021-11-30 06:24:04.307 CET [4806:2] ERROR:  duplicate key value
> violates unique constraint "test_tab1_pkey"
> 2021-11-30 06:24:04.307 CET [4806:3] DETAIL:  Key (a)=(1) already exists.
> 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT:  processing remote data
> during "INSERT" for replication target relation "public.test_tab1" in
> transaction 721 at 2021-11-30 06:24:04.305096+01
>
> As a result, we had two error entries for test_tab1: the table sync
> worker error and the apply worker error. I didn't expect that the
> table sync worker for test_tab1 failed due to "replication origin with
> OID 2 is already active for PID 23706” error.
>
> Looking at test_subscription_error() in 026_worker_stats.pl, we have
> two checks; in the first check, we wait for the view to show the error
> entry for the given relation name and xid. This check was passed since
> we had the second error (i.g., apply worker error). In the second
> check, we get error entries from pg_stat_subscription_workers by
> specifying only the relation name. Therefore, we ended up getting two
> entries and failed the tests.
>
> To fix this issue, I think that in the second check, we can get the
> error from pg_stat_subscription_workers by specifying the relation
> name *and* xid like the first check does. I've attached the patch.
> What do you think?
>

I have responded to this on the original -hackers thread [1] as I
think the test case might need some more changes.

[1] - https://www.postgresql.org/message-id/CAA4eK1JheT_MTmQiGfWZJeHSEJGzd-mgPPwBwA%3DYWjhcTfykPA%40mail.gmail.com

--
With Regards,
Amit Kapila.