Thread: pgsql: Add a view to show the stats of subscription workers.
Add a view to show the stats of subscription workers. This commit adds a new system view pg_stat_subscription_workers, that shows information about any errors which occur during the application of logical replication changes as well as during performing initial table synchronization. The subscription statistics entries are removed when the corresponding subscription is removed. It also adds an SQL function pg_stat_reset_subscription_worker() to reset single subscription errors. The contents of this view can be used by an upcoming patch that skips the particular transaction that conflicts with the existing data on the subscriber. This view can be extended in the future to track other xact related statistics like the number of xacts committed/aborted for subscription workers. Author: Masahiko Sawada Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip Kumar, Takamichi Osumi, Amit Kapila Discussion: https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/8d74fc96db5fd547e077bf9bf4c3b67f821d71cd Modified Files -------------- doc/src/sgml/monitoring.sgml | 157 +++++++++++ src/backend/catalog/system_functions.sql | 4 + src/backend/catalog/system_views.sql | 23 ++ src/backend/commands/subscriptioncmds.c | 16 +- src/backend/postmaster/pgstat.c | 409 +++++++++++++++++++++++++++- src/backend/replication/logical/worker.c | 54 +++- src/backend/utils/adt/pgstatfuncs.c | 128 ++++++++- src/include/catalog/catversion.h | 2 +- src/include/catalog/pg_proc.dat | 18 ++ src/include/pgstat.h | 109 +++++++- src/test/regress/expected/rules.out | 18 ++ src/test/subscription/t/026_worker_stats.pl | 154 +++++++++++ src/tools/pgindent/typedefs.list | 4 + 13 files changed, 1069 insertions(+), 27 deletions(-)
On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote: > > Add a view to show the stats of subscription workers. > sidewinder is showing failure in the test added by this patch: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 I'll investigate and share my findings. -- With Regards, Amit Kapila.
On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote: > > > > Add a view to show the stats of subscription workers. > > > > sidewinder is showing failure in the test added by this patch: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 > > I'll investigate and share my findings. I've looked at the issue. Here is my analysis: Looking at the result the test actually got, we had two error entries for test_tab1 instead of one: # Failed test 'check the error reported by the apply worker' # at t/026_worker_stats.pl line 33. # got: 'tap_sub|INSERT|test_tab1|t # tap_sub||test_tab1|t' # expected: 'tap_sub|INSERT|test_tab1|t' The possible scenarios are: The table sync worker for test_tab1 failed due to an error unrelated to apply changes: 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with OID 2 is already active for PID 23706 At this time, the view had one error entry for the table sync worker. After retrying table sync, it succeeded: 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab1" has finished Then after inserting a row on the publisher, the apply worker inserted the row but failed due to violating a unique key violation, which is expected: 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value violates unique constraint "test_tab1_pkey" 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists. 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data during "INSERT" for replication target relation "public.test_tab1" in transaction 721 at 2021-11-30 06:24:04.305096+01 As a result, we had two error entries for test_tab1: the table sync worker error and the apply worker error. I didn't expect that the table sync worker for test_tab1 failed due to "replication origin with OID 2 is already active for PID 23706” error. Looking at test_subscription_error() in 026_worker_stats.pl, we have two checks; in the first check, we wait for the view to show the error entry for the given relation name and xid. This check was passed since we had the second error (i.g., apply worker error). In the second check, we get error entries from pg_stat_subscription_workers by specifying only the relation name. Therefore, we ended up getting two entries and failed the tests. To fix this issue, I think that in the second check, we can get the error from pg_stat_subscription_workers by specifying the relation name *and* xid like the first check does. I've attached the patch. What do you think? FWIW, I'm still not investigating why "replication origin with OID 2 is already active for PID 23706" occurred. But regardless of the reason why this error occurred, the attached fix is necessary to make the tests robust. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Tue, Nov 30, 2021 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > FWIW, I'm still not investigating why "replication origin with OID 2 > is already active for PID 23706" occurred. > I think I know why this error occurred. In process 23706 (which is doing table sync for one of the tables) after allocating replication origin with OID 2, we have inserted the entry for same in pg_replication_origin and associated it with that session. Now, an expected error happened (Copy failure) which rolled back the transaction where we allocated and inserted the origin with OID 2. Before session 23706 could exit where we will disassociate the replication origin state from the current process, a new process (table sync for another table) allocated the replication origin OID 2 and tried to associate with its session where it gets the error (replication origin with OID 2 is already active for PID 23706). The next time it tried there was no such concurrent process to hold the OID and it succeeded. I think this is all as per our current implementation of table sync worker. We just need to be careful in the new test. I'll review your patch next. -- With Regards, Amit Kapila.
On Tue, Nov 30, 2021 at 12:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote: > > > > > > Add a view to show the stats of subscription workers. > > > > > > > sidewinder is showing failure in the test added by this patch: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 > > > > I'll investigate and share my findings. > > I've looked at the issue. Here is my analysis: > > Looking at the result the test actually got, we had two error entries > for test_tab1 instead of one: > > # Failed test 'check the error reported by the apply worker' > # at t/026_worker_stats.pl line 33. > # got: 'tap_sub|INSERT|test_tab1|t > # tap_sub||test_tab1|t' > # expected: 'tap_sub|INSERT|test_tab1|t' > > The possible scenarios are: > > The table sync worker for test_tab1 failed due to an error unrelated > to apply changes: > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with > OID 2 is already active for PID 23706 > > At this time, the view had one error entry for the table sync worker. > After retrying table sync, it succeeded: > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table > synchronization worker for subscription "tap_sub", table "test_tab1" > has finished > > Then after inserting a row on the publisher, the apply worker inserted > the row but failed due to violating a unique key violation, which is > expected: > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value > violates unique constraint "test_tab1_pkey" > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists. > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data > during "INSERT" for replication target relation "public.test_tab1" in > transaction 721 at 2021-11-30 06:24:04.305096+01 > > As a result, we had two error entries for test_tab1: the table sync > worker error and the apply worker error. I didn't expect that the > table sync worker for test_tab1 failed due to "replication origin with > OID 2 is already active for PID 23706” error. > > Looking at test_subscription_error() in 026_worker_stats.pl, we have > two checks; in the first check, we wait for the view to show the error > entry for the given relation name and xid. This check was passed since > we had the second error (i.g., apply worker error). In the second > check, we get error entries from pg_stat_subscription_workers by > specifying only the relation name. Therefore, we ended up getting two > entries and failed the tests. > > To fix this issue, I think that in the second check, we can get the > error from pg_stat_subscription_workers by specifying the relation > name *and* xid like the first check does. I've attached the patch. > What do you think? > I have responded to this on the original -hackers thread [1] as I think the test case might need some more changes. [1] - https://www.postgresql.org/message-id/CAA4eK1JheT_MTmQiGfWZJeHSEJGzd-mgPPwBwA%3DYWjhcTfykPA%40mail.gmail.com -- With Regards, Amit Kapila.