On Tue, Nov 30, 2021 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 30, 2021 at 9:05 AM Amit Kapila <akapila@postgresql.org> wrote:
> >
> > Add a view to show the stats of subscription workers.
> >
>
> sidewinder is showing failure in the test added by this patch:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25
>
> I'll investigate and share my findings.
I've looked at the issue. Here is my analysis:
Looking at the result the test actually got, we had two error entries
for test_tab1 instead of one:
# Failed test 'check the error reported by the apply worker'
# at t/026_worker_stats.pl line 33.
# got: 'tap_sub|INSERT|test_tab1|t
# tap_sub||test_tab1|t'
# expected: 'tap_sub|INSERT|test_tab1|t'
The possible scenarios are:
The table sync worker for test_tab1 failed due to an error unrelated
to apply changes:
2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with
OID 2 is already active for PID 23706
At this time, the view had one error entry for the table sync worker.
After retrying table sync, it succeeded:
2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab1"
has finished
Then after inserting a row on the publisher, the apply worker inserted
the row but failed due to violating a unique key violation, which is
expected:
2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value
violates unique constraint "test_tab1_pkey"
2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists.
2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data
during "INSERT" for replication target relation "public.test_tab1" in
transaction 721 at 2021-11-30 06:24:04.305096+01
As a result, we had two error entries for test_tab1: the table sync
worker error and the apply worker error. I didn't expect that the
table sync worker for test_tab1 failed due to "replication origin with
OID 2 is already active for PID 23706” error.
Looking at test_subscription_error() in 026_worker_stats.pl, we have
two checks; in the first check, we wait for the view to show the error
entry for the given relation name and xid. This check was passed since
we had the second error (i.g., apply worker error). In the second
check, we get error entries from pg_stat_subscription_workers by
specifying only the relation name. Therefore, we ended up getting two
entries and failed the tests.
To fix this issue, I think that in the second check, we can get the
error from pg_stat_subscription_workers by specifying the relation
name *and* xid like the first check does. I've attached the patch.
What do you think?
FWIW, I'm still not investigating why "replication origin with OID 2
is already active for PID 23706" occurred. But regardless of the
reason why this error occurred, the attached fix is necessary to make
the tests robust.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/