Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Skipping logical replication transactions on subscriber side |
Date | |
Msg-id | CALDaNm2i8z-Uxs7gb00ZqzW0xfKSwq=Qy3QJRnBND0pOPwBhCQ@mail.gmail.com Whole thread Raw |
In response to | Re: Skipping logical replication transactions on subscriber side (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Tue, Nov 30, 2021 at 7:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 30, 2021 at 8:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Nov 30, 2021 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > I have pushed this patch and there is a buildfarm failure for it. See: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 > > > > > > Sawada-San has shared his initial analysis on pgsql-committers [1] and > > > I am responding here as the fix requires some more discussion. > > > > > > > Looking at the result the test actually got, we had two error entries > > > > for test_tab1 instead of one: > > > > > > > > # Failed test 'check the error reported by the apply worker' > > > > # at t/026_worker_stats.pl line 33. > > > > # got: 'tap_sub|INSERT|test_tab1|t > > > > # tap_sub||test_tab1|t' > > > > # expected: 'tap_sub|INSERT|test_tab1|t' > > > > > > > > The possible scenarios are: > > > > > > > > The table sync worker for test_tab1 failed due to an error unrelated > > > > to apply changes: > > > > > > > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with > > > > OID 2 is already active for PID 23706 > > > > > > > > At this time, the view had one error entry for the table sync worker. > > > > After retrying table sync, it succeeded: > > > > > > > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table > > > > synchronization worker for subscription "tap_sub", table "test_tab1" > > > > has finished > > > > > > > > Then after inserting a row on the publisher, the apply worker inserted > > > > the row but failed due to violating a unique key violation, which is > > > > expected: > > > > > > > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value > > > > violates unique constraint "test_tab1_pkey" > > > > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists. > > > > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data > > > > during "INSERT" for replication target relation "public.test_tab1" in > > > > transaction 721 at 2021-11-30 06:24:04.305096+01 > > > > > > > > As a result, we had two error entries for test_tab1: the table sync > > > > worker error and the apply worker error. I didn't expect that the > > > > table sync worker for test_tab1 failed due to "replication origin with > > > > OID 2 is already active for PID 23706” error. > > > > > > > > Looking at test_subscription_error() in 026_worker_stats.pl, we have > > > > two checks; in the first check, we wait for the view to show the error > > > > entry for the given relation name and xid. This check was passed since > > > > we had the second error (i.g., apply worker error). In the second > > > > check, we get error entries from pg_stat_subscription_workers by > > > > specifying only the relation name. Therefore, we ended up getting two > > > > entries and failed the tests. > > > > > > > > To fix this issue, I think that in the second check, we can get the > > > > error from pg_stat_subscription_workers by specifying the relation > > > > name *and* xid like the first check does. I've attached the patch. > > > > What do you think? > > > > > > > > > > I think this will fix the reported failure but there is another race > > > condition in the test. Isn't it possible that for table test_tab2, we > > > get an error "replication origin with OID ..." or some other error > > > before copy, in that case also, we will proceed from the second call > > > of test_subscription_error() which is not what we expect in the test? > > > > Right. > > > > > Shouldn't we someway check that the error message also starts with > > > "duplicate key value violates ..."? > > > > Yeah, I think it's a good idea to make the checks more specific. That > > is, probably we can specify the prefix of the error message and > > subrelid in addition to the current conditions: relid and xid. That > > way, we can check what error was reported by which workers (tablesync > > or apply) for which relations. And both check queries in > > test_subscription_error() can have the same WHERE clause. > > I've attached a patch that fixes this issue. Please review it. Thanks for the updated patch, the patch applies neatly and make check-world passes. Also I ran the failing test in a loop and found it to be passing always. Regards, Vignesh
pgsql-hackers by date: