Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Skipping logical replication transactions on subscriber side |
Date | |
Msg-id | CAD21AoBu2f2R6Bc9qWpky8v_h822zdkm0419j60Ph_T9ciaVsw@mail.gmail.com Whole thread Raw |
In response to | Re: Skipping logical replication transactions on subscriber side (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Skipping logical replication transactions on subscriber side
|
List | pgsql-hackers |
On Tue, Nov 30, 2021 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignesh21@gmail.com> wrote: > > > > I have pushed this patch and there is a buildfarm failure for it. See: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 > > Sawada-San has shared his initial analysis on pgsql-committers [1] and > I am responding here as the fix requires some more discussion. > > > Looking at the result the test actually got, we had two error entries > > for test_tab1 instead of one: > > > > # Failed test 'check the error reported by the apply worker' > > # at t/026_worker_stats.pl line 33. > > # got: 'tap_sub|INSERT|test_tab1|t > > # tap_sub||test_tab1|t' > > # expected: 'tap_sub|INSERT|test_tab1|t' > > > > The possible scenarios are: > > > > The table sync worker for test_tab1 failed due to an error unrelated > > to apply changes: > > > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with > > OID 2 is already active for PID 23706 > > > > At this time, the view had one error entry for the table sync worker. > > After retrying table sync, it succeeded: > > > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table > > synchronization worker for subscription "tap_sub", table "test_tab1" > > has finished > > > > Then after inserting a row on the publisher, the apply worker inserted > > the row but failed due to violating a unique key violation, which is > > expected: > > > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value > > violates unique constraint "test_tab1_pkey" > > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists. > > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data > > during "INSERT" for replication target relation "public.test_tab1" in > > transaction 721 at 2021-11-30 06:24:04.305096+01 > > > > As a result, we had two error entries for test_tab1: the table sync > > worker error and the apply worker error. I didn't expect that the > > table sync worker for test_tab1 failed due to "replication origin with > > OID 2 is already active for PID 23706” error. > > > > Looking at test_subscription_error() in 026_worker_stats.pl, we have > > two checks; in the first check, we wait for the view to show the error > > entry for the given relation name and xid. This check was passed since > > we had the second error (i.g., apply worker error). In the second > > check, we get error entries from pg_stat_subscription_workers by > > specifying only the relation name. Therefore, we ended up getting two > > entries and failed the tests. > > > > To fix this issue, I think that in the second check, we can get the > > error from pg_stat_subscription_workers by specifying the relation > > name *and* xid like the first check does. I've attached the patch. > > What do you think? > > > > I think this will fix the reported failure but there is another race > condition in the test. Isn't it possible that for table test_tab2, we > get an error "replication origin with OID ..." or some other error > before copy, in that case also, we will proceed from the second call > of test_subscription_error() which is not what we expect in the test? Right. > Shouldn't we someway check that the error message also starts with > "duplicate key value violates ..."? Yeah, I think it's a good idea to make the checks more specific. That is, probably we can specify the prefix of the error message and subrelid in addition to the current conditions: relid and xid. That way, we can check what error was reported by which workers (tablesync or apply) for which relations. And both check queries in test_subscription_error() can have the same WHERE clause. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: