Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Skipping logical replication transactions on subscriber side |
Date | |
Msg-id | CAA4eK1JheT_MTmQiGfWZJeHSEJGzd-mgPPwBwA=YWjhcTfykPA@mail.gmail.com Whole thread Raw |
In response to | Re: Skipping logical replication transactions on subscriber side (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Skipping logical replication transactions on subscriber side
|
List | pgsql-hackers |
On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignesh21@gmail.com> wrote: > I have pushed this patch and there is a buildfarm failure for it. See: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25 Sawada-San has shared his initial analysis on pgsql-committers [1] and I am responding here as the fix requires some more discussion. > Looking at the result the test actually got, we had two error entries > for test_tab1 instead of one: > > # Failed test 'check the error reported by the apply worker' > # at t/026_worker_stats.pl line 33. > # got: 'tap_sub|INSERT|test_tab1|t > # tap_sub||test_tab1|t' > # expected: 'tap_sub|INSERT|test_tab1|t' > > The possible scenarios are: > > The table sync worker for test_tab1 failed due to an error unrelated > to apply changes: > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with > OID 2 is already active for PID 23706 > > At this time, the view had one error entry for the table sync worker. > After retrying table sync, it succeeded: > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table > synchronization worker for subscription "tap_sub", table "test_tab1" > has finished > > Then after inserting a row on the publisher, the apply worker inserted > the row but failed due to violating a unique key violation, which is > expected: > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value > violates unique constraint "test_tab1_pkey" > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists. > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data > during "INSERT" for replication target relation "public.test_tab1" in > transaction 721 at 2021-11-30 06:24:04.305096+01 > > As a result, we had two error entries for test_tab1: the table sync > worker error and the apply worker error. I didn't expect that the > table sync worker for test_tab1 failed due to "replication origin with > OID 2 is already active for PID 23706” error. > > Looking at test_subscription_error() in 026_worker_stats.pl, we have > two checks; in the first check, we wait for the view to show the error > entry for the given relation name and xid. This check was passed since > we had the second error (i.g., apply worker error). In the second > check, we get error entries from pg_stat_subscription_workers by > specifying only the relation name. Therefore, we ended up getting two > entries and failed the tests. > > To fix this issue, I think that in the second check, we can get the > error from pg_stat_subscription_workers by specifying the relation > name *and* xid like the first check does. I've attached the patch. > What do you think? > I think this will fix the reported failure but there is another race condition in the test. Isn't it possible that for table test_tab2, we get an error "replication origin with OID ..." or some other error before copy, in that case also, we will proceed from the second call of test_subscription_error() which is not what we expect in the test? Shouldn't we someway check that the error message also starts with "duplicate key value violates ..."? [1] - https://www.postgresql.org/message-id/CAD21AoChP5wOT2AYziF%2B-j7vvThF2NyAs7wr%2Byy%2B8hsnu%3D8Rgg%40mail.gmail.com -- With Regards, Amit Kapila.
pgsql-hackers by date: