Re: Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Race condition in FetchTableStates() breaks synchronization of subscription tables |
Date | |
Msg-id | CALDaNm1n-EHOAU7MMoVCEBOmX_V40dfpB1XTpETDk4qQivXXUg@mail.gmail.com |
In response to | RE: Race condition in FetchTableStates() breaks synchronization of subscription tables ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Responses | RE: Race condition in FetchTableStates() breaks synchronization of subscription tables |
List | pgsql-hackers |
On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> > >
> > > Hello hackers,
> > >
> > > After determining a possible cause for intermittent failures of the
> > > test subscription/031_column_list [1], I was wondering what makes
> > > another subscription test (014_binary) fail on the buildfarm:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39
> > >
> > > All those failures are caused by a timeout when waiting for a message
> > > expected in _subscriber.log. For example, in the snakefly's case:
> > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
> > > timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format)
> > > at /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.
> > >
> > > _subscriber.log contains:
> > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > >     ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > 2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data from WAL stream:
> > >     server closed the connection unexpectedly
> > >     This probably means the server terminated abnormally
> > >     before or while processing the request.
> > > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical replication
> > >     apply worker" (PID 409999) exited with exit code 1
> > > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker for
> > >     subscription "tsub" has started
> > > ...
> > >
> > > While _subscriber.log from a successful test run contains:
> > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > >     ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG:  disconnection: session time:
> > >     0:00:00.014 user=postgres database=postgres host=[local]
> > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker for
> > >     subscription "tsub" has started
> > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table synchronization
> > >     worker for subscription "tsub", table "test_mismatching_types" has started
> > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
> > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, line 1, column a
> > >
> > > In this case, "logical replication apply worker for subscription "tsub" has started"
> > > appears just after "ALTER SUBSCRIPTION", not 3 minutes later.
> > >
> > > I've managed to reproduce this failure locally by running multiple tests in
> > > parallel, and my analysis shows that it is caused by a race condition when
> > > accessing the variable table_states_valid inside tablesync.c.
> > >
> > > tablesync.c does the following with table_states_valid:
> > > /*
> > >  * Callback from syscache invalidation.
> > >  */
> > > void
> > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
> > > {
> > >     table_states_valid = false;
> > > }
> > > ...
> > > static bool
> > > FetchTableStates(bool *started_tx)
> > > {
> > > ...
> > >     if (!table_states_valid)
> > >     {
> > > ...
> > >         /* Fetch all non-ready tables. */
> > >         rstates = GetSubscriptionRelations(MySubscription->oid, true);
> > > ...
> > >         table_states_valid = true;
> > >     }
> > >
> > > So, when syscache invalidation occurs inside the code block
> > > "if (!table_states_valid)", that invalidation is effectively ignored.
> > >
> > > In a failed case I observe the following events:
> > > 1. The logical replication apply worker performs LogicalRepApplyLoop() ->
> > > process_syncing_tables() -> process_syncing_tables_for_apply() ->
> > > FetchTableStates() periodically.
> > >
> > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates the syscache
> > > for SUBSCRIPTIONRELMAP, and that leads to calling
> > > invalidate_syncing_table_states().
> > >
> > > 3. If the apply worker manages to fetch no non-ready tables in
> > > FetchTableStates() and ignores the "table_states_valid = false" from
> > > invalidate_syncing_table_states(), then it just misses the invalidation
> > > event, so it keeps working without noticing that non-ready tables appeared
> > > as the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply()
> > > skips the loop "foreach(lc, table_states_not_ready) ..." until some other
> > > event occurs).
> > >
> > > pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
> > > analysis -- without it, I need tens of iterations (with 50 tests running in
> > > parallel) to catch the failure, but with it, I get the failure on the first
> > > iteration.
> >
> > Thanks for the analysis, I was able to reproduce the issue with the steps you
> > had shared. I agree with your analysis. I added some logs to verify that the
> > invalidation was getting missed.
> >
> > I felt that this invalidation is getting ignored because we have used a boolean
> > variable here; how about changing it slightly so that table_states_invalid gets
> > incremented for every invalidation and then decremented after getting the
> > non-ready tables, like in the attached patch? I was able to verify that the
> > test passes with the attached patch.
>
> Thanks for the patch.
>
> I am not sure if counting the invalidation number is needed, as even if there
> are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
> call should reset the count to 0 as it is checking the latest catalog.

Another approach I was thinking of is to reset table_states_valid immediately at
the beginning of FetchTableStates, so any new invalidation will take care of
setting table_states_valid again, and that will be handled in the next iteration
of fetching the non-ready tables, like in the attached patch.
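To make the idea concrete, the ordering would look roughly like the sketch below
(a simplified illustration only, not the attached patch itself; memory-context
handling and the rebuilding of table_states_not_ready are elided):

    static bool
    FetchTableStates(bool *started_tx)
    {
        static bool has_subrels = false;

        *started_tx = false;

        if (!table_states_valid)
        {
            List       *rstates;

            if (!IsTransactionState())
            {
                StartTransactionCommand();
                *started_tx = true;
            }

            /*
             * Mark the state valid *before* reading pg_subscription_rel.  If
             * invalidate_syncing_table_states() fires while we are fetching,
             * it sets table_states_valid back to false and the next call
             * re-reads the relation states; with the assignment placed after
             * the fetch, as today, that invalidation is simply overwritten
             * and lost.
             */
            table_states_valid = true;

            /* Fetch all non-ready tables. */
            rstates = GetSubscriptionRelations(MySubscription->oid, true);

            /* ... copy rstates into table_states_not_ready ... */

            has_subrels = (rstates != NIL) ||
                HasSubscriptionRelations(MySubscription->oid);
        }

        return has_subrels;
    }

With this ordering, the worst case is one redundant re-read of the relation
states on the next call (when an invalidation arrives even though the fetch has
already seen the new state), which seems acceptable and avoids the need for an
invalidation counter.

Regards,
Vignesh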
Attachment