Race condition in FetchTableStates() breaks synchronization of subscription tables - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Race condition in FetchTableStates() breaks synchronization of subscription tables |
Date | |
Msg-id | 711a6afe-edb7-1211-cc27-1bef8239eec7@gmail.com Whole thread Raw |
Responses |
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
|
List | pgsql-hackers |
Hello hackers, After determining a possible cause for intermittent failures of the test subscription/031_column_list [1], I was wondering what makes another subscription test (014_binary) fail on the buildfarm: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39 All those failures caused by a timeout when waiting for a message expected in _subscriber.log. For example, in the snakefly's case: [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269. _subscriber.log contains: 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; 2023-12-21 01:17:46.756 UTC [409999] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical replication apply worker" (PID 409999) exited with exit code 1 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply worker for subscription "tsub" has started ... While _subscriber.log from a successful test run contains: 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres database=postgres host=[local] 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply worker for subscription "tsub" has started 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table synchronization worker for subscription "tsub", table "test_mismatching_types" has started 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data format 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY test_mismatching_types, line 1, column a In this case, "logical replication apply worker for subscription "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later. I've managed to reproduce this failure locally by running multiple tests in parallel, and my analysis shows that it is caused by a race condition when accessing variable table_states_valid inside tablesync.c. tablesync.c does the following with table_states_valid: /* * Callback from syscache invalidation. */ void invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue) { table_states_valid = false; } ... static bool FetchTableStates(bool *started_tx) { ... if (!table_states_valid) { ... /* Fetch all non-ready tables. */ rstates = GetSubscriptionRelations(MySubscription->oid, true); ... table_states_valid = true; } So, when syscache invalidation occurs inside the code block "if (!table_states_valid)", that invalidation is effectively ignored. In a failed case I observe the following events: 1. logical replication apply worker performs LogicalRepApplyLoop() -> process_syncing_tables() -> process_syncing_tables_for_apply() -> FetchTableStates() periodically. 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache for SUBSCRIPTIONRELMAP, and that leads to calling invalidate_syncing_table_states(). 3. If the apply worker manages to fetch no non-ready tables in FetchTableStates() and ignore "table_states_valid = false" from invalidate_syncing_table_states(), then it just misses the invalidation event, so it keeps working without noticing non-ready tables appeared as the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips a loop "foreach(lc, table_states_not_ready) ..." until some other event occurs). pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my analysis -- without it, I need tens of iterations (with 50 tests running in parallel) to catch the failure, but with it, I get the failure on the first iteration. (Reproduced on REL_16_STABLE..master, where the test 014_binary tries to "Refresh the publication to trigger the tablesync", but I think the race condition exists in the previous versions as well.) [1] https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com Best regards, Alexander
pgsql-hackers by date: