promote request not received timely on slow Windows machines - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | promote request not received timely on slow Windows machines |
Date | |
Msg-id | c5df76d0-2456-9ba9-04b2-76aa142d65f7@gmail.com Whole thread Raw |
List | pgsql-hackers |
Hello hackers, Windows animal drongo reiterated a failure of 002_archiving_standby.pl [1] recently [2] and this pushed me to run the issue to ground. The log of the failed test contains: [17:11:11.519](0.001s) ok 3 - recovery_end_command not executed yet ### Promoting node "standby" # Running: pg_ctl -D C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata -l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log promote waiting for server to promote....................................................................................................................................................................................... stopped waiting pg_ctl: server did not promote in time [17:20:06.095](534.576s) Bail out! command "pg_ctl -D C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata -l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log promote" exited with value 1 That is, the test failed due to pg_ctl timeout (180 seconds in this case) reached. The standby log file contains: 2024-09-29 17:11:10.319 UTC [6408:3] LOG: recovery restart point at 0/3028BF8 2024-09-29 17:11:10.319 UTC [6408:4] DETAIL: Last completed transaction was at log time 2024-09-29 17:10:57.783965+00. The system cannot find the file specified. 2024-09-29 17:11:10.719 UTC [7440:5] 002_archiving.pl LOG: disconnection: session time: 0:00:00.488 user=pgrunner database=postgres host=127.0.0.1 port=62549 The system cannot find the file specified. ... The system cannot find the file specified. 2024-09-29 17:20:08.237 UTC [6684:4] LOG: received immediate shutdown request The system cannot find the file specified. The system cannot find the file specified. but it doesn't contain the expected log message (which I see when the test passes successfully): LOG: received promote request I reproduced this failure on a slowed-down Windows VM and found that StartupProcTriggerHandler(), which should set promote_signaled, might be not called for a long time, because of missing interrupt checks. Namely, the only place to check for interrupts during wal processing in this case is inside WaitForWALToBecomeAvailable(): if (!TimestampDifferenceExceeds(last_fail_time, now, wal_retrieve_retry_interval)) { ... elog(LOG, "waiting for WAL to become available at %X/%X", LSN_FORMAT_ARGS(RecPtr)); and (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, ...); ... } (elog() -> errfinish() -> CHECK_FOR_INTERRUPTS(); WaitLatch() -> WaitEventSetWait() -> WaitEventSetWaitBlock() -> pgwin32_dispatch_queued_signals()) With log_min_messages = DEBUG2, other places appear, e.g.: elog(DEBUG2, "switched WAL source from %s to %s after %s", and the test works reliably, but by default the only message logged there is "waiting for WAL...". So if the timestamp difference constantly exceeds wal_retrieve_retry_interval, which set to 100ms in the test, CHECK_FOR_INTERRUPTS() is not called at all. I had also wondered, why this test failed only on Windows so far, while we have even slower machines running Linux or FreeBSD. So I've run the test on Linux with wal_retrieve_retry_interval decreased to 1ms and a debug message added inside StartupProcTriggerHandler(), and got: 2024-10-01 07:33:22.301 UTC [3107649] LOG: !!!StartupProcTriggerHandler 2024-10-01 07:33:22.301 UTC [3107649] BACKTRACE: StartupProcTriggerHandler at startup.c:96:19 wrapper_handler at pqsignal.c:110:2 __restore_rt at ??:0 __GI___wait4 at wait4.c:30:10 do_system at system.c:171:11 RestoreArchivedFile at xlogarchive.c:177:7 existsTimeLineHistory at timeline.c:235:3 findNewestTimeLine at timeline.c:277:6 rescanLatestTimeLine at xlogrecovery.c:4115:14 WaitForWALToBecomeAvailable at xlogrecovery.c:3674:10 ... Thus, signals are processed inside the system() call on this platform, but this apparently is not so on Windows. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-07-26%2013%3A50%3A45 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-29%2015%3A30%3A27 Best regards, Alexander
pgsql-hackers by date: