Thread: promote request not received timely on slow Windows machines

promote request not received timely on slow Windows machines

From
Alexander Lakhin
Date:
Hello hackers,

Windows animal drongo reiterated a failure of 002_archiving_standby.pl [1]
recently [2] and this pushed me to run the issue to ground.

The log of the failed test contains:
[17:11:11.519](0.001s) ok 3 - recovery_end_command not executed yet
### Promoting node "standby"
# Running: pg_ctl -D 
C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata

-l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log
promote
waiting for server to 

promote.......................................................................................................................................................................................

stopped waiting
pg_ctl: server did not promote in time
[17:20:06.095](534.576s) Bail out!  command "pg_ctl -D 
C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata

-l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log
promote"
 
exited with value 1

That is, the test failed due to pg_ctl timeout (180 seconds in this case)
reached. The standby log file contains:
2024-09-29 17:11:10.319 UTC [6408:3] LOG:  recovery restart point at 0/3028BF8
2024-09-29 17:11:10.319 UTC [6408:4] DETAIL:  Last completed transaction was at log time 2024-09-29
17:10:57.783965+00.
The system cannot find the file specified.
2024-09-29 17:11:10.719 UTC [7440:5] 002_archiving.pl LOG: disconnection: session time: 0:00:00.488 user=pgrunner 
database=postgres host=127.0.0.1 port=62549
The system cannot find the file specified.
...
The system cannot find the file specified.
2024-09-29 17:20:08.237 UTC [6684:4] LOG:  received immediate shutdown request
The system cannot find the file specified.
The system cannot find the file specified.

but it doesn't contain the expected log message (which I see when the test
passes successfully):
LOG:  received promote request

I reproduced this failure on a slowed-down Windows VM and found that
StartupProcTriggerHandler(), which should set promote_signaled, might be
not called for a long time, because of missing interrupt checks.

Namely, the only place to check for interrupts during wal processing
in this case is inside WaitForWALToBecomeAvailable():
                     if (!TimestampDifferenceExceeds(last_fail_time, now,
wal_retrieve_retry_interval))
                     {
...
                         elog(LOG, "waiting for WAL to become available at %X/%X",
                              LSN_FORMAT_ARGS(RecPtr));
and
                         (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, ...);
...
                     }

(elog() -> errfinish() -> CHECK_FOR_INTERRUPTS();
WaitLatch() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
  pgwin32_dispatch_queued_signals())

With log_min_messages = DEBUG2, other places appear, e.g.:
             elog(DEBUG2, "switched WAL source from %s to %s after %s",

and the test works reliably, but by default the only message logged
there is "waiting for WAL...".

So if the timestamp difference constantly exceeds
wal_retrieve_retry_interval, which set to 100ms in the test,
CHECK_FOR_INTERRUPTS() is not called at all.

I had also wondered, why this test failed only on Windows so far, while we
have even slower machines running Linux or FreeBSD. So I've run the test
on Linux with wal_retrieve_retry_interval decreased to 1ms and a debug
message added inside StartupProcTriggerHandler(), and got:
2024-10-01 07:33:22.301 UTC [3107649] LOG: !!!StartupProcTriggerHandler
2024-10-01 07:33:22.301 UTC [3107649] BACKTRACE:
StartupProcTriggerHandler at startup.c:96:19
wrapper_handler at pqsignal.c:110:2
__restore_rt at ??:0
__GI___wait4 at wait4.c:30:10
do_system at system.c:171:11
RestoreArchivedFile at xlogarchive.c:177:7
existsTimeLineHistory at timeline.c:235:3
findNewestTimeLine at timeline.c:277:6
rescanLatestTimeLine at xlogrecovery.c:4115:14
WaitForWALToBecomeAvailable at xlogrecovery.c:3674:10
...

Thus, signals are processed inside the system() call on this platform, but
this apparently is not so on Windows.


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-07-26%2013%3A50%3A45
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-29%2015%3A30%3A27

Best regards,
Alexander