pg_basebackup WAL streamer shutdown is bogus - leading to slow tests - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | pg_basebackup WAL streamer shutdown is bogus - leading to slow tests |
Date | |
Msg-id | 20220116092207.me3kihxgjcbyshae@alap3.anarazel.de Whole thread Raw |
Responses |
Re: pg_basebackup WAL streamer shutdown is bogus - leading to slow tests
Re: pg_basebackup WAL streamer shutdown is bogus - leading to slow tests |
List | pgsql-hackers |
Hi, I was wondering in [1] what we could do about the slowest tests on windows. On 2021-12-31 11:25:28 -0800, Andres Freund wrote: > Picking a random successful cfbot run [1] I see the following tap tests taking > more than 20 seconds: > > 67188 ms pg_basebackup t/010_pg_basebackup.pl > 59710 ms recovery t/001_stream_rep.pl Comparing these times to measurements taken on my normal linux workstation, something seemed just *very* off, even with a slow CI instance and windows in the mix. A bunch of printf debugging later, I realized the problem is that several of the pg_basebackups in tests take a *long* time. E.g. for t/001_stream_rep.pl the backups from the standby each take just over 10s. That's awfully specific... # Taking pg_basebackup my_backup from node "standby_1" # Running: pg_basebackup -D C:/dev/postgres/./tmp_check/t_001_stream_rep_standby_1_data/backup/my_backup -h C:/Users/myadmin/AppData/Local/Temp/yba26PBYX1-p 59181 --checkpoint fast --no-sync --label my_backup -v # ran in 10.145s # Backup finished This reproduceably happens and it's *not* related to the socket shutdown() changes we've been debugging lately - even after a revert the problem persists. Because our logging for basebackups is quite weak, both for server and client side, I needed to add a fair bit more debugging to figure it out: pg_basebackup: wait to finish at 0.492 pg_basebackup: waiting for background process to finish streaming ... pg_basebackup: stream poll timeout 10.112 The problem is that there's just no implemented way to timely shutdown the WAL streaming thread in pg_basebackup. The code in pg_basebackup.c says: if (verbose) pg_log_info("waiting for background process to finish streaming ..."); ... /* * On Windows, since we are in the same process, we can just store the * value directly in the variable, and then set the flag that says * it's there. */ ... xlogendptr = ((uint64) hi) << 32 | lo; InterlockedIncrement(&has_xlogendptr); But just setting a variable doesn't do much if the thread is in HandleCopyStream()->CopyStreamPoll()->select() The only reason we ever succeed shutting down, without more WAL coming in, is that pg_basebackup defaults to sending a status message every 10 seconds. At which point the thread sees has_xlogendptr = true, and shuts down. A test specific workaround would be to just add --status-interval=1 to Cluster.pm::backup(). But that seems very unsatisfying. I don't immediately see a solution for this, other than to add StreamCtl->stop_event (mirroring ->stop_socket) and then convert CopyStreamPoll() to use WaitForMultipleObjects(). Microsoft's select() doesn't support pipes and there's no socketpair(). Any more straightforward ideas? From a cursory look at history, it used to be that pg_basebackup had this behaviour on all platforms, but it got fixed for other platforms in 7834d20b57a by Tom (assuming the problem wasn't present there). Greetings, Andres Freund [1] https://postgr.es/m/20211231192528.wirwj4qaaw3ted5g%40alap3.anarazel.de
pgsql-hackers by date: