Thread: Windows vs recovery tests
For some considerable time the recovery tests have been at best flaky on Windows, and at worst disastrous (i.e. they can hang rather than just fail). It's a problem I worked around on my buildfarm animals by disabling the tests, hoping to find time to get back to analysing the problem. But now we are seeing failures on the cfbot too (e.g. https://cirrus-ci.com/task/5860692694663168 and https://cirrus-ci.com/task/5316745152954368 ) so I think we need to spend some effort on finding out what's going on here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > For some considerable time the recovery tests have been at best flaky on > Windows, and at worst disastrous (i.e. they can hang rather than just > fail). How long is "some considerable time"? I'm wondering if this isn't the same issue under discussion in https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com regards, tom lane
Hello.
It is also could be related - https://www.postgresql.org/message-id/flat/20220112112425.pgzymqcgdy62e7m3%40jrouhaud#097b54a539ac3091ca4e4ed8ce9ab89c (both Windows and Linux cases.
Best regards,
Michail.
On 1/12/22 16:15, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> For some considerable time the recovery tests have been at best flaky on >> Windows, and at worst disastrous (i.e. they can hang rather than just >> fail). > How long is "some considerable time"? I'm wondering if this isn't > the same issue under discussion in > > https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com > > many months - this isn't a new thing. I'm going to set up a system where I run the test in a fairly tight loop and see if I can find out more. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2022-01-12 14:34:00 -0500, Andrew Dunstan wrote: > For some considerable time the recovery tests have been at best flaky on > Windows, and at worst disastrous (i.e. they can hang rather than just > fail). It's a problem I worked around on my buildfarm animals by > disabling the tests, hoping to find time to get back to analysing the > problem. But now we are seeing failures on the cfbot too (e.g. > https://cirrus-ci.com/task/5860692694663168 and > https://cirrus-ci.com/task/5316745152954368 ) so I think we need to > spend some effort on finding out what's going on here. I'm somewhat certain that this is caused by assertions or aborts hanging with a GUI popup, e.g. due to a check in the CRT. I saw these kind of hangs a lot in the aio development tree before I merged the changes to change error/abort handling on windows. Before the recent CI changes cfbot ran windows tests without assertions, which - besides just running fewer tests - explains having fewer such hang before, because there's more sources of such error popups in the debug CRT. It'd be nice if somebody could look at the patch and discussion in https://www.postgresql.org/message-id/20211005193033.tg4pqswgvu3hcolm%40alap3.anarazel.de The debugging information for the cirrus-ci tasks has a list of processes. E.g. for https://cirrus-ci.com/task/5860692694663168 there's 1 agent.exe 1 CExecSvc.exe 1 csrss.exe 1 fontdrvhost.exe 1 lsass.exe 1 msdtc.exe 1 psql.exe 1 services.exe 1 wininit.exe 9 cmd.exe 9 perl.exe 9 svchost.exe 49 postgres.exe processes. So we know that some tests were actually still in progress... It's particularly interesting that there's a psql process still hanging around... Before I "simplified" that away, the CI patch ran all tests with a shorter individual timeout than the overall CI timeout, so we'd see error logs etc. Perhaps that was a mistake to remove. IIRC I did something like "C:\Program Files\Git\usr\bin\timeout.exe" -v --kill-after=35m 30m perl path/to/vcregress.pl ... Perhaps worth re-adding? Greetings, Andres Freund
Hi, On 2022-01-12 15:58:26 -0800, Andres Freund wrote: > On 2022-01-12 14:34:00 -0500, Andrew Dunstan wrote: > > For some considerable time the recovery tests have been at best flaky on > > Windows, and at worst disastrous (i.e. they can hang rather than just > > fail). It's a problem I worked around on my buildfarm animals by > > disabling the tests, hoping to find time to get back to analysing the > > problem. But now we are seeing failures on the cfbot too (e.g. > > https://cirrus-ci.com/task/5860692694663168 and > > https://cirrus-ci.com/task/5316745152954368 ) so I think we need to > > spend some effort on finding out what's going on here. > > I'm somewhat certain that this is caused by assertions or aborts hanging with > a GUI popup, e.g. due to a check in the CRT. Oh, that was only about https://cirrus-ci.com/task/5860692694663168 not https://cirrus-ci.com/task/5316745152954368 Looking through the recent recovery failures that were just on windows, I see three different "classes" of recovery test failures: 1) Tests sometimes never finish, resulting in CI timing out 2) Tests sometimes finish, but t/001_stream_rep.pl fails 3) Tests fail with patch specific issues (e.g. 36/2096, 36/3461, 36/3459) From the cases I looked the failures in 1) always have a successful t/001_stream_rep.pl. This makes me think that we're likely at two separate types of problems? One might think that https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3464 conflicts with the above grouping. But all but the currently last failure were due a compiler warning in an earlier version of the patch. There's one interesting patch that also times out just on windows, albeit in another test group: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2096 This IMO looks likely to be a bug in psql introduced by that patch. Greetings, Andres Freund
Hi, On 2022-01-12 18:25:26 -0800, Andres Freund wrote: > There's one interesting patch that also times out just on windows, albeit in > another test group: > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2096 > > This IMO looks likely to be a bug in psql introduced by that patch. vcregress doesn't say which tests it's about to run unfortunately, but comparing a successful run (on another branch) says that the test running after pgbench are the psql tests. I pushed a branch to my github repository containing cfbot's commit and one that runs the psql tests in isolation, under a timeout... Which predictably failed. But at least we see the logs... https://cirrus-ci.com/task/6723083204558848?logs=psql_test_tcp#L15 based on the log files it looks like psql's 001_basic test did run # Test clean handling of unsupported replication command responses psql_like( $node, 'handling of unexpected PQresultStatus', 'START_REPLICATION 0/0', undef, qr/unexpected PQresultStatus: 8$/); https://api.cirrus-ci.com/v1/artifact/task/6723083204558848/log/src/bin/psql/tmp_check/log/001_basic_main.log 2022-01-13 03:28:45.973 GMT [604][walsender] [001_basic.pl][3/0:0] STATEMENT: START_REPLICATION 0/0 https://api.cirrus-ci.com/v1/artifact/task/6723083204558848/tap/src/bin/psql/tmp_check/log/regress_log_001_basic the last log entry in tap log is ok 23 - \help with argument: stdout matches So it looks like psql is hanging somewhere after that. I assume with an error popup that nobody can click on :/. Greetings, Andres Freund
Hi, On 2022-01-12 20:03:14 -0800, Andres Freund wrote: > So it looks like psql is hanging somewhere after that. I assume with an error > popup that nobody can click on :/. Not quite. Psql is actually just logging output in an endless loop. I connected with cdb.exe. kP: 00000000`007fd3c8 00007ffc`0d00f13a ntdll!NtWriteFile+0x14 00000000`007fd3d0 00007ffc`03978ec3 KERNELBASE!WriteFile+0x7a 00000000`007fd440 00007ffc`03979d21 ucrtbased!write_text_ansi_nolock( int fh = 0n2, char * buffer = 0x00000000`007febb0 " ???", unsigned int buffer_size = 1)+0x183 00000000`007fe8f0 00007ffc`039798a7 ucrtbased!_write_nolock( int fh = 0n2, void * buffer = 0x00000000`007febb0, unsigned int buffer_size = 1, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x451 00000000`007fea80 00007ffc`03920e1d ucrtbased!_write_internal( int fh = 0n2, void * buffer = 0x00000000`007febb0, unsigned int size = 1, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x377 00000000`007feb20 00007ffc`0392090e ucrtbased!write_buffer_nolock<char>( char c = 0n32 ' ', class __crt_stdio_stream stream = class __crt_stdio_stream, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x27d 00000000`007febb0 00007ffc`03921242 ucrtbased!common_flush_and_write_nolock<char>( int c = 0n32, class __crt_stdio_stream stream = class __crt_stdio_stream, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x22e 00000000`007fec20 00007ffc`038ddf5a ucrtbased!__acrt_stdio_flush_and_write_narrow_nolock( int c = 0n32, struct _iobuf * stream = 0x00007ffc`03a27ce0, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x32 00000000`007fec60 00007ffc`038dd5a3 ucrtbased!_fwrite_nolock_internal( void * buffer = 0x00000000`007ff020, unsigned int64 element_size = 1, unsigned int64 element_count = 7, struct _iobuf * public_stream = 0x00007ffc`03a27ce0, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x79a 00000000`007fed60 00007ffc`038dd426 ucrtbased!<lambda_26974eb511f701c600fccfa2a97a8e1b>::operator()(void)+0x73 00000000`007fedd0 00007ffc`038dd4a8 ucrtbased!__crt_seh_guarded_call<unsigned __int64>::operator()<<lambda_a2589f19c 515cac03caf6db9c38355e9>,<lambda_26974eb511f701c600fccfa2a97a8e1b> &,<lambda_ad9ce2f38261e34e8a422b9cc35dfe8d> >( class __acrt_lock_stream_and_call::__l2::<lambda_a2589f19c515cac03caf6db9c38355e9> * setup = 0x0 0000000`007fee58, class _fwrite_internal::__l2::<lambda_26974eb511f701c600fccfa2a97a8e1b> * action = 0x00000000`00 7feec0, class __acrt_lock_stream_and_call::__l2::<lambda_ad9ce2f38261e34e8a422b9cc35dfe8d> * cleanup = 0 x00000000`007fee50)+0x36 00000000`007fee10 00007ffc`038dd72d ucrtbased!__acrt_lock_stream_and_call<<lambda_26974eb511f701c600fccfa2a97a8e1b> >( struct _iobuf * stream = 0x00007ffc`03a27ce0, class _fwrite_internal::__l2::<lambda_26974eb511f701c600fccfa2a97a8e1b> * action = 0x00000000`00 7feec0)+0x58 00000000`007fee70 00007ffc`038de046 ucrtbased!_fwrite_internal( void * buffer = 0x00000000`007ff020, unsigned int64 size = 1, unsigned int64 count = 7, struct _iobuf * stream = 0x00007ffc`03a27ce0, class __crt_cached_ptd_host * ptd = 0x00000000`007fef40)+0x15d 00000000`007fef00 00000001`4004a639 ucrtbased!fwrite( void * buffer = 0x00000000`007ff020, unsigned int64 size = 1, unsigned int64 count = 7, struct _iobuf * stream = 0x00007ffc`03a27ce0)+0x56 00000000`007fef90 00000001`4004a165 psql!flushbuffer( struct PrintfTarget * target = 0x00000000`007feff8)+0x59 00000000`007fefd0 00000001`4004a1e6 psql!pg_vfprintf( struct _iobuf * stream = 0x00007ffc`03a27ce0, char * fmt = 0x00000001`40094268 "error: ", char * args = 0x00000000`007ff4a0 "@???")+0xa5 00000000`007ff450 00000001`40045962 psql!pg_fprintf( struct _iobuf * stream = 0x00007ffc`03a27ce0, char * fmt = 0x00000001`40094268 "error: ")+0x36 00000000`007ff490 00000001`40045644 psql!pg_log_generic_v( pg_log_level level = PG_LOG_ERROR (0n4), char * fmt = 0x00000001`40062e90 "unexpected PQresultStatus: %d", char * ap = 0x00000000`007ff540 "???")+0x302 00000000`007ff4f0 00000001`4000ef1f psql!pg_log_generic( pg_log_level level = PG_LOG_ERROR (0n4), char * fmt = 0x00000001`40062e90 "unexpected PQresultStatus: %d")+0x34 00000000`007ff530 00000001`4000e794 psql!AcceptResult( struct pg_result * result = 0x00000000`0015af90, bool show_error = false)+0x9f 00000000`007ff580 00000001`4000c8fe psql!SendQueryAndProcessResults( char * query = 0x00000000`00107570 "START_REPLICATION 0/0", double * pelapsed_msec = 0x00000000`007ff718, bool is_watch = false, struct printQueryOpt * opt = 0x00000000`00000000, struct _iobuf * printQueryFout = 0x00000000`00000000, bool * tx_ended = 0x00000000`007ff6b3)+0x1a4 00000000`007ff680 00000001`40024104 psql!SendQuery( char * query = 0x00000000`00107570 "START_REPLICATION 0/0")+0x42e 00000000`007ff750 00000001`40001524 psql!MainLoop( struct _iobuf * source = 0x00007ffc`03a27c30)+0xf84 00000000`007ff890 00000001`40032903 psql!process_file( char * filename = 0x00000001`400618e0 "<stdin>", bool use_relative_path = false)+0x274 00000000`007ffcf0 00000001`400503f9 psql!main( int argc = 0n8, char ** argv = 0x00000000`0012a750)+0xc43 00000000`007ffe10 00000001`4005034e psql!invoke_main(void)+0x39 00000000`007ffe60 00000001`4005020e psql!__scrt_common_main_seh(void)+0x12e 00000000`007ffed0 00000001`4005046e psql!__scrt_common_main(void)+0xe 00000000`007fff00 00007ffc`0d2a7974 psql!mainCRTStartup( void * __formal = 0x00000000`002e6000)+0xe 00000000`007fff30 00007ffc`0fe1a2f1 KERNEL32!BaseThreadInitThunk+0x14 00000000`007fff60 00000000`00000000 ntdll!RtlUserThreadStart+0x21 bp psql!pg_log_generic 0:000> k3 Child-SP RetAddr Call Site 00000000`007ff528 00000001`4000ef1f psql!pg_log_generic [C:\cirrus\src\common\logging.c @ 198] 00000000`007ff530 00000001`4000e794 psql!AcceptResult+0x9f [C:\cirrus\src\bin\psql\common.c @ 385] 00000000`007ff580 00000001`4000c8fe psql!SendQueryAndProcessResults+0x1a4 [C:\cirrus\src\bin\psql\common.c @ 1163] 0:000> g Breakpoint 0 hit psql!pg_log_generic: 00000001`40045610 4889542410 mov qword ptr [rsp+10h],rdx ss:00000000`007ff538=0000000100000006 0:000> k3 Child-SP RetAddr Call Site 00000000`007ff528 00000001`4000ef1f psql!pg_log_generic [C:\cirrus\src\common\logging.c @ 198] 00000000`007ff530 00000001`4000e794 psql!AcceptResult+0x9f [C:\cirrus\src\bin\psql\common.c @ 385] 00000000`007ff580 00000001`4000c8fe psql!SendQueryAndProcessResults+0x1a4 [C:\cirrus\src\bin\psql\common.c @ 1163] Ah, I see the bug. It's a use-after-free introduced in the patch: SendQueryAndProcessResults(const char *query, double *pelapsed_msec, bool is_watch, const printQueryOpt *opt, FILE *printQueryFout, bool *tx_ended) ... /* first result */ result = PQgetResult(pset.db); while (result != NULL) ... if (!AcceptResult(result, false)) { ... ClearOrSaveResult(result); success = false; /* and switch to next result */ result_status = PQresultStatus(result); if (result_status == PGRES_COPY_BOTH || result_status == PGRES_COPY_OUT || result_status == PGRES_COPY_IN) So we called ClearOrSaveResult() with did a PQclear(), and then we go and call PQresultStatus(). So this really is unrelated to CI. I'll mention this message in the other thread. Greetings, Andres Freund