Thread: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows
The following bug has been logged on the website:

Bug reference:      16678
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.0
Operating system:   Windows
Description:

When executing `call vcregress ecpgcheck` (under a heavy load), I sometimes get connect/test5 failures:

test connect/test4              ... ok               67 ms
test connect/test5              ... stderr FAILED   945 ms
test pgtypeslib/dt_test         ... ok              271 ms

regression.diffs contains:

diff -w -U3 .../src/interfaces/ecpg/test/expected/connect-test5.stderr .../src/interfaces/ecpg/test/results/connect-test5.stderr
--- .../src/interfaces/ecpg/test/expected/connect-test5.stderr	2020-10-13 21:51:14 +0300
+++ .../src/interfaces/ecpg/test/results/connect-test5.stderr	2020-10-19 19:31:41 +0300
@@ -36,7 +36,9 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT> for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL: database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed

postmaster.log contains:
...
2020-10-19 19:31:40.150 MSK client backend[928] ecpg/connect-test4 LOG:  could not receive data from client: An existing connection was forcibly closed by the remote host.
2020-10-19 19:31:40.677 MSK client backend[2032] [unknown] FATAL:  database "regress_ecpg_user2" does not exist
2020-10-19 19:31:41.052 MSK client backend[2532] [unknown] FATAL:  database "regress_ecpg_user2" does not exist
2020-10-19 19:31:49.834 MSK client backend[2544] ecpg/preproc-whenever ERROR:  relation "nonexistent" does not exist at character 15
...

So the server didn't crash; it logged the expected message correctly.

The following delay added to pqReadData (src/interfaces/libpq/fe-misc.c):

@@ -662,6 +662,8 @@ pqReadData(PGconn *conn)
 		}
 	}

+	pg_usleep(10000L);			/* wait for 10 msec */
+
 	/* OK, try to read some data */

retry3:
	nread = pqsecure_read(conn, conn->inBuffer + conn->inEnd,

makes the failure reproduce reliably (on each run) for me.

This also affects the pg_dump test:

vcregress taptest src/bin/pg_dump
...
t/001_basic.pl ................ ok
t/002_pg_dump.pl .............. 1/6315
#   Failed test 'connecting to a non-existent database: matches'
#   at t/002_pg_dump.pl line 3460.
#                   'pg_dump: error: connection to database "qqq" failed: server closed the connection unexpectedly
#         This probably means the server terminated abnormally
#         before or while processing the request.
# '
#     doesn't match '(?^:pg_dump\:\ error\:\ connection\ to\ database\ \"qqq\"\ failed\:\ FATAL\:\ \ database\ \"qqq\"\ does\ not\ exist)'
t/002_pg_dump.pl .............. 6113/6315
# Looks like you failed 1 test of 6315.
t/002_pg_dump.pl .............. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/6315 subtests
t/003_pg_dump_with_server.pl .. ok
...

On Linux the same delay (and even 100 msec) doesn't cause these test failures. It seems that on Windows the postmaster process can close a connection before a slow client receives the FATAL message.
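For illustration only (not part of the report above): the same symptom can
be observed with a minimal standalone libpq client like the sketch below.
The connection string and database name are placeholders; with the client
side slowed down as described, the error text on Windows can come out as
"server closed the connection unexpectedly" instead of the FATAL message.

	#include <stdio.h>
	#include <libpq-fe.h>

	int
	main(void)
	{
		/* "nonexistent_db" is a placeholder; any missing database will do */
		PGconn	   *conn = PQconnectdb("dbname=nonexistent_db");

		if (PQstatus(conn) != CONNECTION_OK)
			fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

		PQfinish(conn);
		return 0;
	}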
PG Bug reporting form <noreply@postgresql.org> writes:
> It seems that on Windows the postmaster process can close a connection
> before a slow client receives the FATAL message.

Yeah, we've come to that conclusion before, see for instance

https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9%40gemulon.postgresql.org

I haven't heard any non-unpleasant ideas for working around it.

One thought that might or might not improve matters is for the backend to
explicitly close() the socket before exiting. We intentionally don't do
that (cf 268313a95), but maybe doing it on Windows would be better than the
existing issue --- assuming it fixes the issue, that is.

			regards, tom lane
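For concreteness, the suggestion above would amount to something like the
following Windows-only fragment in the backend's socket-cleanup path. This
is only a sketch of the idea, not the actual patch, and the exact placement
and guard conditions are assumptions.

	#ifdef WIN32
		/* Explicitly close the client socket before process exit (sketch). */
		if (MyProcPort != NULL && MyProcPort->sock != PGINVALID_SOCKET)
		{
			closesocket(MyProcPort->sock);
			MyProcPort->sock = PGINVALID_SOCKET;
		}
	#endif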
19.10.2020 22:18, Tom Lane wrote:
> Yeah, we've come to that conclusion before, see for instance
> https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9%40gemulon.postgresql.org
>
> I haven't heard any non-unpleasant ideas for working around it.
>
> One thought that might or might not improve matters is for
> the backend to explicitly close() the socket before exiting.
> We intentionally don't do that (cf 268313a95), but maybe
> doing it on Windows would be better than the existing issue
> --- assuming it fixes the issue, that is.

Yes, close() fixes the issue for me, but I'm afraid it may not be 100%
reliable (Amit Kapila talked about this in the aforementioned discussion [1]).

Microsoft suggests using shutdown() to ensure that all data is sent:
https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-shutdown

The same is said in a not-so-new, but perhaps still relevant, article:
https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

So I've also tested the following version on Windows:

	secure_close(MyProcPort);

	shutdown(MyProcPort->sock, SD_SEND);
	for (;;)
	{
		char	buffer[1000];
		int		res = recv(MyProcPort->sock, buffer, 1000, 0);

		if (res <= 0)
			break;
	}
	closesocket(MyProcPort->sock);

And it works too.

On my Ubuntu localhost I can't reproduce losing a message (up to 1 MB) on
FATAL exit, so in 2020 such socket cherishing is probably needed only for
Windows. But I concede that on some other systems and/or over a real
network, data sent just before exit could still be lost.

Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
> So I've also tested the following version on Windows:
>	secure_close(MyProcPort);
>	shutdown(MyProcPort->sock, SD_SEND);
>	for (;;)
>	{
>		char	buffer[1000];
>		int		res = recv(MyProcPort->sock, buffer, 1000, 0);
>		if (res <= 0)
>			break;
>	}
>	closesocket(MyProcPort->sock);
> And it works too.

I'm afraid this cure is probably worse than the disease, because now the
backend's exiting is held hostage by whether the client closes its socket
(and the resulting FIN reaches us, which it might not if there's a network
problem).

It might be safe to do the shutdown() only, since Microsoft specifies that
that doesn't block, but I have no idea if that would improve matters.

			regards, tom lane
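The "shutdown() only" variant mentioned above would look roughly like the
sketch below; again this is just an illustration of the idea under the
assumption that it runs in the backend's exit path, not a proposed patch.

	#ifdef WIN32
		/*
		 * Half-close the sending side so queued output (such as the FATAL
		 * message) gets flushed, without waiting for the client's FIN.
		 * Per the Winsock documentation, shutdown() does not block.
		 */
		shutdown(MyProcPort->sock, SD_SEND);
	#endif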
24.10.2020 20:39, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> So I've also tested the following version on Windows:
>>	secure_close(MyProcPort);
>>	shutdown(MyProcPort->sock, SD_SEND);
>>	for (;;)
>>	{
>>		char	buffer[1000];
>>		int		res = recv(MyProcPort->sock, buffer, 1000, 0);
>>		if (res <= 0)
>>			break;
>>	}
>>	closesocket(MyProcPort->sock);
>> And it works too.
> I'm afraid this cure is probably worse than the disease, because
> now the backend's exiting is held hostage by whether the client
> closes its socket (and the resulting FIN reaches us, which it
> might not if there's a network problem).

On the next level of this game we could do something like this:

	secure_close(MyProcPort);

	timeout = 1000;				/* 1 sec */
	setsockopt(MyProcPort->sock, SOL_SOCKET, SO_RCVTIMEO,
			   (char *) &timeout, sizeof(timeout));

	shutdown(MyProcPort->sock, SD_SEND);
	for (;;)
	{
		char	buffer[1000];
		int		res = recv(MyProcPort->sock, buffer, 1000, 0);

		if (res <= 0)
			break;
	}
	closesocket(MyProcPort->sock);

(I can't get this timeout working yet.)

But I'm inclined to stay on the previous level, with "shutdown &&
closesocket", as recommended for the server side:
https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2

Please look at the draft patch.

By the way, the delay added to pqReadData():

	pg_usleep(10000L);			/* wait for 10 msec */

makes the src/test/recovery test fail on Linux too (pg_ctl can't stop the
primary node). So it seems that slow reading from a socket breaks things
not only on Windows, and the fix should be more comprehensive.

Best regards,
Alexander
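Regarding "I can't get this timeout working yet": one thing that may be
worth double-checking (offered only as a guess) is the type of the option
value, since Winsock expects the SO_RCVTIMEO value as a DWORD in
milliseconds while POSIX systems expect a struct timeval. A sketch of both
spellings, with error handling omitted:

	#ifdef WIN32
		DWORD		timeout = 1000;			/* milliseconds */

		setsockopt(MyProcPort->sock, SOL_SOCKET, SO_RCVTIMEO,
				   (const char *) &timeout, sizeof(timeout));
	#else
		struct timeval timeout = {1, 0};	/* 1 second */

		setsockopt(MyProcPort->sock, SOL_SOCKET, SO_RCVTIMEO,
				   (const char *) &timeout, sizeof(timeout));
	#endif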
Alexander Lakhin <exclusion@gmail.com> writes:
> But I'm inclined to stay on the previous level, with "shutdown &&
> closesocket", as recommended for the server side:
> https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
> Please look at the draft patch.

If that actually fixes things, I'm okay with it --- do you find that it
prevents the test failure even with client-side delays?

			regards, tom lane
25.10.2020 00:07, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> But I'm inclined to stay on the previous level, with "shutdown &&
>> closesocket", as recommended for the server side:
>> https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
>> Please look at the draft patch.
> If that actually fixes things, I'm okay with it --- do you find that it
> prevents the test failure even with client-side delays?

Yes, the connect/test5 test passes even with a 1 sec delay:

C:\src\postgresql\src\tools\msvc>vcregress ecpgcheck
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Build started 10/25/2020 7:58:25 AM.
...
============== running regression test queries        ==============
test compat_informix/dec_test        ... ok           27 ms
test compat_informix/charfuncs       ... ok           14 ms
test compat_informix/rfmtdate        ... ok           12 ms
test compat_informix/rfmtlong        ... ok           12 ms
test compat_informix/rnull           ... ok        23337 ms
test compat_informix/sqlda           ... ok        56734 ms
test compat_informix/describe        ... ok        35485 ms
test compat_informix/test_informix   ... ok        41592 ms
test compat_informix/test_informix2  ... ok        21336 ms
test compat_oracle/char_array        ... ok        39457 ms
test connect/test2                   ... ok        18274 ms
test connect/test3                   ... ok        15276 ms
test connect/test4                   ... ok         3073 ms
test connect/test5                   ... ok        45866 ms
test pgtypeslib/dt_test              ... ok        15200 ms
test pgtypeslib/dt_test2             ... ok           12 ms
test pgtypeslib/num_test             ... ok        11188 ms
...
test sql/prepareas                   ... ok       148876 ms
test thread/thread                   ... ok        90417 ms
test thread/thread_implicit          ... ok        91290 ms
test thread/prep                     ... ok       258438 ms
test thread/alloc                    ... ok       150070 ms
test thread/descriptor               ... ok           79 ms
============== shutting down postmaster               ==============
============== removing temporary instance            ==============

======================
 All 60 tests passed.
======================

And regarding the src/test/recovery test failure with a 10 ms delay added...
I've increased the log level and got probably the same behaviour as
described in
https://www.postgresql.org/message-id/flat/OFCA523F90.7499E22F-ON4325828E.005573C4-4325828E.0055B412%40iba.by

The primary node can't finish smart shutdown, and the logs of the primary
and the standby fill with the following messages:

2020-10-25 07:37:39.830 MSK [1065932] standby_1 DEBUG:  sending replication keepalive
2020-10-25 07:37:39.840 MSK [1065932] standby_1 DEBUG:  write 0/30000A0 flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.840277+03
2020-10-25 07:37:39.840 MSK [1065932] standby_1 DEBUG:  sending replication keepalive
2020-10-25 07:37:39.850 MSK [1065932] standby_1 DEBUG:  write 0/30000A0 flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.850389+03
2020-10-25 07:37:39.850 MSK [1065932] standby_1 DEBUG:  sending replication keepalive
2020-10-25 07:37:39.860 MSK [1065932] standby_1 DEBUG:  write 0/30000A0 flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.860486+03
...
2020-10-25 07:37:39.830 MSK [1065931] DEBUG:  sending write 0/30000A0 flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.840 MSK [1065931] DEBUG:  sendtime 2020-10-25 07:37:39.83022+03 receipttime 2020-10-25 07:37:39.840262+03 replication apply delay 0 ms transfer latency 10 ms
2020-10-25 07:37:39.840 MSK [1065931] DEBUG:  sending write 0/30000A0 flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.850 MSK [1065931] DEBUG:  sendtime 2020-10-25 07:37:39.840331+03 receipttime 2020-10-25 07:37:39.850364+03 replication apply delay 0 ms transfer latency 10 ms
2020-10-25 07:37:39.850 MSK [1065931] DEBUG:  sending write 0/30000A0 flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.860 MSK [1065931] DEBUG:  sendtime 2020-10-25 07:37:39.850438+03 receipttime 2020-10-25 07:37:39.860468+03 replication apply delay 0 ms transfer latency 10 ms
...

The patch proposed by Kyotaro Horiguchi doesn't help, though. If this looks
like an issue that should be fixed, I can investigate it further.

Best regards,
Alexander
At Sun, 25 Oct 2020 08:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in
> And regarding the src/test/recovery test failure with a 10 ms delay added...
> I've increased the log level and got probably the same behaviour as
> described in
> https://www.postgresql.org/message-id/flat/OFCA523F90.7499E22F-ON4325828E.005573C4-4325828E.0055B412%40iba.by
> The primary node can't finish smart shutdown, and the logs of the primary
> and the standby fill with the following messages:
...
> The patch proposed by Kyotaro Horiguchi doesn't help, though. If this looks
> like an issue that should be fixed, I can investigate it further.

Yeah, that patch just rejects starting from a wrong backup. If a specific
legit operation makes such a backup, it should be fixed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center