Thread: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16678
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.0
Operating system:   Windows
Description:

When executing `call vcregress ecpgcheck` (under a heavy load), I get
connect/test5 failures sometimes:
test connect/test4                ... ok           67 ms
test connect/test5                ... stderr FAILED      945 ms
test pgtypeslib/dt_test           ... ok          271 ms

regression.diffs contains:
diff -w -U3 .../src/interfaces/ecpg/test/expected/connect-test5.stderr
.../src/interfaces/ecpg/test/results/connect-test5.stderr
--- .../src/interfaces/ecpg/test/expected/connect-test5.stderr    2020-10-13
21:51:14 +0300
+++ .../src/interfaces/ecpg/test/results/connect-test5.stderr    2020-10-19
19:31:41 +0300
@@ -36,7 +36,9 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port
<DEFAULT>  for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL:  database
"regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: server closed the
connection unexpectedly
+    This probably means the server terminated abnormally
+    before or while processing the request.
 
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed

postmaster.log contains:
...
2020-10-19 19:31:40.150 MSK client backend[928] ecpg/connect-test4 LOG:
could not receive data from client: An existing connection was forcibly
closed by the remote host.
    
2020-10-19 19:31:40.677 MSK client backend[2032] [unknown] FATAL:  database
"regress_ecpg_user2" does not exist
2020-10-19 19:31:41.052 MSK client backend[2532] [unknown] FATAL:  database
"regress_ecpg_user2" does not exist
2020-10-19 19:31:49.834 MSK client backend[2544] ecpg/preproc-whenever
ERROR:  relation "nonexistent" does not exist at character 15
...
So the server didn't crash, it logged the expected message correctly.

The following delay added to pqReadData (src/interfaces/libpq/fe-misc.c):
@@ -662,6 +662,8 @@ pqReadData(PGconn *conn)
                }
        }

+       pg_usleep(10000L); /* wait for 10 msec */
+
        /* OK, try to read some data */
 retry3:
        nread = pqsecure_read(conn, conn->inBuffer + conn->inEnd,

Makes the failure reproduce reliably (on each run) for me.
Also this affects the pg_dump test:
vcregress taptest src/bin/pg_dump
...
t/001_basic.pl ................ ok
t/002_pg_dump.pl .............. 1/6315
#   Failed test 'connecting to a non-existent database: matches'
#   at t/002_pg_dump.pl line 3460.
#                   'pg_dump: error: connection to database "qqq" failed:
server
 closed the connection unexpectedly
#       This probably means the server terminated abnormally
#       before or while processing the request.
# '
#     doesn't match '(?^:pg_dump\:\ error\:\ connection\ to\ database\
\"qqq\"\
failed\:\ FATAL\:\ \ database\ \"qqq\"\ does\ not\ exist)'
t/002_pg_dump.pl .............. 6113/6315 # Looks like you failed 1 test of
6315
.
t/002_pg_dump.pl .............. Dubious, test returned 1 (wstat 256,
0x100)
Failed 1/6315 subtests
t/003_pg_dump_with_server.pl .. ok
...

On Linux the same delay (and even 100 msec) doesn't cause these test
failures.
It seems that on Windows the postmaster process can close a connection
before slow client receives a FATAL message.


Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> It seems that on Windows the postmaster process can close a connection
> before slow client receives a FATAL message.

Yeah, we've come to that conclusion before, see for instance

https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9%40gemulon.postgresql.org

I haven't heard any non-unpleasant ideas for working around it.

One thought that might or might not improve matters is for
the backend to explicitly close() the socket before exiting.
We intentionally don't do that (cf 268313a95), but maybe
doing it on Windows would be better than the existing issue
--- assuming it fixes the issue, that is.

            regards, tom lane



Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Alexander Lakhin
Date:
19.10.2020 22:18, Tom Lane wrote:
> Yeah, we've come to that conclusion before, see for instance
> https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9%40gemulon.postgresql.org
>
> I haven't heard any non-unpleasant ideas for working around it.
>
> One thought that might or might not improve matters is for
> the backend to explicitly close() the socket before exiting.
> We intentionally don't do that (cf 268313a95), but maybe
> doing it on Windows would be better than the existing issue
> --- assuming it fixes the issue, that is.
Yes, close() fixes the issue for me, but I'm afraid that it maybe not
100% reliable (and Amit Kapila in [1] talked about it in the
aforementioned discussion).
Microsoft suggests using shutdown() to assure that all data is sent.
https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-shutdown
And the same is said in not so new, but may be still relevant article:
https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

So I've also tested on Windows the following version:
        secure_close(MyProcPort);
        shutdown(MyProcPort->sock, SD_SEND);
        for(;;) {
            char buffer[1000];
            int res = recv(MyProcPort->sock, buffer, 1000, 0);
            if (res <= 0)
                break;
        }
        closesocket(MyProcPort->sock);
And it works too. On my Ubuntu localhost I can't reproduce losing a
message (up to 1MB) on FATAL exit, so probably in 2020 such a socket
cherishing is needed only for Windows. But I allow that with with some
other systems and/or a real network the data sent just before exit can
be lost.

Best regards,
Alexander



Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> So I've also tested on Windows the following version:
>         secure_close(MyProcPort);
>         shutdown(MyProcPort->sock, SD_SEND);
>         for(;;) {
>             char buffer[1000];
>             int res = recv(MyProcPort->sock, buffer, 1000, 0);
>             if (res <= 0)
>                 break;
>         }
>         closesocket(MyProcPort->sock);
> And it works too.

I'm afraid this cure is probably worse than the disease, because
now the backend's exiting is held hostage by whether the client
closes its socket (and the resulting FIN reaches us, which it
might not if there's a network problem).

It might be safe to do the shutdown() only, since Microsoft
specifies that that doesn't block, but I have no idea if that
would improve matters.

            regards, tom lane



Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Alexander Lakhin
Date:
24.10.2020 20:39, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> So I've also tested on Windows the following version:
>>         secure_close(MyProcPort);
>>         shutdown(MyProcPort->sock, SD_SEND);
>>         for(;;) {
>>             char buffer[1000];
>>             int res = recv(MyProcPort->sock, buffer, 1000, 0);
>>             if (res <= 0)
>>                 break;
>>         }
>>         closesocket(MyProcPort->sock);
>> And it works too.
> I'm afraid this cure is probably worse than the disease, because
> now the backend's exiting is held hostage by whether the client
> closes its socket (and the resulting FIN reaches us, which it
> might not if there's a network problem).
On a next level of this game we could make something like that:
        secure_close(MyProcPort);
        timeout = 1000; // 1 sec
        setsockopt(MyProcPort->sock, SOL_SOCKET, SO_RCVTIMEO,
(char*)&timeout, sizeof(timeout));
        shutdown(MyProcPort->sock, SD_SEND);
        for(;;) {
            char buffer[1000];
            int res = recv(MyProcPort->sock, buffer, 1000, 0);
            if (res <= 0) {
                break;
            }
        }
        closesocket(MyProcPort->sock);
(I can't get this timeout working yet.)

But I'm inclined to stay on the previous level with "shutdown &&
closesocket" as recommended for server side:
https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
Please look at the draft patch.

By the way, the delay added to pqReadData():
    pg_usleep(10000L); // Wait for 10 ms.
makes the src/test/recovery test fail on Linux too (pg_ctl can't stop
the primary node).
So it seems that slow reading from a socket breaks things not only on
Windows and the fix should be more comprehensive.

Best regards,
Alexander

Attachment

Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> But I'm inclined to stay on the previous level with "shutdown &&
> closesocket" as recommended for server side:
> https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
> Please look at the draft patch.

If that actually fixes things, I'm okay with it --- do you find
that it prevents the test failure even with client-side delays?

            regards, tom lane



Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Alexander Lakhin
Date:
25.10.2020 00:07, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> But I'm inclined to stay on the previous level with "shutdown &&
>> closesocket" as recommended for server side:
>> https://docs.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
>> Please look at the draft patch.
> If that actually fixes things, I'm okay with it --- do you find
> that it prevents the test failure even with client-side delays?
Yes, the connect/test5 test passes even with 1 sec delay:
C:\src\postgresql\src\tools\msvc>vcregress ecpgcheck
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Build started 10/25/2020 7:58:25 AM.
...
============== running regression test queries        ==============
test compat_informix/dec_test     ... ok           27 ms
test compat_informix/charfuncs    ... ok           14 ms
test compat_informix/rfmtdate     ... ok           12 ms
test compat_informix/rfmtlong     ... ok           12 ms
test compat_informix/rnull        ... ok        23337 ms
test compat_informix/sqlda        ... ok        56734 ms
test compat_informix/describe     ... ok        35485 ms
test compat_informix/test_informix ... ok        41592 ms
test compat_informix/test_informix2 ... ok        21336 ms
test compat_oracle/char_array     ... ok        39457 ms
test connect/test2                ... ok        18274 ms
test connect/test3                ... ok        15276 ms
test connect/test4                ... ok         3073 ms
test connect/test5                ... ok        45866 ms
test pgtypeslib/dt_test           ... ok        15200 ms
test pgtypeslib/dt_test2          ... ok           12 ms
test pgtypeslib/num_test          ... ok        11188 ms
...
test sql/prepareas                ... ok       148876 ms
test thread/thread                ... ok        90417 ms
test thread/thread_implicit       ... ok        91290 ms
test thread/prep                  ... ok       258438 ms
test thread/alloc                 ... ok       150070 ms
test thread/descriptor            ... ok           79 ms
============== shutting down postmaster               ==============
============== removing temporary instance            ==============

======================
 All 60 tests passed.
======================

And regarding the src/test/recovery test failure with a 10 ms delay
added... I've increased a log level and get probably the same behaviour
as described in
https://www.postgresql.org/message-id/flat/OFCA523F90.7499E22F-ON4325828E.005573C4-4325828E.0055B412%40iba.by
The primary node can't finish smart shutdown and logs of the primary and
the standby are filling with the following messages:
2020-10-25 07:37:39.830 MSK [1065932] standby_1 DEBUG:  sending
replication keepalive
2020-10-25 07:37:39.840 MSK [1065932] standby_1 DEBUG:  write 0/30000A0
flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.840277+03
2020-10-25 07:37:39.840 MSK [1065932] standby_1 DEBUG:  sending
replication keepalive
2020-10-25 07:37:39.850 MSK [1065932] standby_1 DEBUG:  write 0/30000A0
flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.850389+03
2020-10-25 07:37:39.850 MSK [1065932] standby_1 DEBUG:  sending
replication keepalive
2020-10-25 07:37:39.860 MSK [1065932] standby_1 DEBUG:  write 0/30000A0
flush 0/3000000 apply 0/3000000 reply_time 2020-10-25 07:37:39.860486+03
...

2020-10-25 07:37:39.830 MSK [1065931] DEBUG:  sending write 0/30000A0
flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.840 MSK [1065931] DEBUG:  sendtime 2020-10-25
07:37:39.83022+03 receipttime 2020-10-25 07:37:39.840262+03 replication
apply delay 0 ms transfer latency 10 ms
2020-10-25 07:37:39.840 MSK [1065931] DEBUG:  sending write 0/30000A0
flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.850 MSK [1065931] DEBUG:  sendtime 2020-10-25
07:37:39.840331+03 receipttime 2020-10-25 07:37:39.850364+03 replication
apply delay 0 ms transfer latency 10 ms
2020-10-25 07:37:39.850 MSK [1065931] DEBUG:  sending write 0/30000A0
flush 0/3000000 apply 0/3000000
2020-10-25 07:37:39.860 MSK [1065931] DEBUG:  sendtime 2020-10-25
07:37:39.850438+03 receipttime 2020-10-25 07:37:39.860468+03 replication
apply delay 0 ms transfer latency 10 ms
...
 
Though the patch proposed by Kyotaro Horiguchi doesn't help. It this
looks like an issue that should be fixed, I can investigate it further.

Best regards,
Alexander




Re: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows

From
Kyotaro Horiguchi
Date:
At Sun, 25 Oct 2020 08:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in
> And regarding the src/test/recovery test failure with a 10 ms delay
> added... I've increased a log level and get probably the same behaviour
> as described in
> https://www.postgresql.org/message-id/flat/OFCA523F90.7499E22F-ON4325828E.005573C4-4325828E.0055B412%40iba.by
> The primary node can't finish smart shutdown and logs of the primary and
> the standby are filling with the following messages:
...
>  
> Though the patch proposed by Kyotaro Horiguchi doesn't help. It this
> looks like an issue that should be fixed, I can investigate it further.

Yeah, that patch just rejects to start from a wrong backup. If a
specific legit operation makes such a backup, it should be fixed.

regareds.

--
Kyotaro Horiguchi
NTT Open Source Software Center