Thread: Sometimes the output to stdout on Windows disappears
Hi! Sometimes the output to stdout on Windows on multicore machines does not go through after connecting and disconnecting to the server using the PQconnectdbParams and PQfinish functions. I tested on 6 cores. We first noticed this in some pgbench tests. Then we found that this happens with any console program using PQconnectdbParams and PQfinish. I am attaching a patch and a script for testing. -- Victor Spirin Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
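For readers without the attachment, the reported scenario boils down to a console program along the following lines. This is an illustrative sketch only, not Victor's actual test program; the connection parameters are placeholders.

/*
 * Sketch of the reported scenario (illustrative, not the attached script):
 * write something to stdout, then open and close a libpq connection.  On
 * affected Windows machines the stdout output sometimes never appears.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	const char *const keywords[] = {"dbname", NULL};	/* placeholder parameters */
	const char *const values[] = {"postgres", NULL};
	PGconn	   *conn;

	printf("some output that may get lost\n");	/* buffered, not yet flushed */

	conn = PQconnectdbParams(keywords, values, 0);
	PQfinish(conn);				/* on Windows this currently ends in WSACleanup() */

	return 0;
}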
Victor Spirin <v.spirin@postgrespro.ru> writes: > Sometimes the output to stdout on Windows on multicore machines does not > go through after connecting and disconnecting to the server using the > PQconnectdbParams and PQfinish functions. I tested on 6 cores. Hm, why is this not Microsoft's bug to solve? I do wonder if this report is related to the intermittent ecpg failures we see on Windows machines, such as [1]. The details vary, but it's always a case of a .stdout file ending up empty when it should not be. I'd supposed though that it must be something specific to ecpg, since we never see anything like that anywhere but the ecpg tests. Even if you posit that libpq is doing something that somehow compromises stdio, that should affect psql-based tests too. > I am attaching a patch and a script for testing. > [ forced fflush in every snprintf call ] My goodness, that's a large hammer you're swinging. What effects has this kluge got on performance? While I think you may be on to something, this seems like a truly horrid way to "fix" it. We need to dig down further and understand what is actually happening. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-08-13%2022%3A15%3A05
I only found that calling WSACleanup instead of PQfinish has the same effect. I don't see any big performance issues due to the extra fflush call in this place. I will be glad to find a more elegant solution. Victor Spirin Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 31.08.2020 21:31, Tom Lane wrote: > Victor Spirin <v.spirin@postgrespro.ru> writes: >> Sometimes the output to stdout on Windows on multicore machines does not >> go through after connecting and disconnecting to the server using the >> PQconnectdbParams and PQfinish functions. I tested on 6 cores. > Hm, why is this not Microsoft's bug to solve? > > I do wonder if this report is related to the intermittent ecpg failures > we see on Windows machines, such as [1]. The details vary, but it's > always a case of a .stdout file ending up empty when it should not be. > I'd supposed though that it must be something specific to ecpg, since > we never see anything like that anywhere but the ecpg tests. Even if > you posit that libpq is doing something that somehow compromises stdio, > that should affect psql-based tests too. > >> I am attaching a patch and a script for testing. >> [ forced fflush in every snprintf call ] > My goodness, that's a large hammer you're swinging. What effects has this > kluge got on performance? > > While I think you may be on to something, this seems like a truly horrid > way to "fix" it. We need to dig down further and understand what is > actually happening. > > regards, tom lane > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-08-13%2022%3A15%3A05 > >
I happened to try googling for other similar reports, and I found a very interesting recent thread here: https://github.com/nodejs/node/issues/33166 It might not have the same underlying cause, of course, but it sure sounds familiar. If Node.js are really seeing the same effect, that would point to an underlying Windows bug rather than anything Postgres is doing wrong. It doesn't look like the Node.js crew got any closer to understanding the issue than we have, unfortunately. They made their problem mostly go away by reverting a seemingly-unrelated patch. But I can't help thinking that it's a timing-related bug, and that patch was just unlucky enough to change the timing of their tests so that they saw the failure frequently. regards, tom lane
Hello hackers, 13.09.2020 21:37, Tom Lane wrote: > I happened to try googling for other similar reports, and I found > a very interesting recent thread here: > > https://github.com/nodejs/node/issues/33166 > > It might not have the same underlying cause, of course, but it sure > sounds familiar. If Node.js are really seeing the same effect, > that would point to an underlying Windows bug rather than anything > Postgres is doing wrong. > > It doesn't look like the Node.js crew got any closer to > understanding the issue than we have, unfortunately. They made > their problem mostly go away by reverting a seemingly-unrelated > patch. But I can't help thinking that it's a timing-related bug, > and that patch was just unlucky enough to change the timing of > their tests so that they saw the failure frequently.

I've managed to make a simple reproducer. Please look at the patch attached. There are two things crucial for reproducing the bug:

ioctlsocket(sock, FIONBIO, &ioctlsocket_ret); // from pgwin32_socket()

and

WSACleanup();

I still can't pin down exactly what triggers the effect. With this reproducer I get:

vcregress taptest src\test\modules\connect
...
t/000_connect.pl .. # test #
t/000_connect.pl .. 13346/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 16714/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 26216/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 30077/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 36505/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 43647/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 53070/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 54402/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 55685/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 83193/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 99992/100000
# Looks like you failed 10 tests of 100000.
t/000_connect.pl .. Dubious, test returned 10 (wstat 2560, 0xa00)
Failed 10/100000 subtests

But in our test farm the pgbench test (from the installcheck-world suite that we run using msys) can fail roughly on every third run. Perhaps it depends on I/O load: it seems that searching files/scanning the disk in parallel increases the probability of the glitch.

I see no solution for this on the postgres side for now, but this information about Windows quirks could be useful in case someone else stumbles upon it too.

Best regards, Alexander
Attachment
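Condensed, the crucial sequence described above looks roughly like the following sketch. A plain TCP socket stands in for libpq's here; the attached connect.c may differ in detail, and error handling is omitted.

/*
 * Sketch of the crucial sequence: buffered stdout output, a socket switched
 * to non-blocking mode with FIONBIO (as pgwin32_socket() does), then
 * WSACleanup().  Illustrative only; error handling omitted.
 */
#include <stdio.h>
#include <winsock2.h>

int
main(void)
{
	WSADATA		wsaData;
	SOCKET		sock;
	u_long		nonblocking = 1;

	WSAStartup(MAKEWORD(2, 2), &wsaData);

	printf("test\n");					/* buffered; not yet flushed */

	sock = socket(AF_INET, SOCK_STREAM, 0);
	ioctlsocket(sock, FIONBIO, &nonblocking);	/* from pgwin32_socket() */
	closesocket(sock);

	WSACleanup();						/* occasionally the buffered stdout output is lost here */

	return 0;
}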
Alexander Lakhin <exclusion@gmail.com> writes: > I've managed to make a simple reproducer. Please look at the patch attached. > There are two things crucial for reproducing the bug: > ioctlsocket(sock, FIONBIO, &ioctlsocket_ret); // from pgwin32_socket() > and > WSACleanup(); Oh, very interesting. Now that you have it somewhat in captivity, maybe you could determine some things:

1. Is it only stdout that's affected? What of other stdio streams? (Note that testing stderr might be tricky because it's probably line-buffered.)

2. Does an fflush() just before, or just after, WSACleanup() fix it?

> I see no solution for this on the postgres side for now, but this > information about Windows quirks could be useful in case someone > stumbled upon it too.

Depending on your answers to the above, maybe some hack like this would be acceptable:

 	free(conn);
 
 #ifdef WIN32
+	fflush(NULL);
 	WSACleanup();
 #endif
 }

It's not very nice for a library to be doing global things like that, but if the alternative is loss of output, maybe we should. But wait a minute: I just looked at Microsoft's docs [1] and found

    In a multithreaded environment, WSACleanup terminates Windows Sockets operations for all threads.

This makes me (a) wonder if that explains the side-effects on stdio, and (b) question why libpq is calling WSACleanup at all. What if we arranged to call WSAStartup just once, during the first libpq connection-open in a process, and then never did WSACleanup? Surely the OS can cope with that, and it eliminates any risk that WSACleanup breaks something. regards, tom lane [1] https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-wsacleanup
16.10.2020 19:18, Tom Lane wrote:
> Oh, very interesting. Now that you have it somewhat in captivity, maybe you could determine some things:
>
> 1. Is it only stdout that's affected? What of other stdio streams? (Note that testing stderr might be tricky because it's probably line-buffered.)

stderr is affected too. Just replacing stdout with stderr in connect.c and 000_connect.pl gives the same result.
Moreover, the following modification:
...
outfile = fopen("out", "w");
...
fprintf(stdout, "stdout test\n");
fprintf(stderr, "stderr test\n");
fprintf(outfile, "outfile test\n");
WSACleanup();
...
---
for (my $i = 0; $i < 100000; $i++) {
unlink('out');
IPC::Run::run(\@cmd, '>', \$stdout, '2>', \$stderr);
open(my $fh, '<', 'out') or die $!;
my $fileout = <$fh>;
ok(defined $fileout && $fileout ne '');
close($fh);
}
detects similar failures too. (On a failure, the out file exists but has zero size.)
"fflush(NULL);" just before or after WSACleanup() fixes things.2. Does an fflush() just before, or just after, WSACleanup() fix it?
I've managed to record in ProcMon the activity log for a failed run (alongside normal runs). Excerpts from the log are attached. As we can see, the failed process doesn't even try to write into IPC-Run's temp file.
> Depending on your answers to the above, maybe some hack like this would be acceptable:
>
>  	free(conn);
>
>  #ifdef WIN32
> +	fflush(NULL);
>  	WSACleanup();
>  #endif
>  }
>
> It's not very nice for a library to be doing global things like that, but if the alternative is loss of output, maybe we should.

But now we see that the WSACleanup call is a global thing by itself.
> But wait a minute: I just looked at Microsoft's docs [1] and found
>
>     In a multithreaded environment, WSACleanup terminates Windows Sockets operations for all threads.
>
> This makes me (a) wonder if that explains the side-effects on stdio, and (b) question why libpq is calling WSACleanup at all.
> What if we arranged to call WSAStartup just once, during the first libpq connection-open in a process, and then never did WSACleanup? Surely the OS can cope with that, and it eliminates any risk that WSACleanup breaks something.

What bothers me is:
    There must be a call to WSACleanup for each successful call to WSAStartup. Only the final WSACleanup function call performs the actual cleanup. The preceding calls simply decrement an internal reference count in the WS2_32.DLL.

So third-party application developers should understand that when using libpq they would have to call WSACleanup one more time to perform "the actual cleanup". (And thus WSAStartup is kind of like a global thing too.)
But maybe that's way better than having a confirmed risk of losing data.
Best regards,
Alexander
Attachment
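To illustrate the reference counting quoted above: with the proposed change, the WSAStartup() that libpq issues would never be balanced inside libpq, so an application that really wanted the final cleanup would have to issue it itself. This is a hypothetical sketch, not part of the attached patch or a recommendation from the thread.

/*
 * Hypothetical illustration of the WS2_32.DLL reference counting only.
 * With libpq never calling WSACleanup(), an application could balance the
 * count itself -- but doing so would break any later libpq connections in
 * the same process, so this is not a recommended pattern.
 */
#include <winsock2.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* libpq: WSAStartup(), refcount -> 1 */

	/* ... use the connection ... */
	PQfinish(conn);				/* with the patch: no WSACleanup(), refcount stays 1 */

	WSACleanup();				/* application's extra call: refcount -> 0, actual cleanup */
	return 0;
}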
On 10/16/20 12:18 PM, Tom Lane wrote: > > But wait a minute: I just looked at Microsoft's docs [1] and found > > In a multithreaded environment, WSACleanup terminates Windows Sockets > operations for all threads. > > This makes me (a) wonder if that explains the side-effects on stdio, > and (b) question why libpq is calling WSACleanup at all. > What if we arranged to call WSAStartup just once, during the first > libpq connection-open in a process, and then never did WSACleanup? > Surely the OS can cope with that, and it eliminates any risk that > WSACleanup breaks something. > > regards, tom lane > > [1] https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-wsacleanup > > This could explain random transient stdout/stderr failures we have seen over the years. I think we should at least give your suggestion a try - this is a good time in the dev cycle for such experiments. cheers andrew
Alexander Lakhin <exclusion@gmail.com> writes: > 16.10.2020 19:18, Tom Lane wrote: >> What if we arranged to call WSAStartup just once, during the first >> libpq connection-open in a process, and then never did WSACleanup? >> Surely the OS can cope with that, and it eliminates any risk that >> WSACleanup breaks something. > What bothers me is: > There must be a call to *WSACleanup* for each successful call to > WSAStartup > <https://docs.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-wsastartup>. Yeah, that is a very odd statement. Surely, the Windows kernel manages to cope if a program crashes without having done that. So what exactly is the downside of intentionally not doing it? There's no reason to care if the Winsock DLL stays loaded until program exit rather than getting unloaded a bit earlier. (If anything, the current code causes an unload/reload cycle for each connection when the application makes a series of PG connections; who could think that's a great idea?) regards, tom lane
I wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> What bothers me is: >> There must be a call to *WSACleanup* for each successful call to >> WSAStartup >> <https://docs.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-wsastartup>. > Yeah, that is a very odd statement. Surely, the Windows kernel manages > to cope if a program crashes without having done that. So what exactly > is the downside of intentionally not doing it? A bit of grepping showed me that the backend, initdb, and pg_regress all call WSAStartup without ever doing WSACleanup, and we've seen no ill effects from that. So it seems clear that this documentation can safely be ignored. I propose the attached patch. If this doesn't cause buildfarm problems, perhaps we should back-patch it. BTW, I notice that libpq is asking WSAStartup for Winsock version 1.1, which is remarkably ancient. Almost everyplace else is asking for version 2.2, which has been current for a decade or two. Shouldn't we update that? (It occurs to me to wonder if this in itself is some kind of problem; I wonder how well Winsock works when there are requests for different API versions in the same program.) regards, tom lane

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 3315f1dd05..de60281fcb 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -91,21 +91,6 @@
      </para>
     </warning>
 
-   <note>
-    <para>
-     On Windows, there is a way to improve performance if a single
-     database connection is repeatedly started and shutdown. Internally,
-     libpq calls <function>WSAStartup()</function> and <function>WSACleanup()</function> for connection startup
-     and shutdown, respectively. <function>WSAStartup()</function> increments an internal
-     Windows library reference count which is decremented by <function>WSACleanup()</function>.
-     When the reference count is just one, calling <function>WSACleanup()</function> frees
-     all resources and all DLLs are unloaded. This is an expensive
-     operation. To avoid this, an application can manually call
-     <function>WSAStartup()</function> so resources will not be freed when the last database
-     connection is closed.
-    </para>
-   </note>
-
    <variablelist>
     <varlistentry id="libpq-PQconnectdbParams">
      <term><function>PQconnectdbParams</function><indexterm><primary>PQconnectdbParams</primary></indexterm></term>
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index a967e11378..b51cc76c7d 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -229,19 +229,6 @@ static char *readMessageFromPipe(int fd);
 	(strncmp(msg, prefix, strlen(prefix)) == 0)
 
 
-/*
- * Shutdown callback to clean up socket access
- */
-#ifdef WIN32
-static void
-shutdown_parallel_dump_utils(int code, void *unused)
-{
-	/* Call the cleanup function only from the main thread */
-	if (mainThreadId == GetCurrentThreadId())
-		WSACleanup();
-}
-#endif
-
 /*
  * Initialize parallel dump support --- should be called early in process
  * startup. (Currently, this is called whether or not we intend parallel
@@ -267,8 +254,7 @@ init_parallel_dump_utils(void)
 			pg_log_error("WSAStartup failed: %d", err);
 			exit_nicely(1);
 		}
-		/* ... and arrange to shut it down at exit */
-		on_exit_nicely(shutdown_parallel_dump_utils, NULL);
+
 		parallel_init_done = true;
 	}
 #endif
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index af27fee6b5..704c9e2f79 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3871,23 +3871,30 @@ makeEmptyPGconn(void)
 
 #ifdef WIN32
 	/*
-	 * Make sure socket support is up and running.
+	 * Make sure socket support is up and running in this process.
+	 *
+	 * Note: the Windows documentation says that we should eventually do a
+	 * matching WSACleanup() call, but experience suggests that that is at
+	 * least as likely to cause problems as fix them. So we don't.
 	 */
-	WSADATA		wsaData;
+	static bool wsastartup_done = false;
 
-	if (WSAStartup(MAKEWORD(1, 1), &wsaData))
-		return NULL;
+	if (!wsastartup_done)
+	{
+		WSADATA		wsaData;
+
+		if (WSAStartup(MAKEWORD(1, 1), &wsaData) != 0)
+			return NULL;
+		wsastartup_done = true;
+	}
+
+	/* Forget any earlier error */
 	WSASetLastError(0);
-#endif
+#endif							/* WIN32 */
 
 	conn = (PGconn *) malloc(sizeof(PGconn));
 	if (conn == NULL)
-	{
-#ifdef WIN32
-		WSACleanup();
-#endif
 		return conn;
-	}
 
 	/* Zero all pointers and booleans */
 	MemSet(conn, 0, sizeof(PGconn));
@@ -4080,10 +4087,6 @@ freePGconn(PGconn *conn)
 	termPQExpBuffer(&conn->workBuffer);
 
 	free(conn);
-
-#ifdef WIN32
-	WSACleanup();
-#endif
 }
 
 /*
17.10.2020 21:44, Tom Lane wrote: > I propose the attached patch. If this doesn't cause buildfarm problems, > perhaps we should back-patch it. Thank you! I've made a simple cmd script to reproduce problems seen on dory: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dory&br=HEAD

FOR /L %%I IN (1,1,200) DO call :CHECK %%I
GOTO :eof
:CHECK
echo iteration %1
call vcregress ecpgcheck
IF %ERRORLEVEL% NEQ 0 GOTO ERR
EXIT /B
:ERR
echo iteration %1 failed
pause

Without the fix I got errors on iterations 43, 46, 128, 47, 14, 4, 27, which approximately corresponds to the ECPG-Check failure frequency on dory (for HEAD). With the fix all 200 iterations passed as expected. Then I ran the loop again just to be sure and got:

test thread/descriptor ... stderr FAILED 81 ms
iteration 124 failed.

diff -w -U3 .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr .../src/interfaces/ecpg/test/results/thread-descriptor.stderr
--- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr	2019-12-04 16:05:46 +0300
+++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr	2020-10-18 20:20:27 +0300
@@ -0,0 +1 @@
+SQL error: descriptor "mydesc" not found on line 31

It's interesting that all failures before the fix were with stdout, but this one is with stderr. I'm going to investigate this issue further.

Best regards, Alexander
Alexander Lakhin <exclusion@gmail.com> writes: > With the fix all the 200 iterations passed as expected. > Then I ran the loop again just to be sure and got: > test thread/descriptor ... stderr FAILED 81 ms > iteration 124 failed. Sigh ... still, this:

> diff -w -U3
> .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr
> .../src/interfaces/ecpg/test/results/thread-descriptor.stderr
> --- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr
> 2019-12-04 16:05:46 +0300
> +++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr
> 2020-10-18 20:20:27 +0300
> @@ -0,0 +1 @@
> +SQL error: descriptor "mydesc" not found on line 31

does not look like the same kind of failure as what we've been dealing with up to now. So maybe what we've got is that we fixed the stdio loss problem, and now the error rate is down to the point where we can notice other, even-lower-probability issues. regards, tom lane
I wrote: >> diff -w -U3 >> .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr >> .../src/interfaces/ecpg/test/results/thread-descriptor.stderr >> --- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr >> 2019-12-04 16:05:46 +0300 >> +++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr >> 2020-10-18 20:20:27 +0300 >> @@ -0,0 +1 @@ >> +SQL error: descriptor "mydesc" not found on line 31 > does not look like the same kind of failure as what we've been dealing > with up to now. So maybe what we've got is that we fixed the stdio > loss problem, and now the error rate is down to the point where we can > notice other, even-lower-probability issues. Yeah, I think so. I grepped the buildfarm logs for similar failures and found three occurrences:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-02-03%2018%3A36%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-01-17%2014%3A30%3A07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2020-01-02%2018%3A03%3A52

All of these are in the thread/descriptor test, failing at the deallocate step in

	for (i = 1; i <= REPEATS; ++i)
	{
		EXEC SQL ALLOCATE DESCRIPTOR mydesc;
		EXEC SQL DEALLOCATE DESCRIPTOR mydesc;
	}

where the test is running several of these in different threads. I wonder whether there's some missing thread-locking in the ECPG descriptor support. It is odd though that we have seen this only on Windows members. Low-probability or not, you'd think we'd have some similar reports from non-Windows critters if it were possible. regards, tom lane
18.10.2020 21:04, Tom Lane wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> @@ -0,0 +1 @@ >> +SQL error: descriptor "mydesc" not found on line 31 > does not look like the same kind of failure as what we've been dealing > with up to now. So maybe what we've got is that we fixed the stdio > loss problem, and now the error rate is down to the point where we can > notice other, even-lower-probability issues. Yes, in this case stderr is not missing (it's present with the error), so it's really a different case. As is another one:

test connect/test5 ... stderr FAILED 238 ms

diff -w -U3 .../src/interfaces/ecpg/test/expected/connect-test5.stderr .../src/interfaces/ecpg/test/results/connect-test5.stderr
--- .../src/interfaces/ecpg/test/expected/connect-test5.stderr	2020-10-13 21:51:14 +0300
+++ .../src/interfaces/ecpg/test/results/connect-test5.stderr	2020-10-18 20:59:46 +0300
@@ -73,7 +73,9 @@
 [NO_PID]: sqlca: code: -220, state: 08003
 [NO_PID]: ECPGconnect: opening database <DEFAULT> on <DEFAULT> port <DEFAULT> for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: could not open database: FATAL: database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.

and the server.log:

2020-10-18 20:59:45.731 MSK client backend[1380] ecpg/connect-test4 LOG: could not receive data from client: An existing connection was forcibly closed by the remote host.
2020-10-18 20:59:45.898 MSK client backend[2884] [unknown] FATAL: database "regress_ecpg_user2" does not exist
2020-10-18 20:59:45.992 MSK client backend[1640] [unknown] FATAL: database "regress_ecpg_user2" does not exist

I just wanted to note that the ECPG test failures can still appear in the buildfarm, unfortunately.

Best regards, Alexander
Alexander Lakhin <exclusion@gmail.com> writes: > I've made a simple cmd script to reproduce problems seen on dory: > ... > Without the fix I've got errors on iterations 43, 46, 128, 47, 14, 4, > 27, which approximately corresponds to the ECPG-Check failure frequency > on dory (for HEAD). > With the fix all the 200 iterations passed as expected. > Then I ran the loop again just to be sure and got: > test thread/descriptor ... stderr FAILED 81 ms > iteration 124 failed. I had been thinking we'd have to wait a month or two for the buildfarm to accumulate enough runs to be confident in whether the WSACleanup removal fixes the ecpg failures. However, now that you did this experiment, I think we have enough evidence already that it fixes it (or at least makes things an order of magnitude better). So now I'm inclined to not wait, but go ahead and backpatch 7d00a6b2d now. There's still enough time before the November releases that we can expect that any nasty problems will show up in the buildfarm before we ship. regards, tom lane
Alexander Lakhin <exclusion@gmail.com> writes: > I just wanted to inform that the ECPG-test failures can still persist in > the buildfarm, unfortunately. Right, but at least now we can see that there are other issues to investigate. Personally I stopped paying any attention to buildfarm ECPG failures on Windows some time ago, figuring they were all the mysterious stdout-truncation problem. With that gone there'll be less noise and more signal in the buildfarm results. regards, tom lane