Hello hackers,
13.09.2020 21:37, Tom Lane wrote:
> I happened to try googling for other similar reports, and I found
> a very interesting recent thread here:
>
> https://github.com/nodejs/node/issues/33166
>
> It might not have the same underlying cause, of course, but it sure
> sounds familiar. If Node.js are really seeing the same effect,
> that would point to an underlying Windows bug rather than anything
> Postgres is doing wrong.
>
> It doesn't look like the Node.js crew got any closer to
> understanding the issue than we have, unfortunately. They made
> their problem mostly go away by reverting a seemingly-unrelated
> patch. But I can't help thinking that it's a timing-related bug,
> and that patch was just unlucky enough to change the timing of
> their tests so that they saw the failure frequently.
I've managed to make a simple reproducer. Please see the attached patch.
There are two things crucial for reproducing the bug:
ioctlsocket(sock, FIONBIO, &ioctlsocket_ret); // from pgwin32_socket()
and
WSACleanup();
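To show where those two calls sit relative to each other, here is a minimal sketch. It is not the attached reproducer: the helper name open_nonblocking_and_cleanup() is hypothetical, the non-Windows branch is there only so the snippet compiles elsewhere, and no connection is attempted, so it is not expected to trigger the failure by itself.

```c
#include <stdio.h>

#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#define BAD_SOCK INVALID_SOCKET
#else
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int sock_t;
#define BAD_SOCK (-1)
#endif

/*
 * Hypothetical helper (not from the patch): create a TCP socket, switch it
 * to non-blocking mode with FIONBIO, then close it and shut the socket
 * library down. Returns 0 on success, -1 on any error.
 */
static int
open_nonblocking_and_cleanup(void)
{
    sock_t      sock;
    int         rc;
#ifdef _WIN32
    WSADATA     wsa;
    unsigned long on = 1;       /* nonzero enables non-blocking mode */

    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return -1;
#else
    int         on = 1;
#endif

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == BAD_SOCK)
    {
#ifdef _WIN32
        WSACleanup();
#endif
        return -1;
    }

#ifdef _WIN32
    rc = ioctlsocket(sock, FIONBIO, &on);   /* first crucial call */
    closesocket(sock);
    WSACleanup();                           /* second crucial call */
#else
    rc = ioctl(sock, FIONBIO, &on);         /* POSIX stand-in, assumption */
    close(sock);
#endif

    return (rc == 0) ? 0 : -1;
}
```

In the real test a connection attempt would go between the FIONBIO call and the cleanup; that part is left out here because it depends on the patch.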
I still don't understand what influences the failure rate. With this
reproducer I get:
vcregress taptest src\test\modules\connect
...
t/000_connect.pl .. # test
#
t/000_connect.pl .. 13346/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 16714/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 26216/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 30077/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 36505/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 43647/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 53070/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 54402/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 55685/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 83193/100000
# Failed test at t/000_connect.pl line 24.
t/000_connect.pl .. 99992/100000 # Looks like you failed 10 tests of 100000.
t/000_connect.pl .. Dubious, test returned 10 (wstat 2560, 0xa00)
Failed 10/100000 subtests
But in our test farm the pgbench test (from the installcheck-world
suite, which we run using msys) fails on roughly every third run.
Perhaps it depends on I/O load. It seems that searching files or scanning
the disk in parallel increases the probability of the glitch.
I see no solution for this on the Postgres side for now, but this
information about Windows quirks could be useful in case someone
else stumbles upon it too.
Best regards,
Alexander