Windows socket problems, interesting connection to AIO - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Windows socket problems, interesting connection to AIO
Msg-id: CA+hUKGLR10ZqRCvdoRrkQusq75wF5=vEetRSs2_u1s+FAUosFQ@mail.gmail.com
List: pgsql-hackers
There's a category[1] of random build farm/CI failures where Windows behaves differently and our stuff breaks, which also affects end users. A recent BF failure[2] that looks like one of those jangled my nerves when I pushed a commit, so I looked into a new theory about how to fix it. First, let me restate my understanding of the two categories of known message loss on Windows, since the information is scattered far and wide across many threads:

1. When a backend exits without closing the socket gracefully, a Windows server's network stack may fail to send data that it had buffered but not yet physically sent[4]. That was briefly fixed[3], but later reverted because it broke something else. The reason we reverted it and went back to abortive socket shutdown (ie just exit()) is that our WL_SOCKET_READABLE was buggy: it could miss FD_CLOSE events from graceful shutdowns, though not from abortive ones (which keep reporting themselves repeatedly, apparently something to do with being in an error state). So a libpq socket that we were waiting for with WaitLatchOrSocket() on the client end could sometimes hang forever; concretely, a replication connection or postgres_fdw running inside another PostgreSQL server. We have since fixed that event loss, albeit in a gross kludgy way[5], because other ideas seemed too complicated (to wit, various ways to manage extra state associated with each socket, which would be really hard to retrofit in a satisfying way). Graceful shutdown should fix the race cases where the next thing the client calls is recv(), as far as I know.

2. If a Windows client tries to send() and gets an ECONNRESET/EPIPE error, the network stack seems to drop data it has already received, so a following recv() will never see it. In other words, whether you are affected depends on whether the application-level protocol is strictly request/response based, or has sequence points at which both ends might send(). AFAIK the main consequence for real users is that FATAL recovery conflict, idle termination, etc. messages are not delivered to clients, leaving just "server closed the connection unexpectedly".

I have wondered whether it might help to kludgify the Windows TCP code even more by doing an extra poll() for POLLIN before every single send(): "Hey network stack, before I try to send this message, is there anything the server wanted to tell me?". But I guess that must be racy, because the goodbye message could arrive between the poll() and the send(). Annoyingly, I suspect it would *mostly* work.

The new thought I had about the second category of problem is: if you use asynchronous networking APIs, the kernel *can't* throw your data out, because it doesn't even have it. If the server's FATAL message arrives before the client calls send(), the data has already been written to user space memory and the I/O is marked as complete. If it arrives after, there's no issue, because computers can't see into the future yet. That's my hypothesis, anyway.

To try that out, I started with a very simple program[6] on my local FreeBSD system that does a failing send and then tries both synchronous and asynchronous recv():

=== synchronous ===
send -> -1, error = 32
recv -> "FATAL: flux capacitor failed", error = 0
=== posix aio ===
send -> -1, error = 32
async recv -> "FATAL: flux capacitor failed", error = 0

... and then googled enough Windows-fu to translate it and run it on CI, and saw the known category 2 failure with the plain old synchronous version.
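To make the sequence concrete, here is a rough Winsock sketch of the synchronous case. This is not the actual test program from [6]; the port number (5555) and the assumption of a hypothetical local server that writes a FATAL line and then aborts the connection are purely for illustration.

/*
 * Rough sketch only, not the test program from [6].  Assumes a hypothetical
 * local server on port 5555 that sends "FATAL: flux capacitor failed" and
 * then aborts (resets) the connection before our send().  Link with
 * ws2_32.lib.
 */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <windows.h>
#include <stdio.h>

#pragma comment(lib, "ws2_32.lib")

int
main(void)
{
	WSADATA		wsa;
	SOCKET		s;
	struct sockaddr_in addr = {0};
	char		buf[1024];
	int			rc;

	WSAStartup(MAKEWORD(2, 2), &wsa);
	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	addr.sin_family = AF_INET;
	addr.sin_port = htons(5555);
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	connect(s, (struct sockaddr *) &addr, sizeof(addr));

	Sleep(1000);				/* let the FATAL message and the RST arrive */

	/* send() may succeed or fail with WSAECONNRESET, depending on timing... */
	rc = send(s, "hello, world!\n", 14, 0);
	printf("send -> %d, error = %d\n", rc, rc < 0 ? WSAGetLastError() : 0);

	/*
	 * ...but either way, the already-received FATAL message tends to be
	 * gone by now: recv() reports WSAECONNRESET (10054) instead of
	 * returning the buffered data.
	 */
	rc = recv(s, buf, (int) sizeof(buf) - 1, 0);
	if (rc > 0)
		buf[rc] = '\0';
	printf("recv -> \"%s\", error = %d\n",
		   rc > 0 ? buf : "", rc < 0 ? WSAGetLastError() : 0);

	closesocket(s);
	WSACleanup();
	return 0;
}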
The good news is that the async version sees the goodbye message:

=== synchronous ===
send -> 14, error = 0
recv -> "", error = 10054
=== windows overlapped ===
send -> 14, error = 0
async recv -> "FATAL: flux capacitor failed", error = 0

That's not the same as a torture test for weird timings, and I have zero knowledge of the implementation of this stuff, but I currently can't imagine how it could be implemented in any way that would give a different answer. Perhaps we could figure out a way to simulate synchronous recv() on top of that overlapped API (a rough sketch of that idea follows the links below), but I think a more satisfying use of our time and energy would be to redesign all our networking code to do cross-platform AIO. I think that will mostly come down to a bunch of network buffer management redesign work. Anyway, I don't have anything concrete there; I just wanted to share this observation.

[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-08-31%2007%3A54%3A58
[3] https://github.com/postgres/postgres/commit/6051857fc953a62db318329c4ceec5f9668fd42a
[4] https://learn.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
[5] https://github.com/postgres/postgres/commit/a8458f508a7a441242e148f008293128676df003
[6] https://github.com/macdice/hello-windows/blob/socket-hacking/test.c
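For what it's worth, here is a hypothetical sketch of what simulating a synchronous recv() on top of overlapped I/O could look like. It is not taken from [6] and is not proposed PostgreSQL code; the helper name and the assumption that the socket was created overlapped-capable (the default for socket()) are mine. The point is only the ordering: the receive buffer is handed to the kernel with WSARecv() before the risky send(), and WSAGetOverlappedResult(..., TRUE, ...) then blocks like a plain recv() would, so a goodbye message that has already arrived is sitting in user memory and can't be discarded when the reset is processed.

/*
 * Hypothetical sketch (not from [6], not proposed PostgreSQL code).
 * Assumes s was created overlapped-capable, which is the default for
 * sockets returned by socket()/WSASocket().
 */
#include <winsock2.h>
#include <stdio.h>

static void
demo_overlapped_recv(SOCKET s)
{
	char		buf[1024];
	WSABUF		wbuf;
	WSAOVERLAPPED ov = {0};
	DWORD		nread = 0;
	DWORD		flags = 0;
	int			rc;

	ov.hEvent = WSACreateEvent();
	wbuf.buf = buf;
	wbuf.len = (ULONG) sizeof(buf) - 1;

	/*
	 * Post the receive first: incoming data is copied straight into buf,
	 * so the kernel no longer holds anything it could throw away.
	 */
	if (WSARecv(s, &wbuf, 1, &nread, &flags, &ov, NULL) == SOCKET_ERROR &&
		WSAGetLastError() != WSA_IO_PENDING)
	{
		printf("WSARecv -> error = %d\n", WSAGetLastError());
		WSACloseEvent(ov.hEvent);
		return;
	}

	/* Now do the send() that might race with the server's FATAL + reset. */
	rc = send(s, "hello, world!\n", 14, 0);
	printf("send -> %d, error = %d\n", rc, rc < 0 ? WSAGetLastError() : 0);

	/* Block until the posted receive completes, like a synchronous recv(). */
	if (WSAGetOverlappedResult(s, &ov, &nread, TRUE, &flags))
	{
		buf[nread] = '\0';
		printf("async recv -> \"%s\", error = 0\n", buf);
	}
	else
		printf("async recv -> \"\", error = %d\n", WSAGetLastError());

	WSACloseEvent(ov.hEvent);
}

Of course, a socket layer that keeps a receive posted like this has to decide whose buffer the kernel is writing into and when it can be reused, which is presumably where the network buffer management redesign work mentioned above comes in.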