On Tue, May 14, 2024 at 8:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether we've got unsent data pending in the problematic
> condition, nor why it'd remain unsent if we do (shouldn't the backend
> consume it anyway?). But this has the right odor for an explanation.
>
> I'm pretty hesitant to touch this area myself, because it looks an
> awful lot like commits 6051857fc and ed52c3707, which eventually
> had to be reverted. I think we need a deeper understanding of
> exactly what Winsock is doing or not doing before we try to fix it.

I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the client
closed the socket gracefully and the server never saw it as ready to
read because it lost track of the FD_CLOSE somewhere. Our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what
https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that patch is
unpleasantly complicated and unbackpatchable (it keeps a side-table of
socket FDs and event handles, so that events don't fall through the
cracks).
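
To make the side-table idea concrete, here is a rough sketch of the
latching behaviour only. It is not the code from the commitfest entry
and it makes no Winsock calls: plain ints stand in for sockets, and
every name in it (TrackedSocket, note_close_event, socket_is_readable,
forget_socket, MAX_TRACKED) is hypothetical. The point is just that
once the one-shot close notification has been recorded, later waits
keep reporting the socket as readable instead of relying on the event
arriving again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 64          /* arbitrary capacity for this sketch */

/* Hypothetical side-table entry, one per tracked socket. */
typedef struct
{
    int   sock;                 /* stand-in for a Winsock SOCKET */
    bool  in_use;
    bool  close_seen;           /* latched once a close has been reported */
} TrackedSocket;

static TrackedSocket tracked[MAX_TRACKED];

/* Find the entry for sock; optionally claim a free slot if there is none. */
static TrackedSocket *
find_entry(int sock, bool create)
{
    TrackedSocket *free_slot = NULL;

    for (size_t i = 0; i < MAX_TRACKED; i++)
    {
        if (tracked[i].in_use && tracked[i].sock == sock)
            return &tracked[i];
        if (!tracked[i].in_use && free_slot == NULL)
            free_slot = &tracked[i];
    }
    if (!create || free_slot == NULL)
        return NULL;
    free_slot->sock = sock;
    free_slot->in_use = true;
    free_slot->close_seen = false;
    return free_slot;
}

/* Call when the one-shot close notification is observed for sock. */
void
note_close_event(int sock)
{
    TrackedSocket *e = find_entry(sock, true);

    assert(e != NULL);          /* table full: out of scope for a sketch */
    e->close_seen = true;
}

/* Level-triggered view: readable if data is pending or a close was latched. */
bool
socket_is_readable(int sock, bool data_pending)
{
    TrackedSocket *e = find_entry(sock, false);

    return data_pending || (e != NULL && e->close_seen);
}

/* Drop the entry once the socket has really been closed on our side. */
void
forget_socket(int sock)
{
    TrackedSocket *e = find_entry(sock, false);

    if (e != NULL)
        e->in_use = false;
}
```

With this shape, a caller that records the close once via
note_close_event() can ask socket_is_readable() any number of times
afterwards and still get true, until forget_socket() is called. The
real problem is harder, since the state has to survive across event
handles and wait sets, which is where the complexity comes from.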