Re: Why is citext/regress failing on hamerkop? - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Why is citext/regress failing on hamerkop?
Msg-id: 847814.1715631450@sss.pgh.pa.us
In response to: Re: Why is citext/regress failing on hamerkop? (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Why is citext/regress failing on hamerkop?
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

In the meantime, some off-list investigation by Alexander Lakhin has
turned up a good deal of information about why we're seeing failures
on hamerkop in the back branches. Summarizing, it appears that

1. In a GSS-enabled Windows build without any active Kerberos server,
libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to
open a GSSAPI connection, but then gss_init_sec_context() fails,
leading to client-side reports like this:

+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already

(The first of these lines comes out during the attempted GSS
connection, the second during the only-slightly-more-successful
non-GSS connection.)

So that's problem number 1: how is it that gss_acquire_cred()
succeeds but then gss_init_sec_context() disclaims knowledge of any
credentials? Can we find a way to get this failure to be detected
during pg_GSS_have_cred_cache()? It is mighty expensive to launch a
backend connection that is doomed to fail, especially when this
happens during *every single libpq connection attempt*.
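To spell out what that check amounts to: pg_GSS_have_cred_cache()
essentially asks the GSSAPI library for default initiator credentials
via gss_acquire_cred() and reports success if any are found. Here is
a simplified sketch of that kind of probe (not the actual libpq
source; the helper name is invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>
#include <gssapi/gssapi.h>

/*
 * Simplified sketch of a credential-cache probe (illustration only,
 * not libpq's code).  The idea is just to ask the GSSAPI library for
 * default initiator credentials and report whether any were found.
 * The puzzle described above is that, on a Kerberos-less Windows box,
 * this step can report success even though the later
 * gss_init_sec_context() call finds the credential cache empty.
 */
static bool
have_gss_credentials(void)
{
	OM_uint32	major,
				minor;
	gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

	major = gss_acquire_cred(&minor,
							 GSS_C_NO_NAME,		/* default principal */
							 GSS_C_INDEFINITE,	/* no lifetime limit */
							 GSS_C_NO_OID_SET,	/* any mechanism */
							 GSS_C_INITIATE,	/* we act as the client */
							 &cred,
							 NULL,
							 NULL);
	if (GSS_ERROR(major))
		return false;

	(void) gss_release_cred(&minor, &cred);
	return true;
}
```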
2. Once gss_init_sec_context() fails, libpq abandons the connection
and starts over; since it has already initiated a GSS handshake on
the connection, there's not much choice. Although libpq faithfully
issues closesocket() on the abandoned connection, Alexander found
that the connected backend doesn't reliably see that: it may just sit
there until the AuthenticationTimeout elapses (1 minute by default).
That backend is still consuming a postmaster child slot, so if this
happens on any sizable fraction of test connection attempts, it's
little surprise that we soon get "sorry, too many clients already"
failures.

3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware. That feeds into
this problem because, with a successful isolationcheck run taking
more than a minute, there was enough time for some of the earlier
stuck sessions to time out and exit before their postmaster-child
slots were needed.

Alexander confirmed this theory by demonstrating that the main
regression tests in v15 would pass if he limited their speed enough
(by reducing the CPU allowed to a VM) but not at full speed. So the
buildfarm results suggesting this is only an issue in <= v13 must be
just a timing artifact; the problem is still there.

Of course, backends waiting till timeout is not a good behavior
independently of what is triggering that, so we have two problems to
solve here, not one. I have no ideas about the
gss_init_sec_context() failure, but I see a plausible theory about
the failure to detect socket closure in Microsoft's documentation
about closesocket() [1]:

    If the l_onoff member of the LINGER structure is zero on a stream
    socket, the closesocket call will return immediately and does not
    receive WSAEWOULDBLOCK whether the socket is blocking or
    nonblocking. However, any data queued for transmission will be
    sent, if possible, before the underlying socket is closed. This is
    also called a graceful disconnect or close. In this case, the
    Windows Sockets provider cannot release the socket and other
    resources for an arbitrary period, thus affecting applications
    that expect to use all available sockets. This is the default
    behavior for a socket.

I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?). But this has the right odor for an explanation.

I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually had
to be reverted. I think we need a deeper understanding of exactly
what Winsock is doing or not doing before we try to fix it. I wonder
if there are any Microsoft employees around here with access to the
relevant source code.

In the short run, it might be a good idea to deprecate using
--with-gssapi on Windows builds. A different stopgap idea could be to
drastically reduce the default AuthenticationTimeout, perhaps only on
Windows.

			regards, tom lane

[1] https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket
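For illustration, the SO_LINGER knob that documentation is describing
looks roughly like this in Winsock terms. This is a sketch only, not
a proposed libpq change; with l_onoff nonzero and l_linger zero the
close becomes abortive (the peer should see a reset immediately), and
whether that would be acceptable behavior for libpq is part of the
open question above:

```c
#include <winsock2.h>
#include <stdio.h>

#pragma comment(lib, "ws2_32.lib")

/*
 * Illustrative sketch only (not libpq code): the difference between
 * the default "graceful" closesocket() quoted above and a hard close.
 * With l_onoff = 1 and l_linger = 0, closesocket() aborts the
 * connection immediately instead of letting Winsock hold the socket
 * and other resources while it tries to flush queued data.
 */
static void
close_abortively(SOCKET sock)
{
	LINGER		lg;

	lg.l_onoff = 1;			/* enable SO_LINGER */
	lg.l_linger = 0;		/* zero timeout => abortive close */

	if (setsockopt(sock, SOL_SOCKET, SO_LINGER,
				   (const char *) &lg, sizeof(lg)) != 0)
		fprintf(stderr, "setsockopt(SO_LINGER) failed: %d\n",
				WSAGetLastError());

	closesocket(sock);
}
```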