Re: Why is citext/regress failing on hamerkop? - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Why is citext/regress failing on hamerkop?
Msg-id: 847814.1715631450@sss.pgh.pa.us
In response to: Re: Why is citext/regress failing on hamerkop?  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

In the meantime, some off-list investigation by Alexander Lakhin
has turned up a good deal of information about why we're seeing
failures on hamerkop in the back branches.  Summarizing, it
appears that

1. In a GSS-enabled Windows build without any active Kerberos server,
libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to
open a GSSAPI connection, but then gss_init_sec_context() fails,
leading to client-side reports like this:

+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure.  Minor code may provide more information:Credential cache is empty
+FATAL:  sorry, too many clients already

(The first of these lines comes out during the attempted GSS
connection, the second during the only-slightly-more-successful
non-GSS connection.)  So that's problem number 1: how is it that
gss_acquire_cred() succeeds but then gss_init_sec_context() disclaims
knowledge of any credentials?  Can we find a way to get this failure
to be detected during pg_GSS_have_cred_cache()?  It is mighty
expensive to launch a backend connection that is doomed to fail,
especially when this happens during *every single libpq connection
attempt*.
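
To make that concrete, the sort of extra probing I have in mind would be
along these lines.  This is only a sketch against the generic GSSAPI
calls; whether MIT's Windows build actually reports a zero lifetime for
an empty credential cache is exactly the open question, and
check_default_cred() is an invented name, not anything in libpq:

    /*
     * Sketch only: after gss_acquire_cred() succeeds, ask how much
     * lifetime the default credential actually has.  If an empty cache
     * shows up as a zero lifetime (an assumption, not verified on
     * Windows), we could bail out here instead of attempting a doomed
     * GSSAPI connection.
     */
    #include <gssapi/gssapi.h>
    #include <stdbool.h>

    static bool
    check_default_cred(void)    /* hypothetical helper, not in libpq */
    {
        OM_uint32   major, minor, lifetime = 0;
        gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;
        bool        usable = false;

        major = gss_acquire_cred(&minor, GSS_C_NO_NAME, GSS_C_INDEFINITE,
                                 GSS_C_NO_OID_SET, GSS_C_INITIATE,
                                 &cred, NULL, NULL);
        if (major != GSS_S_COMPLETE)
            return false;

        major = gss_inquire_cred(&minor, cred, NULL, &lifetime, NULL, NULL);
        if (major == GSS_S_COMPLETE && lifetime > 0)
            usable = true;

        (void) gss_release_cred(&minor, &cred);
        return usable;
    }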

2. Once gss_init_sec_context() fails, libpq abandons the connection
and starts over; since it has already initiated a GSS handshake
on the connection, there's not much choice.  Although libpq faithfully
issues closesocket() on the abandoned connection, Alexander found
that the connected backend doesn't reliably see that: it may just
sit there until the AuthenticationTimeout elapses (1 minute by
default).  That backend is still consuming a postmaster child slot,
so if this happens on any sizable fraction of test connection
attempts, it's little surprise that we soon get "sorry, too many
clients already" failures.
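
For reference, the teardown that Microsoft's documentation describes as a
graceful close the peer should notice promptly is a half-close plus drain
rather than a bare closesocket().  Roughly like this (plain Winsock, not
libpq code, and I am not asserting this is practical for an abandoned GSS
handshake):

    /*
     * Winsock's documented graceful-shutdown sequence, for comparison
     * with libpq's bare closesocket(): half-close the sending side,
     * drain whatever the peer still sends, then close.  Sketch only.
     */
    #include <winsock2.h>

    static void
    close_gracefully(SOCKET s)
    {
        char    buf[1024];

        if (shutdown(s, SD_SEND) == 0)
        {
            /* discard incoming data until the peer closes its side */
            while (recv(s, buf, sizeof(buf), 0) > 0)
                ;
        }
        closesocket(s);
    }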

3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started.  In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after.  So it
looks like the animal was moved onto faster hardware.  That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.

Alexander confirmed this theory by demonstrating that the main
regression tests in v15 would pass if he limited their speed enough
(by reducing the CPU allowed to a VM) but not at full speed.  So the
buildfarm results suggesting this is only an issue in <= v13 must
be just a timing artifact; the problem is still there.

Of course, backends waiting until timeout is not good behavior
regardless of what is triggering it, so we have two problems to
solve here, not one.  I have no ideas about the gss_init_sec_context()
failure, but I see a plausible theory about the failure to detect
socket closure in Microsoft's documentation about closesocket() [1]:

    If the l_onoff member of the LINGER structure is zero on a stream
    socket, the closesocket call will return immediately and does not
    receive WSAEWOULDBLOCK whether the socket is blocking or
    nonblocking. However, any data queued for transmission will be
    sent, if possible, before the underlying socket is closed. This is
    also called a graceful disconnect or close. In this case, the
    Windows Sockets provider cannot release the socket and other
    resources for an arbitrary period, thus affecting applications
    that expect to use all available sockets. This is the default
    behavior for a socket.

I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?).  But this has the right odor for an explanation.
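
For what it's worth, the knob that passage is describing is SO_LINGER,
and the non-default setting produces a hard (abortive) close.  I am not
proposing we do that, since an RST-style close has problems of its own,
but for anyone experimenting, it looks about like this (sketch only):

    /*
     * The non-default case in the quoted documentation: l_onoff = 1 with
     * l_linger = 0 makes closesocket() abort the connection (RST)
     * instead of attempting a graceful close.  Illustration only, not a
     * proposed fix.
     */
    #include <winsock2.h>

    static void
    close_abortively(SOCKET s)
    {
        LINGER  lg;

        lg.l_onoff = 1;     /* enable linger ... */
        lg.l_linger = 0;    /* ... with zero timeout => hard close */
        (void) setsockopt(s, SOL_SOCKET, SO_LINGER,
                          (const char *) &lg, sizeof(lg));
        closesocket(s);
    }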

I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted.  I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I wonder if there are any Microsoft employees around here with
access to the relevant source code.

In the short run, it might be a good idea to deprecate using
--with-gssapi on Windows builds.  A different stopgap idea
could be to drastically reduce the default AuthenticationTimeout,
perhaps only on Windows.
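
For anyone bitten by this in the field meanwhile, the run-time knobs
corresponding to those two ideas are one-liners (the 10s value is
arbitrary, purely for illustration):

    # postgresql.conf: don't let stuck authentication attempts hold a
    # child slot for the default full minute
    authentication_timeout = 10s

    # or on the libpq side, skip GSS encryption negotiation entirely:
    psql "dbname=postgres gssencmode=disable"

Neither addresses the underlying Winsock question, of course.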

            regards, tom lane

[1] https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket


