Re: A couple of random BF failures in kerberosCheck - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: A couple of random BF failures in kerberosCheck |
Date | |
Msg-id | 21009.1564866256@sss.pgh.pa.us Whole thread Raw |
In response to | A couple of random BF failures in kerberosCheck (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: A couple of random BF failures in kerberosCheck
|
List | pgsql-hackers |
Thomas Munro <thomas.munro@gmail.com> writes: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25 > I wondered if this might be like the recently fixed problem with slapd > not being ready to handle requests yet, since we start up krb5kdc > first and then don't do anything explicit to wait for it, but it > doesn't look like an obvious failure to reach it. I spent a bit of time trying to reproduce these failures, using a Fedora 29 install that should pretty nearly match what crake is running. I didn't yet match the shown failures, but I did get this after a bunch of attempts: # Running: /usr/sbin/krb5kdc -P /home/tgl/pgsql/src/test/kerberos/tmp_check/krb5kdc.pid Bail out! system /usr/sbin/krb5kdc failed Looking into the tmp_check/krb5kdc.log file finds Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](info): setting up network... krb5kdc: Address already in use - Cannot bind server socket on 127.0.0.1.55324 Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](Error): Failed setting up a TCP socket (for 127.0.0.1.55324) krb5kdc: Address already in use - Error setting up network So this leads to two points: * kerberos/t/001_auth.pl just blithely assumes that it can pick any random port above 48K and that's guaranteed to be free. Maybe we should split out the code in get_new_node for finding a free TCP port, so we can call it here? * AFAICS, the only provision for shutting down krb5kdc at the end of the test run is END { kill 'INT', `cat $kdc_pidfile` if -f $kdc_pidfile; } I wonder how reliable that is, especially in contexts where the calling script might do "rm -rf tmp_check" shortly afterwards. Maybe it'd be better to try to shut down krb5kdc explicitly before we exit the test script. I'd suggest waiting for krb5kdc to remove its pidfile, except it seems not to do so :-( Despite my suspicions about the shutdown provisions, I found that this is somewhat reproducible and the problematic port is *not* the one assigned in the previous iteration of 001_auth.pl. However, I notice that after running this test in a loop for awhile, there are an awful lot of local loopback connections in TIME_WAIT state. I hypothesize that the failures correspond to cases where we try to re-use a port number that some previous test iteration used, possibly on the client side not the server side of GSS. I wonder whether we are doing something that keeps those GSS query connections from being closed more cleanly/rapidly. (The sockets do go away after a minute or so, but why are they in TIME_WAIT at all?) None of these points seem to explain the buildfarm failures, though, especially not elver's where only one connection attempt failed. regards, tom lane
pgsql-hackers by date: