Re: A couple of random BF failures in kerberosCheck - Mailing list pgsql-hackers

From Tom Lane
Subject Re: A couple of random BF failures in kerberosCheck
Date
Msg-id 21009.1564866256@sss.pgh.pa.us
Whole thread Raw
In response to A couple of random BF failures in kerberosCheck  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: A couple of random BF failures in kerberosCheck
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

> I wondered if this might be like the recently fixed problem with slapd
> not being ready to handle requests yet, since we start up krb5kdc
> first and then don't do anything explicit to wait for it, but it
> doesn't look like an obvious failure to reach it.

I spent a bit of time trying to reproduce these failures, using a
Fedora 29 install that should pretty nearly match what crake is
running.  I didn't yet match the shown failures, but I did get this
after a bunch of attempts:

# Running: /usr/sbin/krb5kdc -P /home/tgl/pgsql/src/test/kerberos/tmp_check/krb5kdc.pid
Bail out!  system /usr/sbin/krb5kdc failed

Looking into the tmp_check/krb5kdc.log file finds

Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](info): setting up network...
krb5kdc: Address already in use - Cannot bind server socket on 127.0.0.1.55324
Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](Error): Failed setting up a TCP socket (for 127.0.0.1.55324)
krb5kdc: Address already in use - Error setting up network

So this leads to two points:

* kerberos/t/001_auth.pl just blithely assumes that it can pick
any random port above 48K and that's guaranteed to be free.
Maybe we should split out the code in get_new_node for finding
a free TCP port, so we can call it here?

* AFAICS, the only provision for shutting down krb5kdc at the end of
the test run is

END
{
    kill 'INT', `cat $kdc_pidfile` if -f $kdc_pidfile;
}

I wonder how reliable that is, especially in contexts where the calling
script might do "rm -rf tmp_check" shortly afterwards.  Maybe it'd be
better to try to shut down krb5kdc explicitly before we exit the test
script.  I'd suggest waiting for krb5kdc to remove its pidfile, except
it seems not to do so :-(

Despite my suspicions about the shutdown provisions, I found that this is
somewhat reproducible and the problematic port is *not* the one assigned
in the previous iteration of 001_auth.pl.  However, I notice that after
running this test in a loop for awhile, there are an awful lot of local
loopback connections in TIME_WAIT state.  I hypothesize that the failures
correspond to cases where we try to re-use a port number that some
previous test iteration used, possibly on the client side not the server
side of GSS.  I wonder whether we are doing something that keeps those
GSS query connections from being closed more cleanly/rapidly.  (The
sockets do go away after a minute or so, but why are they in TIME_WAIT
at all?)

None of these points seem to explain the buildfarm failures, though,
especially not elver's where only one connection attempt failed.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: partition routing layering in nodeModifyTable.c
Next
From: Tomas Vondra
Date:
Subject: Re: Redacting information from logs