Thread: A couple of random BF failures in kerberosCheck

A couple of random BF failures in kerberosCheck

From
Thomas Munro
Date:
Hello,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

I wondered if this might be like the recently fixed problem with slapd
not being ready to handle requests yet, since we start up krb5kdc
first and then don't do anything explicit to wait for it, but it
doesn't look like an obvious failure to reach it.  It looks like test
3 on elver connected successfully but didn't like the answer it got
for this query:

SELECT gss_authenticated AND encrypted from pg_stat_gssapi where pid =
pg_backend_pid();

-- 
Thomas Munro
https://enterprisedb.com



Re: A couple of random BF failures in kerberosCheck

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

> I wondered if this might be like the recently fixed problem with slapd
> not being ready to handle requests yet, since we start up krb5kdc
> first and then don't do anything explicit to wait for it, but it
> doesn't look like an obvious failure to reach it.

I spent a bit of time trying to reproduce these failures, using a
Fedora 29 install that should pretty nearly match what crake is
running.  I didn't yet match the shown failures, but I did get this
after a bunch of attempts:

# Running: /usr/sbin/krb5kdc -P /home/tgl/pgsql/src/test/kerberos/tmp_check/krb5kdc.pid
Bail out!  system /usr/sbin/krb5kdc failed

Looking into the tmp_check/krb5kdc.log file finds

Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](info): setting up network...
krb5kdc: Address already in use - Cannot bind server socket on 127.0.0.1.55324
Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](Error): Failed setting up a TCP socket (for 127.0.0.1.55324)
krb5kdc: Address already in use - Error setting up network

So this leads to two points:

* kerberos/t/001_auth.pl just blithely assumes that it can pick
any random port above 48K and that's guaranteed to be free.
Maybe we should split out the code in get_new_node for finding
a free TCP port, so we can call it here?

* AFAICS, the only provision for shutting down krb5kdc at the end of
the test run is

END
{
    kill 'INT', `cat $kdc_pidfile` if -f $kdc_pidfile;
}

I wonder how reliable that is, especially in contexts where the calling
script might do "rm -rf tmp_check" shortly afterwards.  Maybe it'd be
better to try to shut down krb5kdc explicitly before we exit the test
script.  I'd suggest waiting for krb5kdc to remove its pidfile, except
it seems not to do so :-(

Despite my suspicions about the shutdown provisions, I found that this is
somewhat reproducible and the problematic port is *not* the one assigned
in the previous iteration of 001_auth.pl.  However, I notice that after
running this test in a loop for awhile, there are an awful lot of local
loopback connections in TIME_WAIT state.  I hypothesize that the failures
correspond to cases where we try to re-use a port number that some
previous test iteration used, possibly on the client side not the server
side of GSS.  I wonder whether we are doing something that keeps those
GSS query connections from being closed more cleanly/rapidly.  (The
sockets do go away after a minute or so, but why are they in TIME_WAIT
at all?)

None of these points seem to explain the buildfarm failures, though,
especially not elver's where only one connection attempt failed.

            regards, tom lane



Re: A couple of random BF failures in kerberosCheck

From
Tom Lane
Date:
I wrote:
> * kerberos/t/001_auth.pl just blithely assumes that it can pick
> any random port above 48K and that's guaranteed to be free.
> Maybe we should split out the code in get_new_node for finding
> a free TCP port, so we can call it here?

I've confirmed that the reason it's failing on my machine is exactly
that krb5kdc tries to bind to a socket that is still in TIME_WAIT state.
Also, it looks like the socket is typically one that was used by the
GSSAPI client side (no surprise, the test leaves a lot more of those
than the one server socket), so we'd have no record of it even if we
were somehow saving state from prior runs.

So I propose the attached patch, which seems to fix this for me.

The particular case I'm looking at (running these tests in a tight
loop) is of course not that interesting, but I argue that it's just
increasing the odds of failure enough that I can isolate the cause.
A buildfarm animal running both kerberos and ldap tests is almost
certainly at risk of such a failure with low probability.

(Still don't know what actually happened in those two buildfarm
failures, though.)

            regards, tom lane

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 237b6fb..34845a7 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -69,7 +69,7 @@ my $krb5_conf   = "${TestLib::tmp_check}/krb5.conf";
 my $kdc_conf    = "${TestLib::tmp_check}/kdc.conf";
 my $krb5_log    = "${TestLib::tmp_check}/krb5libs.log";
 my $kdc_log     = "${TestLib::tmp_check}/krb5kdc.log";
-my $kdc_port    = int(rand() * 16384) + 49152;
+my $kdc_port    = get_free_port();
 my $kdc_datadir = "${TestLib::tmp_check}/krb5kdc";
 my $kdc_pidfile = "${TestLib::tmp_check}/krb5kdc.pid";
 my $keytab      = "${TestLib::tmp_check}/krb5.keytab";
diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl
index 84a3300..47bc090 100644
--- a/src/test/ldap/t/001_auth.pl
+++ b/src/test/ldap/t/001_auth.pl
@@ -55,7 +55,7 @@ my $slapd_pidfile = "${TestLib::tmp_check}/slapd.pid";
 my $slapd_logfile = "${TestLib::tmp_check}/slapd.log";
 my $ldap_conf     = "${TestLib::tmp_check}/ldap.conf";
 my $ldap_server   = 'localhost';
-my $ldap_port     = int(rand() * 16384) + 49152;
+my $ldap_port     = get_free_port();
 my $ldaps_port    = $ldap_port + 1;
 my $ldap_url      = "ldap://$ldap_server:$ldap_port";
 my $ldaps_url     = "ldaps://$ldap_server:$ldaps_port";
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37..1481160 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -63,6 +63,9 @@ PostgresNode - class representing PostgreSQL server instance
   # Stop the server
   $node->stop('fast');
 
+  # Find a free, unprivileged TCP port to bind some other service to
+  my $port = PostgresNode->get_free_port();
+
 =head1 DESCRIPTION
 
 PostgresNode contains a set of routines able to work on a PostgreSQL node,
@@ -102,6 +105,7 @@ use Scalar::Util qw(blessed);
 
 our @EXPORT = qw(
   get_new_node
+  get_free_port
 );
 
 our ($use_tcp, $test_localhost, $test_pghost, $last_host_assigned,
@@ -1071,9 +1075,68 @@ sub get_new_node
     my $class = 'PostgresNode';
     $class = shift if scalar(@_) % 2 != 1;
     my ($name, %params) = @_;
-    my $port_is_forced = defined $params{port};
-    my $found          = $port_is_forced;
-    my $port = $port_is_forced ? $params{port} : $last_port_assigned;
+
+    # Select a port.
+    my $port;
+    if (defined $params{port})
+    {
+        $port = $params{port};
+    }
+    else
+    {
+        # When selecting a port, we look for an unassigned TCP port number,
+        # even if we intend to use only Unix-domain sockets.  This is clearly
+        # necessary on $use_tcp (Windows) configurations, and it seems like a
+        # good idea on Unixen as well.
+        $port = get_free_port();
+    }
+
+    # Select a host.
+    my $host = $test_pghost;
+    if ($params{own_host})
+    {
+        if ($use_tcp)
+        {
+            $last_host_assigned++;
+            $last_host_assigned > 254 and BAIL_OUT("too many own_host nodes");
+            $host = '127.0.0.' . $last_host_assigned;
+        }
+        else
+        {
+            $host = "$test_pghost/$name"; # Assume $name =~ /^[-_a-zA-Z0-9]+$/
+            mkdir $host;
+        }
+    }
+
+    # Lock port number found by creating a new node
+    my $node = $class->new($name, $host, $port);
+
+    # Add node to list of nodes
+    push(@all_nodes, $node);
+
+    return $node;
+}
+
+=pod
+
+=item PostgresNode->get_free_port()
+
+Locate an unprivileged (high) TCP port that's not currently bound to
+anything.  This is used by get_new_node, and is also exported for use
+by test cases that need to start other, non-Postgres servers.
+
+Ports assigned to existing PostgresNode objects are automatically
+excluded, even if those servers are not currently running.
+
+XXX A port available now may become unavailable by the time we start
+the desired service.
+
+=cut
+
+sub get_free_port
+{
+    my $found = 0;
+    my $port  = $last_port_assigned;
 
     while ($found == 0)
     {
@@ -1090,24 +1153,18 @@ sub get_new_node
             $found = 0 if ($node->port == $port);
         }
 
-        # Check to see if anything else is listening on this TCP port.  This
-        # is *necessary* on $use_tcp (Windows) configurations.  Seek a port
-        # available for all possible listen_addresses values, for own_host
-        # nodes and so the caller can harness this port for the widest range
-        # of purposes.  The 0.0.0.0 test achieves that for post-2006 Cygwin,
-        # which automatically sets SO_EXCLUSIVEADDRUSE.  The same holds for
-        # MSYS (a Cygwin fork).  Testing 0.0.0.0 is insufficient for Windows
-        # native Perl (https://stackoverflow.com/a/14388707), so we also test
+        # Check to see if anything else is listening on this TCP port.
+        # Seek a port available for all possible listen_addresses values,
+        # so callers can harness this port for the widest range of purposes.
+        # The 0.0.0.0 test achieves that for post-2006 Cygwin, which
+        # automatically sets SO_EXCLUSIVEADDRUSE.  The same holds for MSYS (a
+        # Cygwin fork).  Testing 0.0.0.0 is insufficient for Windows native
+        # Perl (https://stackoverflow.com/a/14388707), so we also test
         # individual addresses.
         #
-        # This seems like a good idea on Unixen as well, even though we don't
-        # ask the postmaster to open a TCP port on Unix.  On Non-Linux,
-        # non-Windows kernels, binding to 127.0.0.1/24 addresses other than
-        # 127.0.0.1 might fail with EADDRNOTAVAIL.  Binding to 0.0.0.0 is
-        # unnecessary on non-Windows systems.
-        #
-        # XXX A port available now may become unavailable by the time we start
-        # the postmaster.
+        # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses
+        # other than 127.0.0.1 might fail with EADDRNOTAVAIL.  Binding to
+        # 0.0.0.0 is unnecessary on non-Windows systems.
         if ($found == 1)
         {
             foreach my $addr (qw(127.0.0.1),
@@ -1120,33 +1177,10 @@ sub get_new_node
 
     print "# Found port $port\n";
 
-    # Select a host.
-    my $host = $test_pghost;
-    if ($params{own_host})
-    {
-        if ($use_tcp)
-        {
-            $last_host_assigned++;
-            $last_host_assigned > 254 and BAIL_OUT("too many own_host nodes");
-            $host = '127.0.0.' . $last_host_assigned;
-        }
-        else
-        {
-            $host = "$test_pghost/$name"; # Assume $name =~ /^[-_a-zA-Z0-9]+$/
-            mkdir $host;
-        }
-    }
-
-    # Lock port number found by creating a new node
-    my $node = $class->new($name, $host, $port);
-
-    # Add node to list of nodes
-    push(@all_nodes, $node);
+    # Update port for next time
+    $last_port_assigned = $port;
 
-    # And update port for next time
-    $port_is_forced or $last_port_assigned = $port;
-
-    return $node;
+    return $port;
 }
 
 # Internal routine to check whether a host:port is available to bind

Re: A couple of random BF failures in kerberosCheck

From
Andrew Dunstan
Date:
On 8/3/19 6:42 PM, Tom Lane wrote:
> I wrote:
>> * kerberos/t/001_auth.pl just blithely assumes that it can pick
>> any random port above 48K and that's guaranteed to be free.
>> Maybe we should split out the code in get_new_node for finding
>> a free TCP port, so we can call it here?
> I've confirmed that the reason it's failing on my machine is exactly
> that krb5kdc tries to bind to a socket that is still in TIME_WAIT state.
> Also, it looks like the socket is typically one that was used by the
> GSSAPI client side (no surprise, the test leaves a lot more of those
> than the one server socket), so we'd have no record of it even if we
> were somehow saving state from prior runs.
>
> So I propose the attached patch, which seems to fix this for me.
>
> The particular case I'm looking at (running these tests in a tight
> loop) is of course not that interesting, but I argue that it's just
> increasing the odds of failure enough that I can isolate the cause.
> A buildfarm animal running both kerberos and ldap tests is almost
> certainly at risk of such a failure with low probability.
>
> (Still don't know what actually happened in those two buildfarm
> failures, though.)
>
>             


Looks good. A couple of minor nits:


. since we're exporting the name there's no need to document it as a
class method. I'd remove the "PostgresNode->" from the couple of places
you have it in the docco. You're not actually calling it that way
anywhere, and indeed doing so ends up passing 'PostgresNode' as a
useless parameter to the subroutine. This is different from calling it
with a qualified name (PostgresNode::get_free_port()).

. in the inner loop we should probably exit the loop if we set found to
0. There's no point testing other addresses in that case. Something like
"last unless found;" would do the trick.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A couple of random BF failures in kerberosCheck

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 8/3/19 6:42 PM, Tom Lane wrote:
>> So I propose the attached patch, which seems to fix this for me.

> Looks good. A couple of minor nits:

Will fix, thanks for the review!

            regards, tom lane