ssl tests aren't concurrency safe due to get_free_port() - Mailing list pgsql-hackers

From Andres Freund
Subject ssl tests aren't concurrency safe due to get_free_port()
Date
Msg-id 20221002164931.d57hlutrcz4d2zi7@awork3.anarazel.de
Whole thread Raw
Responses Re: ssl tests aren't concurrency safe due to get_free_port()
List pgsql-hackers
Hi,

See e.g. https://cirrus-ci.com/task/4682373060100096
2022-10-01 15:15:21.849 UTC [41962][postmaster] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use
2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT:  Is another postmaster already running on port 57003? If not,
waita few seconds and retry.
 
2022-10-01 15:15:21.849 UTC [41962][postmaster] WARNING:  could not create listen socket for "127.0.0.1"

I downloaded all test logs and grepping for 57003 shows the problem:

build/testrun/ldap/001_auth/log/regress_log_001_auth
3:# Checking port 57003
4:# Found port 57003
22:# Running: /usr/sbin/slapd -f /tmp/cirrus-ci-build/build/testrun/ldap/001_auth/data/slapd.conf -h
ldap://localhost:57002ldaps://localhost:57003
 

build/testrun/ldap/001_auth/log/001_auth_node.log
253:2022-10-01 15:15:25.103 UTC [42574][client backend] [[unknown]][3/1:0] DETAIL:  Connection matched pg_hba.conf line
1:"local all all ldap ldapurl="ldaps://localhost:57003/dc=example,dc=net??sub?(uid=$username)" ldaptls=1"
 

build/testrun/ssl/001_ssltests/log/regress_log_001_ssltests
2:# Checking port 57003
3:# Found port 57003
8:Connection string: port=57003 host=/tmp/1k5yhaWLQ1

build/testrun/ssl/001_ssltests/log/001_ssltests_primary.log
2:2022-10-01 15:15:20.668 UTC [41740][postmaster] LOG:  listening on Unix socket "/tmp/1k5yhaWLQ1/.s.PGSQL.57003"
58:2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT:  Is another postmaster already running on port 57003? If not,
waita few seconds and retry.
 


I.e. we chose the same port for slapd as part of ldap's 001_auth.pl as for the
postgres instance of ssl's 001_ssltests.pl.


I don't think get_free_port() has any chance of being reliable as is. It's
fundamentally racy just among concurrently running tests, without even
considering things external to the tests (given it's using the range of ports
auto-assigned for client tcp ports...).


The current code is from 803466b6ffa, which said:
    This isn't 100% bulletproof, since
    conceivably something else on the machine could grab the port between
    the time we check and the time we actually start the server.  But that's
    a pretty short window, so in practice this should be good enough.

but I've seen this fail a couple times, so I suspect it's unfortunately not
good enough.


I can see a few potential ways of improving the situation:

1) Improve the port we start the search for a unique port from. We use the
   following bit right now:

    # Tracking of last port value assigned to accelerate free port lookup.
    $last_port_assigned = int(rand() * 16384) + 49152;

   It *might* be less likely that we hit conflicts if we start the search on a
   port based on the pid, rather than rand(). We e.g., could start at
   something like (($$ * 16) % 16384 + 49152), giving a decent likelihood that
   each test has 16 free ports.

   Perhaps also worth to increase the range of ports searched?


2) Use a lockfile containing a pid to protect the choice of a port within a
   build directory. Before accepting a port get_free_port() would check if the
   a lockfile exists for the port and if so, if the test using it is still
   alive.  That will protect against racyness between multiple tests inside a
   build directory, but won't protect against races between multiple builds
   running concurrently on a machine (e.g. on a buildfarm host)


3) We could generate unique port ranges for each test and thus remove the
   chance of conflicts within a builddir and substantially reduce the
   likelihood of conflicts between build directories (as the conflict would be
   between two tests in different build directories, rather than all tests).

   This would be easy to do in the meson build, but nontrivial in autoconf
   afaics.


4) Add retries to the tests that need get_free_port(). That seems hard to get
   right, because every single restart of postgres (or some other service that
   we used get_free_port()) provides the potential for a new conflict.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: Re: Question: test "aggregates" failed in 32-bit machine
Next
From: Andres Freund
Date:
Subject: Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]