ssl tests aren't concurrency safe due to get_free_port() - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | ssl tests aren't concurrency safe due to get_free_port() |
Date | |
Msg-id | 20221002164931.d57hlutrcz4d2zi7@awork3.anarazel.de Whole thread Raw |
Responses |
Re: ssl tests aren't concurrency safe due to get_free_port()
|
List | pgsql-hackers |
Hi, See e.g. https://cirrus-ci.com/task/4682373060100096 2022-10-01 15:15:21.849 UTC [41962][postmaster] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT: Is another postmaster already running on port 57003? If not, waita few seconds and retry. 2022-10-01 15:15:21.849 UTC [41962][postmaster] WARNING: could not create listen socket for "127.0.0.1" I downloaded all test logs and grepping for 57003 shows the problem: build/testrun/ldap/001_auth/log/regress_log_001_auth 3:# Checking port 57003 4:# Found port 57003 22:# Running: /usr/sbin/slapd -f /tmp/cirrus-ci-build/build/testrun/ldap/001_auth/data/slapd.conf -h ldap://localhost:57002ldaps://localhost:57003 build/testrun/ldap/001_auth/log/001_auth_node.log 253:2022-10-01 15:15:25.103 UTC [42574][client backend] [[unknown]][3/1:0] DETAIL: Connection matched pg_hba.conf line 1:"local all all ldap ldapurl="ldaps://localhost:57003/dc=example,dc=net??sub?(uid=$username)" ldaptls=1" build/testrun/ssl/001_ssltests/log/regress_log_001_ssltests 2:# Checking port 57003 3:# Found port 57003 8:Connection string: port=57003 host=/tmp/1k5yhaWLQ1 build/testrun/ssl/001_ssltests/log/001_ssltests_primary.log 2:2022-10-01 15:15:20.668 UTC [41740][postmaster] LOG: listening on Unix socket "/tmp/1k5yhaWLQ1/.s.PGSQL.57003" 58:2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT: Is another postmaster already running on port 57003? If not, waita few seconds and retry. I.e. we chose the same port for slapd as part of ldap's 001_auth.pl as for the postgres instance of ssl's 001_ssltests.pl. I don't think get_free_port() has any chance of being reliable as is. It's fundamentally racy just among concurrently running tests, without even considering things external to the tests (given it's using the range of ports auto-assigned for client tcp ports...). The current code is from 803466b6ffa, which said: This isn't 100% bulletproof, since conceivably something else on the machine could grab the port between the time we check and the time we actually start the server. But that's a pretty short window, so in practice this should be good enough. but I've seen this fail a couple times, so I suspect it's unfortunately not good enough. I can see a few potential ways of improving the situation: 1) Improve the port we start the search for a unique port from. We use the following bit right now: # Tracking of last port value assigned to accelerate free port lookup. $last_port_assigned = int(rand() * 16384) + 49152; It *might* be less likely that we hit conflicts if we start the search on a port based on the pid, rather than rand(). We e.g., could start at something like (($$ * 16) % 16384 + 49152), giving a decent likelihood that each test has 16 free ports. Perhaps also worth to increase the range of ports searched? 2) Use a lockfile containing a pid to protect the choice of a port within a build directory. Before accepting a port get_free_port() would check if the a lockfile exists for the port and if so, if the test using it is still alive. That will protect against racyness between multiple tests inside a build directory, but won't protect against races between multiple builds running concurrently on a machine (e.g. on a buildfarm host) 3) We could generate unique port ranges for each test and thus remove the chance of conflicts within a builddir and substantially reduce the likelihood of conflicts between build directories (as the conflict would be between two tests in different build directories, rather than all tests). This would be easy to do in the meson build, but nontrivial in autoconf afaics. 4) Add retries to the tests that need get_free_port(). That seems hard to get right, because every single restart of postgres (or some other service that we used get_free_port()) provides the potential for a new conflict. Greetings, Andres Freund
pgsql-hackers by date: