Thread: ssl tests fail due to TCP port conflict
Hello hackers, As buildfarm shows, ssl tests are not stable enough when running via meson. I can see the following failures for the last 90 days: REL_16_STABLE: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-12%2023%3A15%3A50 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-21%2000%3A35%3A23 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2011%3A15%3A31 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-16%2016%3A10%3A45 master: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-08%2011%3A19%3A42 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-11%2022%3A23%3A28 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-17%2023%3A03%3A50 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-20%2009%3A21%3A30 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-20%2016%3A53%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-07%2012%3A25%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-08%2019%3A50%3A13 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-19%2021%3A24%3A30 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-04-22%2006%3A17%3A13 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-29%2023%3A27%3A15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-30%2000%3A24%3A28 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-06-04%2011%3A20%3A07 All the failures are caused by the server inability to restart with a previously chosen random port on TCP/IP. For example: 2024-06-04 11:30:40.227 UTC [3373644][postmaster][:0] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by clang-13.0.1-11, 64-bit 2024-06-04 11:30:40.231 UTC [3373644][postmaster][:0] LOG: listening on Unix socket "/tmp/tUmT8ItNQ2/.s.PGSQL.60362" 2024-06-04 11:30:40.337 UTC [3373798][startup][:0] LOG: database system was shut down at 2024-06-04 11:21:25 UTC ... 2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by clang-13.0.1-11, 64-bit 2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] HINT: Is another postmaster already running on port 60362? If not, wait a few seconds and retry. 2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] WARNING: could not create listen socket for "127.0.0.1" I've managed to reproduce the failure locally with the following change: --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -149,7 +149,7 @@ INIT $ENV{PGDATABASE} = 'postgres'; # Tracking of last port value assigned to accelerate free port lookup. - $last_port_assigned = int(rand() * 16384) + 49152; + $last_port_assigned = int(rand() * 1024) + 49152; and multiplying one of the tests. for i in `seq 50`; do cp .../src/test/ssl/t/001_ssltests.pl \ .../src/test/ssl/t/001_ssltests_$i.pl; \ sed -E "s|('t/001_ssltests.pl',)|\1\n't/001_ssltests_$i.pl',|" -i \ .../src/test/ssl/meson.build; done Then `meson test --suite ssl` fails for me as below: ... 26/53 postgresql:ssl / ssl/001_ssltests_26 OK 9.03s 205 subtests passed 27/53 postgresql:ssl / ssl/001_ssltests_18 ERROR 3.55s (exit status 255 or signal 127 SIGinvalid) >>> OPENSSL=/usr/bin/openssl ... 28/53 postgresql:ssl / ssl/001_ssltests_25 OK 10.98s 205 subtests passed 29/53 postgresql:ssl / ssl/001_ssltests_24 OK 10.84s 205 subtests passed testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log contains: 2024-06-05 11:51:22.005 UTC [710541] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2024-06-05 11:51:22.005 UTC [710541] HINT: Is another postmaster already running on port 49632? If not, wait a few seconds and retry. `grep '\b49632\b' -r testrun/` finds: testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Checking port 49632 testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Found port 49632 testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:Connection string: port=49632 host=/tmp/sp_VLbpjJF testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:18.896 UTC [710082] LOG: listening on Unix socket "/tmp/sp_VLbpjJF/.s.PGSQL.49632" testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:22.005 UTC [710541] HINT: Is another postmaster already running on port 49632? If not, wait a few seconds and retry. ... testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Checking port 49632 testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Found port 49632 testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:Connection string: port=49632 host=/tmp/3lxVDNzuGC testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:13.337 UTC [708377] LOG: listening on Unix socket "/tmp/3lxVDNzuGC/.s.PGSQL.49632" testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:14.333 UTC [708715] LOG: listening on IPv4 address "127.0.0.1", port 49632 ... Another case (with psql using the port): testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Checking port 49448 testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found port 49448 testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.178 UTC [976826] LOG: listening on Unix socket "/tmp/GePu6gmouP/.s.PGSQL.49448" testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.491 UTC [976927] HINT: Is another postmaster already running on port 49448? If not, wait a few seconds and retry. ... testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection received: host=localhost port=49448 The broader excerpt: 2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG: connection received: host=localhost port=50326 2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG: could not accept SSL connection: EOF detected 2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG: connection received: host=localhost port=49420 2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG: could not accept SSL connection: EOF detected 2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG: connection received: host=localhost port=49430 2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca 2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG: connection received: host=localhost port=49440 2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca 2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection received: host=localhost port=49448 2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca 2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG: connection received: host=localhost port=49464 2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG: could not accept SSL connection: tlsv1 alert unknown ca 2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG: connection received: host=localhost port=49468 Maybe the ssl tests should not be considered failed in case of the TCP port binding error? Best regards, Alexander
On 2024-06-05 We 09:00, Alexander Lakhin wrote: > > Another case (with psql using the port): > testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Checking > port 49448 > testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found > port 49448 > testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 > 12:20:50.178 UTC [976826] LOG: listening on Unix socket > "/tmp/GePu6gmouP/.s.PGSQL.49448" > testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 > 12:20:50.491 UTC [976927] HINT: Is another postmaster already running > on port 49448? If not, wait a few seconds and retry. > ... > testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 > 12:20:50.491 UTC [976943] [unknown] LOG: connection received: > host=localhost port=49448 > The broader excerpt: > 2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG: connection > received: host=localhost port=50326 > 2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG: could not accept > SSL connection: EOF detected > 2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG: connection > received: host=localhost port=49420 > 2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG: could not accept > SSL connection: EOF detected > 2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG: connection > received: host=localhost port=49430 > 2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG: could not accept > SSL connection: tlsv1 alert unknown ca > 2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG: connection > received: host=localhost port=49440 > 2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG: could not accept > SSL connection: tlsv1 alert unknown ca > 2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection > received: host=localhost port=49448 > 2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG: could not accept > SSL connection: tlsv1 alert unknown ca > 2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG: connection > received: host=localhost port=49464 > 2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG: could not accept > SSL connection: tlsv1 alert unknown ca > 2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG: connection > received: host=localhost port=49468 I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then switch to using TCP as well. Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework should test for a port clash. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-06-05 We 14:10, Andrew Dunstan wrote: > > On 2024-06-05 We 09:00, Alexander Lakhin wrote: >> >> Another case (with psql using the port): >> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# >> Checking port 49448 >> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found >> port 49448 >> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 >> 12:20:50.178 UTC [976826] LOG: listening on Unix socket >> "/tmp/GePu6gmouP/.s.PGSQL.49448" >> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 >> 12:20:50.491 UTC [976927] HINT: Is another postmaster already >> running on port 49448? If not, wait a few seconds and retry. >> ... >> testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 >> 12:20:50.491 UTC [976943] [unknown] LOG: connection received: >> host=localhost port=49448 >> The broader excerpt: >> 2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG: connection >> received: host=localhost port=50326 >> 2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG: could not accept >> SSL connection: EOF detected >> 2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG: connection >> received: host=localhost port=49420 >> 2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG: could not accept >> SSL connection: EOF detected >> 2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG: connection >> received: host=localhost port=49430 >> 2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG: could not accept >> SSL connection: tlsv1 alert unknown ca >> 2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG: connection >> received: host=localhost port=49440 >> 2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG: could not accept >> SSL connection: tlsv1 alert unknown ca >> 2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG: connection >> received: host=localhost port=49448 >> 2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG: could not accept >> SSL connection: tlsv1 alert unknown ca >> 2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG: connection >> received: host=localhost port=49464 >> 2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG: could not accept >> SSL connection: tlsv1 alert unknown ca >> 2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG: connection >> received: host=localhost port=49468 > > > I think I see what's going on here. It looks like it's because we > start the server in unix socket mode, and then switch to using TCP as > well. > > Can you try your test with this patch applied and see if the problem > persists? If we start in TCP mode the framework should test for a port > clash. > > > Hmm, on closer inspection we should have reserved the port anyway. But why is the port "already used" on restart? We haven't previously opened a TCP connection on that port (except when checking if we can bind it), and instances should be locked against using that port. ... wanders away muttering and scratching head ... cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hello Andrew, 05.06.2024 21:10, Andrew Dunstan wrote: > > I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then > switch to using TCP as well. > > Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework > should test for a port clash. > It seems that the failure rate decreased (I guess the patch rules out the case with two servers choosing the same port), but I still got: 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 subtests passed 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit status 255 or signal 127 SIGinvalid) 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address "127.0.0.1": Address already in use 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster already running on port 50072? If not, wait a few seconds and retry. `grep '\b50072\b' -r testrun/` yields: testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG: connection received: host=localhost port=50072 (a psql case) That is, psql from the test instance 001_ssltests_34 opened a connection to the test server with the client port 50072 and it made using the port by the server from the test instance 001_ssltests_30 impossible. Best regards. Alexander
On 2024-06-05 We 16:00, Alexander Lakhin wrote: > Hello Andrew, > > 05.06.2024 21:10, Andrew Dunstan wrote: >> >> I think I see what's going on here. It looks like it's because we >> start the server in unix socket mode, and then switch to using TCP as >> well. >> >> Can you try your test with this patch applied and see if the problem >> persists? If we start in TCP mode the framework should test for a >> port clash. >> > > It seems that the failure rate decreased (I guess the patch rules out the > case with two servers choosing the same port), but I still got: > > 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 > subtests passed > 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit > status 255 or signal 127 SIGinvalid) > > 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 > on x86_64-linux, compiled by gcc-13.2.1, 64-bit > 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address > "127.0.0.1": Address already in use > 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster > already running on port 50072? If not, wait a few seconds and retry. > > `grep '\b50072\b' -r testrun/` yields: > testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 > 19:40:37.392 UTC [414111] [unknown] LOG: connection received: > host=localhost port=50072 > (a psql case) > > That is, psql from the test instance 001_ssltests_34 opened a > connection to > the test server with the client port 50072 and it made using the port by > the server from the test instance 001_ssltests_30 impossible. > Oh. (kicks self) Should we really be allocating ephemeral server ports in the range 41952..65535? Maybe we should be looking for an unallocated port somewhere below 41952, and above, say, 32767, so we couldn't have a client socket collision. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-06-05 We 16:49, Andrew Dunstan wrote: > > On 2024-06-05 We 16:00, Alexander Lakhin wrote: >> Hello Andrew, >> >> 05.06.2024 21:10, Andrew Dunstan wrote: >>> >>> I think I see what's going on here. It looks like it's because we >>> start the server in unix socket mode, and then switch to using TCP >>> as well. >>> >>> Can you try your test with this patch applied and see if the problem >>> persists? If we start in TCP mode the framework should test for a >>> port clash. >>> >> >> It seems that the failure rate decreased (I guess the patch rules out >> the >> case with two servers choosing the same port), but I still got: >> >> 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 >> subtests passed >> 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit >> status 255 or signal 127 SIGinvalid) >> >> 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL >> 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit >> 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 >> address "127.0.0.1": Address already in use >> 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster >> already running on port 50072? If not, wait a few seconds and retry. >> >> `grep '\b50072\b' -r testrun/` yields: >> testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 >> 19:40:37.392 UTC [414111] [unknown] LOG: connection received: >> host=localhost port=50072 >> (a psql case) >> >> That is, psql from the test instance 001_ssltests_34 opened a >> connection to >> the test server with the client port 50072 and it made using the port by >> the server from the test instance 001_ssltests_30 impossible. >> > > Oh. (kicks self) > > Should we really be allocating ephemeral server ports in the range > 41952..65535? Maybe we should be looking for an unallocated port > somewhere below 41952, and above, say, 32767, so we couldn't have a > client socket collision. Except that I see this on my Ubuntu instance: $ sudo sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_port_range = 32768 60999 And indeed I see client sockets in that range. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-06-05 We 16:00, Alexander Lakhin wrote: >> That is, psql from the test instance 001_ssltests_34 opened a >> connection to >> the test server with the client port 50072 and it made using the port by >> the server from the test instance 001_ssltests_30 impossible. > Oh. (kicks self) D'oh. > Should we really be allocating ephemeral server ports in the range > 41952..65535? Maybe we should be looking for an unallocated port > somewhere below 41952, and above, say, 32767, so we couldn't have a > client socket collision. Hmm, are there really any standards about how these port numbers are used? I wonder if we don't need to just be prepared to retry the whole thing a few times. Even if it's true that "clients" shouldn't choose ports below 41952, we still have a small chance of failure against a non-Postgres server starting up at the wrong time. regards, tom lane
On 2024-06-05 We 17:37, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 2024-06-05 We 16:00, Alexander Lakhin wrote: >>> That is, psql from the test instance 001_ssltests_34 opened a >>> connection to >>> the test server with the client port 50072 and it made using the port by >>> the server from the test instance 001_ssltests_30 impossible. >> Oh. (kicks self) > D'oh. > >> Should we really be allocating ephemeral server ports in the range >> 41952..65535? Maybe we should be looking for an unallocated port >> somewhere below 41952, and above, say, 32767, so we couldn't have a >> client socket collision. > Hmm, are there really any standards about how these port numbers > are used? > > I wonder if we don't need to just be prepared to retry the whole > thing a few times. Even if it's true that "clients" shouldn't > choose ports below 41952, we still have a small chance of failure > against a non-Postgres server starting up at the wrong time. Yeah, I think you're right. One thing we should do is be careful to use the port as soon as possible after we have picked it, to reduce the possibility that something else will use it first. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-06-05 We 16:00, Alexander Lakhin wrote: > Hello Andrew, > > 05.06.2024 21:10, Andrew Dunstan wrote: >> >> I think I see what's going on here. It looks like it's because we >> start the server in unix socket mode, and then switch to using TCP as >> well. >> >> Can you try your test with this patch applied and see if the problem >> persists? If we start in TCP mode the framework should test for a >> port clash. >> > > It seems that the failure rate decreased (I guess the patch rules out the > case with two servers choosing the same port), but I still got: > > 16/53 postgresql:ssl / ssl/001_ssltests_36 OK 15.25s 205 > subtests passed > 17/53 postgresql:ssl / ssl/001_ssltests_30 ERROR 3.17s (exit > status 255 or signal 127 SIGinvalid) > > 2024-06-05 19:40:37.395 UTC [414110] LOG: starting PostgreSQL 17beta1 > on x86_64-linux, compiled by gcc-13.2.1, 64-bit > 2024-06-05 19:40:37.395 UTC [414110] LOG: could not bind IPv4 address > "127.0.0.1": Address already in use > 2024-06-05 19:40:37.395 UTC [414110] HINT: Is another postmaster > already running on port 50072? If not, wait a few seconds and retry. > > `grep '\b50072\b' -r testrun/` yields: > testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 > 19:40:37.392 UTC [414111] [unknown] LOG: connection received: > host=localhost port=50072 > (a psql case) > > That is, psql from the test instance 001_ssltests_34 opened a > connection to > the test server with the client port 50072 and it made using the port by > the server from the test instance 001_ssltests_30 impossible. > > After sleeping on it, I still think the patch would be a good thing. Your torture test might still show some failures, but the buildfarm isn't running those, and it might be enough to eliminate or at least substantially reduce buildfarm failures by reducing to almost zero the time in which a competing script might grab the port. The biggest problem with the current script is apparently that we delay using the TCP port by starting the server in Unix socket mode, and only switch to using TCP when we restart. If changing that doesn't fix the problem we'll have to rethink. If this isn't the cause, though, I would expect to have seen similar failures from other test suites. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, 5 Jun 2024 at 23:37, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: > > On 2024-06-05 We 16:00, Alexander Lakhin wrote: > >> That is, psql from the test instance 001_ssltests_34 opened a > >> connection to > >> the test server with the client port 50072 and it made using the port by > >> the server from the test instance 001_ssltests_30 impossible. > > > Oh. (kicks self) > > D'oh. > > > Should we really be allocating ephemeral server ports in the range > > 41952..65535? Maybe we should be looking for an unallocated port > > somewhere below 41952, and above, say, 32767, so we couldn't have a > > client socket collision. > > Hmm, are there really any standards about how these port numbers > are used? > > I wonder if we don't need to just be prepared to retry the whole > thing a few times. Even if it's true that "clients" shouldn't > choose ports below 41952, we still have a small chance of failure > against a non-Postgres server starting up at the wrong time. My suggestion would be to not touch the ephemeral port range at all for these ports. In practice the ephemeral port range is used for cases where the operating system assigns the port, and the application doesn't care whot it is. Not for when you want to get a free port, but want to know in advance which one it is. For the PgBouncer test suite we do something similar as the PG its perl tests do, but there we allocate a port between 10200 and 32768: https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 Sure theoretically it's possible to hit a rare case where another server starts up at the wrong time, but that chance seems way lower than a client starting up at the wrong time. Especially since there aren't many servers that use a port with 5 digits. Attached is a patch that updates the port numbers.
Attachment
On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote: > On Wed, 5 Jun 2024 at 23:37, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> On 2024-06-05 We 16:00, Alexander Lakhin wrote: >>>> That is, psql from the test instance 001_ssltests_34 opened a >>>> connection to >>>> the test server with the client port 50072 and it made using the port by >>>> the server from the test instance 001_ssltests_30 impossible. >>> Oh. (kicks self) >> D'oh. >> >>> Should we really be allocating ephemeral server ports in the range >>> 41952..65535? Maybe we should be looking for an unallocated port >>> somewhere below 41952, and above, say, 32767, so we couldn't have a >>> client socket collision. >> Hmm, are there really any standards about how these port numbers >> are used? >> >> I wonder if we don't need to just be prepared to retry the whole >> thing a few times. Even if it's true that "clients" shouldn't >> choose ports below 41952, we still have a small chance of failure >> against a non-Postgres server starting up at the wrong time. > My suggestion would be to not touch the ephemeral port range at all > for these ports. In practice the ephemeral port range is used for > cases where the operating system assigns the port, and the application > doesn't care whot it is. Not for when you want to get a free port, but > want to know in advance which one it is. > > For the PgBouncer test suite we do something similar as the PG its > perl tests do, but there we allocate a port between 10200 and 32768: > https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 > > Sure theoretically it's possible to hit a rare case where another > server starts up at the wrong time, but that chance seems way lower > than a client starting up at the wrong time. Especially since there > aren't many servers that use a port with 5 digits. > > Attached is a patch that updates the port numbers. Makes sense to me. I still think my patch to force TCP mode for the SSL test makes sense as well. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote: >> For the PgBouncer test suite we do something similar as the PG its >> perl tests do, but there we allocate a port between 10200 and 32768: >> https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 > Makes sense to me. > I still think my patch to force TCP mode for the SSL test makes sense as > well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. regards, tom lane
Hello, 07.06.2024 17:25, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I still think my patch to force TCP mode for the SSL test makes sense as >> well. > +1 to both things. If that doesn't get the failure rate down to an > acceptable level, we can look at the retry idea. I'd like to add that the kerberos/001_auth test also suffers from the port conflict, but slightly differently. Look for example at [1]: krb5kdc.log contains: Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network... Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853 Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853) Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network As far as I can see, the port for kdc is chosen by PostgreSQL::Test::Kerberos, via PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP port availability (with can_bind()), but not for UDP, so this increases the probability of the conflict for this test (a similar failure: [2]). Although we can also find a failure with TCP: [3] (It's not clear to me, what processes can use UDP ports while testing, but maybe those buildfarm animals are running on the same logical machine simultaneously?) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 Best regards, Alexander
On 2024-07-08 Mo 8:00 AM, Alexander Lakhin wrote: > Hello, > > 07.06.2024 17:25, Tom Lane wrote: >> Andrew Dunstan <andrew@dunslane.net> writes: >>> I still think my patch to force TCP mode for the SSL test makes >>> sense as >>> well. >> +1 to both things. If that doesn't get the failure rate down to an >> acceptable level, we can look at the retry idea. I have push patches for both of those (i.e. start SSL test nodes in TCP mode and change the range of ports we allocate server ports from) I didn't see this email until after I had pushed them. > > I'd like to add that the kerberos/001_auth test also suffers from the > port > conflict, but slightly differently. Look for example at [1]: > krb5kdc.log contains: > Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): > setting up network... > Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): > Address already in use - Cannot bind server socket on 127.0.0.1.55853 > Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): > Failed setting up a UDP socket (for 127.0.0.1.55853) > Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): > Address already in use - Error setting up network > > As far as I can see, the port for kdc is chosen by > PostgreSQL::Test::Kerberos, via > PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP > port availability (with can_bind()), but not for UDP, so this increases > the probability of the conflict for this test (a similar failure: [2]). > Although we can also find a failure with TCP: [3] > > (It's not clear to me, what processes can use UDP ports while testing, > but maybe those buildfarm animals are running on the same logical > machine simultaneously?) > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 > [2] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 > [3] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 > > Let's see if this persists now we are testing for free ports in a different range, not the range usually used for ephemeral ports. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com