Thread: ssl tests fail due to TCP port conflict

ssl tests fail due to TCP port conflict

From

Alexander Lakhin

Date:

05 June 2024, 13:00:00

Hello hackers,

As buildfarm shows, ssl tests are not stable enough when running via meson.
I can see the following failures for the last 90 days:
REL_16_STABLE:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-12%2023%3A15%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-21%2000%3A35%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2011%3A15%3A31
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-16%2016%3A10%3A45

master:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-08%2011%3A19%3A42
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-11%2022%3A23%3A28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-03-17%2023%3A03%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-20%2009%3A21%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-20%2016%3A53%3A27
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-04-07%2012%3A25%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-08%2019%3A50%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-19%2021%3A24%3A30
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-04-22%2006%3A17%3A13
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-04-29%2023%3A27%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-30%2000%3A24%3A28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-06-04%2011%3A20%3A07

All the failures are caused by the server inability to restart with a
previously chosen random port on TCP/IP. For example:
2024-06-04 11:30:40.227 UTC [3373644][postmaster][:0] LOG:  starting PostgreSQL 17beta1 on x86_64-linux, compiled by 
clang-13.0.1-11, 64-bit
2024-06-04 11:30:40.231 UTC [3373644][postmaster][:0] LOG: listening on Unix socket "/tmp/tUmT8ItNQ2/.s.PGSQL.60362"
2024-06-04 11:30:40.337 UTC [3373798][startup][:0] LOG:  database system was shut down at 2024-06-04 11:21:25 UTC
...
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG:  starting PostgreSQL 17beta1 on x86_64-linux, compiled by 
clang-13.0.1-11, 64-bit
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] LOG:  could not bind IPv4 address "127.0.0.1": Address already in
use
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] HINT:  Is another postmaster already running on port 60362? If 
not, wait a few seconds and retry.
2024-06-04 11:30:45.273 UTC [3376046][postmaster][:0] WARNING: could not create listen socket for "127.0.0.1"

I've managed to reproduce the failure locally with the following change:
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -149,7 +149,7 @@ INIT
         $ENV{PGDATABASE} = 'postgres';

         # Tracking of last port value assigned to accelerate free port lookup.
-       $last_port_assigned = int(rand() * 16384) + 49152;
+       $last_port_assigned = int(rand() * 1024) + 49152;

and multiplying one of the tests.
for i in `seq 50`; do cp .../src/test/ssl/t/001_ssltests.pl \
   .../src/test/ssl/t/001_ssltests_$i.pl; \
   sed -E "s|('t/001_ssltests.pl',)|\1\n't/001_ssltests_$i.pl',|" -i \
     .../src/test/ssl/meson.build; done

Then `meson test --suite ssl` fails for me as below:
...
26/53 postgresql:ssl / ssl/001_ssltests_26         OK 9.03s   205 subtests passed
27/53 postgresql:ssl / ssl/001_ssltests_18         ERROR 3.55s   (exit status 255 or signal 127 SIGinvalid)
 >>> OPENSSL=/usr/bin/openssl ...
28/53 postgresql:ssl / ssl/001_ssltests_25         OK 10.98s   205 subtests passed
29/53 postgresql:ssl / ssl/001_ssltests_24         OK 10.84s   205 subtests passed

testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log contains:
2024-06-05 11:51:22.005 UTC [710541] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use
2024-06-05 11:51:22.005 UTC [710541] HINT:  Is another postmaster already running on port 49632? If not, wait a few 
seconds and retry.

`grep '\b49632\b' -r testrun/` finds:
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Checking port 49632
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:# Found port 49632
testrun/ssl/001_ssltests_18/log/regress_log_001_ssltests_18:Connection string: port=49632 host=/tmp/sp_VLbpjJF
testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:18.896 UTC [710082] LOG:  listening on
Unix
 
socket "/tmp/sp_VLbpjJF/.s.PGSQL.49632"
testrun/ssl/001_ssltests_18/log/001_ssltests_18_primary.log:2024-06-05 11:51:22.005 UTC [710541] HINT:  Is another 
postmaster already running on port 49632? If not, wait a few seconds and retry.
...
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Checking port 49632
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:# Found port 49632
testrun/ssl/001_ssltests_23/log/regress_log_001_ssltests_23:Connection string: port=49632 host=/tmp/3lxVDNzuGC
testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:13.337 UTC [708377] LOG:  listening on
Unix
 
socket "/tmp/3lxVDNzuGC/.s.PGSQL.49632"
testrun/ssl/001_ssltests_23/log/001_ssltests_23_primary.log:2024-06-05 11:51:14.333 UTC [708715] LOG:  listening on
IPv4
 
address "127.0.0.1", port 49632
...

Another case (with psql using the port):
testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Checking port 49448
testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found port 49448
testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.178 UTC [976826] LOG:  listening on
Unix
 
socket "/tmp/GePu6gmouP/.s.PGSQL.49448"
testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 12:20:50.491 UTC [976927] HINT:  Is another 
postmaster already running on port 49448? If not, wait a few seconds and retry.
...
testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG:  
connection received: host=localhost port=49448
The broader excerpt:
2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG:  connection received: host=localhost port=50326
2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG:  could not accept SSL connection: EOF detected
2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG:  connection received: host=localhost port=49420
2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG:  could not accept SSL connection: EOF detected
2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG:  connection received: host=localhost port=49430
2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG:  could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG:  connection received: host=localhost port=49440
2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG:  could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG:  connection received: host=localhost port=49448
2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG:  could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG:  connection received: host=localhost port=49464
2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG:  could not accept SSL connection: tlsv1 alert unknown ca
2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG:  connection received: host=localhost port=49468

Maybe the ssl tests should not be considered failed in case of the TCP port
binding error?

Best regards,
Alexander

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

05 June 2024, 18:10:21

On 2024-06-05 We 09:00, Alexander Lakhin wrote:
>
> Another case (with psql using the port):
> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Checking 
> port 49448
> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found 
> port 49448
> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 
> 12:20:50.178 UTC [976826] LOG:  listening on Unix socket 
> "/tmp/GePu6gmouP/.s.PGSQL.49448"
> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 
> 12:20:50.491 UTC [976927] HINT:  Is another postmaster already running 
> on port 49448? If not, wait a few seconds and retry.
> ...
> testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 
> 12:20:50.491 UTC [976943] [unknown] LOG:  connection received: 
> host=localhost port=49448
> The broader excerpt:
> 2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG:  connection 
> received: host=localhost port=50326
> 2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG:  could not accept 
> SSL connection: EOF detected
> 2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG:  connection 
> received: host=localhost port=49420
> 2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG:  could not accept 
> SSL connection: EOF detected
> 2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG:  connection 
> received: host=localhost port=49430
> 2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG:  could not accept 
> SSL connection: tlsv1 alert unknown ca
> 2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG:  connection 
> received: host=localhost port=49440
> 2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG:  could not accept 
> SSL connection: tlsv1 alert unknown ca
> 2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG:  connection 
> received: host=localhost port=49448
> 2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG:  could not accept 
> SSL connection: tlsv1 alert unknown ca
> 2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG:  connection 
> received: host=localhost port=49464
> 2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG:  could not accept 
> SSL connection: tlsv1 alert unknown ca
> 2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG:  connection 
> received: host=localhost port=49468


I think I see what's going on here. It looks like it's because we start 
the server in unix socket mode, and then switch to using TCP as well.

Can you try your test with this patch applied and see if the problem 
persists? If we start in TCP mode the framework should test for a port 
clash.


cheers


andrew

-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment

ssl-test-tcp-fix.patch

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

05 June 2024, 19:03:00

On 2024-06-05 We 14:10, Andrew Dunstan wrote:
>
> On 2024-06-05 We 09:00, Alexander Lakhin wrote:
>>
>> Another case (with psql using the port):
>> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# 
>> Checking port 49448
>> testrun/ssl/001_ssltests_47/log/regress_log_001_ssltests_47:# Found 
>> port 49448
>> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 
>> 12:20:50.178 UTC [976826] LOG:  listening on Unix socket 
>> "/tmp/GePu6gmouP/.s.PGSQL.49448"
>> testrun/ssl/001_ssltests_47/log/001_ssltests_47_primary.log:2024-06-05 
>> 12:20:50.491 UTC [976927] HINT:  Is another postmaster already 
>> running on port 49448? If not, wait a few seconds and retry.
>> ...
>> testrun/ssl/001_ssltests_48/log/001_ssltests_48_primary.log:2024-06-05 
>> 12:20:50.491 UTC [976943] [unknown] LOG:  connection received: 
>> host=localhost port=49448
>> The broader excerpt:
>> 2024-06-05 12:20:50.415 UTC [976918] [unknown] LOG:  connection 
>> received: host=localhost port=50326
>> 2024-06-05 12:20:50.418 UTC [976918] [unknown] LOG:  could not accept 
>> SSL connection: EOF detected
>> 2024-06-05 12:20:50.433 UTC [976920] [unknown] LOG:  connection 
>> received: host=localhost port=49420
>> 2024-06-05 12:20:50.435 UTC [976920] [unknown] LOG:  could not accept 
>> SSL connection: EOF detected
>> 2024-06-05 12:20:50.447 UTC [976922] [unknown] LOG:  connection 
>> received: host=localhost port=49430
>> 2024-06-05 12:20:50.452 UTC [976922] [unknown] LOG:  could not accept 
>> SSL connection: tlsv1 alert unknown ca
>> 2024-06-05 12:20:50.466 UTC [976933] [unknown] LOG:  connection 
>> received: host=localhost port=49440
>> 2024-06-05 12:20:50.472 UTC [976933] [unknown] LOG:  could not accept 
>> SSL connection: tlsv1 alert unknown ca
>> 2024-06-05 12:20:50.491 UTC [976943] [unknown] LOG:  connection 
>> received: host=localhost port=49448
>> 2024-06-05 12:20:50.497 UTC [976943] [unknown] LOG:  could not accept 
>> SSL connection: tlsv1 alert unknown ca
>> 2024-06-05 12:20:50.513 UTC [976969] [unknown] LOG:  connection 
>> received: host=localhost port=49464
>> 2024-06-05 12:20:50.517 UTC [976969] [unknown] LOG:  could not accept 
>> SSL connection: tlsv1 alert unknown ca
>> 2024-06-05 12:20:50.532 UTC [976971] [unknown] LOG:  connection 
>> received: host=localhost port=49468
>
>
> I think I see what's going on here. It looks like it's because we 
> start the server in unix socket mode, and then switch to using TCP as 
> well.
>
> Can you try your test with this patch applied and see if the problem 
> persists? If we start in TCP mode the framework should test for a port 
> clash.
>
>
>

Hmm, on closer inspection we should have reserved the port anyway.  But 
why is the port "already used" on restart? We haven't previously opened 
a TCP connection on that port (except when checking if we can bind it), 
and instances should be locked against using that port.


  ... wanders away muttering and scratching head ...


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Alexander Lakhin

Date:

05 June 2024, 20:00:01

Hello Andrew,

05.06.2024 21:10, Andrew Dunstan wrote:
>
> I think I see what's going on here. It looks like it's because we start the server in unix socket mode, and then 
> switch to using TCP as well.
>
> Can you try your test with this patch applied and see if the problem persists? If we start in TCP mode the framework

> should test for a port clash.
>

It seems that the failure rate decreased (I guess the patch rules out the
case with two servers choosing the same port), but I still got:

16/53 postgresql:ssl / ssl/001_ssltests_36         OK 15.25s   205 subtests passed
17/53 postgresql:ssl / ssl/001_ssltests_30         ERROR 3.17s   (exit status 255 or signal 127 SIGinvalid)

2024-06-05 19:40:37.395 UTC [414110] LOG:  starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit
2024-06-05 19:40:37.395 UTC [414110] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use
2024-06-05 19:40:37.395 UTC [414110] HINT:  Is another postmaster already running on port 50072? If not, wait a few 
seconds and retry.

`grep '\b50072\b' -r testrun/` yields:
testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG:  
connection received: host=localhost port=50072
(a psql case)

That is, psql from the test instance 001_ssltests_34 opened a connection to
the test server with the client port 50072 and it made using the port by
the server from the test instance 001_ssltests_30 impossible.

Best regards.
Alexander

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

05 June 2024, 20:49:41

On 2024-06-05 We 16:00, Alexander Lakhin wrote:
> Hello Andrew,
>
> 05.06.2024 21:10, Andrew Dunstan wrote:
>>
>> I think I see what's going on here. It looks like it's because we 
>> start the server in unix socket mode, and then switch to using TCP as 
>> well.
>>
>> Can you try your test with this patch applied and see if the problem 
>> persists? If we start in TCP mode the framework should test for a 
>> port clash.
>>
>
> It seems that the failure rate decreased (I guess the patch rules out the
> case with two servers choosing the same port), but I still got:
>
> 16/53 postgresql:ssl / ssl/001_ssltests_36         OK 15.25s   205 
> subtests passed
> 17/53 postgresql:ssl / ssl/001_ssltests_30         ERROR 3.17s (exit 
> status 255 or signal 127 SIGinvalid)
>
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  starting PostgreSQL 17beta1 
> on x86_64-linux, compiled by gcc-13.2.1, 64-bit
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  could not bind IPv4 address 
> "127.0.0.1": Address already in use
> 2024-06-05 19:40:37.395 UTC [414110] HINT:  Is another postmaster 
> already running on port 50072? If not, wait a few seconds and retry.
>
> `grep '\b50072\b' -r testrun/` yields:
> testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 
> 19:40:37.392 UTC [414111] [unknown] LOG:  connection received: 
> host=localhost port=50072
> (a psql case)
>
> That is, psql from the test instance 001_ssltests_34 opened a 
> connection to
> the test server with the client port 50072 and it made using the port by
> the server from the test instance 001_ssltests_30 impossible.
>

Oh. (kicks self)

Should we really be allocating ephemeral server ports in the range 
41952..65535? Maybe we should be looking for an unallocated port 
somewhere below 41952, and above, say, 32767, so we couldn't have a 
client socket collision.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

05 June 2024, 21:17:04

On 2024-06-05 We 16:49, Andrew Dunstan wrote:
>
> On 2024-06-05 We 16:00, Alexander Lakhin wrote:
>> Hello Andrew,
>>
>> 05.06.2024 21:10, Andrew Dunstan wrote:
>>>
>>> I think I see what's going on here. It looks like it's because we 
>>> start the server in unix socket mode, and then switch to using TCP 
>>> as well.
>>>
>>> Can you try your test with this patch applied and see if the problem 
>>> persists? If we start in TCP mode the framework should test for a 
>>> port clash.
>>>
>>
>> It seems that the failure rate decreased (I guess the patch rules out 
>> the
>> case with two servers choosing the same port), but I still got:
>>
>> 16/53 postgresql:ssl / ssl/001_ssltests_36         OK 15.25s 205 
>> subtests passed
>> 17/53 postgresql:ssl / ssl/001_ssltests_30         ERROR 3.17s (exit 
>> status 255 or signal 127 SIGinvalid)
>>
>> 2024-06-05 19:40:37.395 UTC [414110] LOG:  starting PostgreSQL 
>> 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit
>> 2024-06-05 19:40:37.395 UTC [414110] LOG:  could not bind IPv4 
>> address "127.0.0.1": Address already in use
>> 2024-06-05 19:40:37.395 UTC [414110] HINT:  Is another postmaster 
>> already running on port 50072? If not, wait a few seconds and retry.
>>
>> `grep '\b50072\b' -r testrun/` yields:
>> testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 
>> 19:40:37.392 UTC [414111] [unknown] LOG:  connection received: 
>> host=localhost port=50072
>> (a psql case)
>>
>> That is, psql from the test instance 001_ssltests_34 opened a 
>> connection to
>> the test server with the client port 50072 and it made using the port by
>> the server from the test instance 001_ssltests_30 impossible.
>>
>
> Oh. (kicks self)
>
> Should we really be allocating ephemeral server ports in the range 
> 41952..65535? Maybe we should be looking for an unallocated port 
> somewhere below 41952, and above, say, 32767, so we couldn't have a 
> client socket collision.


Except that I see this on my Ubuntu instance:

$ sudo sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999

And indeed I see client sockets in that range.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Tom Lane

Date:

05 June 2024, 21:37:15

Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-06-05 We 16:00, Alexander Lakhin wrote:
>> That is, psql from the test instance 001_ssltests_34 opened a 
>> connection to
>> the test server with the client port 50072 and it made using the port by
>> the server from the test instance 001_ssltests_30 impossible.

> Oh. (kicks self)

D'oh.

> Should we really be allocating ephemeral server ports in the range 
> 41952..65535? Maybe we should be looking for an unallocated port 
> somewhere below 41952, and above, say, 32767, so we couldn't have a 
> client socket collision.

Hmm, are there really any standards about how these port numbers
are used?

I wonder if we don't need to just be prepared to retry the whole
thing a few times.  Even if it's true that "clients" shouldn't
choose ports below 41952, we still have a small chance of failure
against a non-Postgres server starting up at the wrong time.

            regards, tom lane

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

06 June 2024, 01:23:40

On 2024-06-05 We 17:37, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-06-05 We 16:00, Alexander Lakhin wrote:
>>> That is, psql from the test instance 001_ssltests_34 opened a
>>> connection to
>>> the test server with the client port 50072 and it made using the port by
>>> the server from the test instance 001_ssltests_30 impossible.
>> Oh. (kicks self)
> D'oh.
>
>> Should we really be allocating ephemeral server ports in the range
>> 41952..65535? Maybe we should be looking for an unallocated port
>> somewhere below 41952, and above, say, 32767, so we couldn't have a
>> client socket collision.
> Hmm, are there really any standards about how these port numbers
> are used?
>
> I wonder if we don't need to just be prepared to retry the whole
> thing a few times.  Even if it's true that "clients" shouldn't
> choose ports below 41952, we still have a small chance of failure
> against a non-Postgres server starting up at the wrong time.


Yeah, I think you're right. One thing we should do is be careful to use 
the port as soon as possible after we have picked it, to reduce the 
possibility that something else will use it first.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

06 June 2024, 11:25:26

On 2024-06-05 We 16:00, Alexander Lakhin wrote:
> Hello Andrew,
>
> 05.06.2024 21:10, Andrew Dunstan wrote:
>>
>> I think I see what's going on here. It looks like it's because we 
>> start the server in unix socket mode, and then switch to using TCP as 
>> well.
>>
>> Can you try your test with this patch applied and see if the problem 
>> persists? If we start in TCP mode the framework should test for a 
>> port clash.
>>
>
> It seems that the failure rate decreased (I guess the patch rules out the
> case with two servers choosing the same port), but I still got:
>
> 16/53 postgresql:ssl / ssl/001_ssltests_36         OK 15.25s   205 
> subtests passed
> 17/53 postgresql:ssl / ssl/001_ssltests_30         ERROR 3.17s (exit 
> status 255 or signal 127 SIGinvalid)
>
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  starting PostgreSQL 17beta1 
> on x86_64-linux, compiled by gcc-13.2.1, 64-bit
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  could not bind IPv4 address 
> "127.0.0.1": Address already in use
> 2024-06-05 19:40:37.395 UTC [414110] HINT:  Is another postmaster 
> already running on port 50072? If not, wait a few seconds and retry.
>
> `grep '\b50072\b' -r testrun/` yields:
> testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 
> 19:40:37.392 UTC [414111] [unknown] LOG:  connection received: 
> host=localhost port=50072
> (a psql case)
>
> That is, psql from the test instance 001_ssltests_34 opened a 
> connection to
> the test server with the client port 50072 and it made using the port by
> the server from the test instance 001_ssltests_30 impossible.
>
>

After sleeping on it, I still think the patch would be a good thing. 
Your torture test might still show some failures, but the buildfarm 
isn't running those, and it might be enough to eliminate or at least 
substantially reduce buildfarm failures by reducing to almost zero the 
time in which a competing script might grab the port. The biggest 
problem with the current script is apparently that we delay using the 
TCP port by starting the server in Unix socket mode, and only switch to 
using TCP when we restart. If changing that doesn't fix the problem 
we'll have to rethink. If this isn't the cause, though, I would expect 
to have seen similar failures from other test suites.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Jelte Fennema-Nio

Date:

06 June 2024, 22:02:28

On Wed, 5 Jun 2024 at 23:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2024-06-05 We 16:00, Alexander Lakhin wrote:
> >> That is, psql from the test instance 001_ssltests_34 opened a
> >> connection to
> >> the test server with the client port 50072 and it made using the port by
> >> the server from the test instance 001_ssltests_30 impossible.
>
> > Oh. (kicks self)
>
> D'oh.
>
> > Should we really be allocating ephemeral server ports in the range
> > 41952..65535? Maybe we should be looking for an unallocated port
> > somewhere below 41952, and above, say, 32767, so we couldn't have a
> > client socket collision.
>
> Hmm, are there really any standards about how these port numbers
> are used?
>
> I wonder if we don't need to just be prepared to retry the whole
> thing a few times.  Even if it's true that "clients" shouldn't
> choose ports below 41952, we still have a small chance of failure
> against a non-Postgres server starting up at the wrong time.

My suggestion would be to not touch the ephemeral port range at all
for these ports. In practice the ephemeral port range is used for
cases where the operating system assigns the port, and the application
doesn't care whot it is. Not for when you want to get a free port, but
want to know in advance which one it is.

For the PgBouncer test suite we do something similar as the PG its
perl tests do, but there we allocate a port between 10200 and 32768:
https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215

Sure theoretically it's possible to hit a rare case where another
server starts up at the wrong time, but that chance seems way lower
than a client starting up at the wrong time. Especially since there
aren't many servers that use a port with 5 digits.

Attached is a patch that updates the port numbers.

Attachment

v1-0001-Don-t-use-ephemeral-port-range.patch

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

07 June 2024, 13:55:56

On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote:
> On Wed, 5 Jun 2024 at 23:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> On 2024-06-05 We 16:00, Alexander Lakhin wrote:
>>>> That is, psql from the test instance 001_ssltests_34 opened a
>>>> connection to
>>>> the test server with the client port 50072 and it made using the port by
>>>> the server from the test instance 001_ssltests_30 impossible.
>>> Oh. (kicks self)
>> D'oh.
>>
>>> Should we really be allocating ephemeral server ports in the range
>>> 41952..65535? Maybe we should be looking for an unallocated port
>>> somewhere below 41952, and above, say, 32767, so we couldn't have a
>>> client socket collision.
>> Hmm, are there really any standards about how these port numbers
>> are used?
>>
>> I wonder if we don't need to just be prepared to retry the whole
>> thing a few times.  Even if it's true that "clients" shouldn't
>> choose ports below 41952, we still have a small chance of failure
>> against a non-Postgres server starting up at the wrong time.
> My suggestion would be to not touch the ephemeral port range at all
> for these ports. In practice the ephemeral port range is used for
> cases where the operating system assigns the port, and the application
> doesn't care whot it is. Not for when you want to get a free port, but
> want to know in advance which one it is.
>
> For the PgBouncer test suite we do something similar as the PG its
> perl tests do, but there we allocate a port between 10200 and 32768:
> https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215
>
> Sure theoretically it's possible to hit a rare case where another
> server starts up at the wrong time, but that chance seems way lower
> than a client starting up at the wrong time. Especially since there
> aren't many servers that use a port with 5 digits.
>
> Attached is a patch that updates the port numbers.


Makes sense to me.

I still think my patch to force TCP mode for the SSL test makes sense as 
well.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: ssl tests fail due to TCP port conflict

From

Tom Lane

Date:

07 June 2024, 14:25:45

Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote:
>> For the PgBouncer test suite we do something similar as the PG its
>> perl tests do, but there we allocate a port between 10200 and 32768:
>> https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215

> Makes sense to me.

> I still think my patch to force TCP mode for the SSL test makes sense as 
> well.

+1 to both things.  If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.

            regards, tom lane

Re: ssl tests fail due to TCP port conflict

From

Alexander Lakhin

Date:

08 July 2024, 12:00:00

Hello,

07.06.2024 17:25, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I still think my patch to force TCP mode for the SSL test makes sense as
>> well.
> +1 to both things.  If that doesn't get the failure rate down to an
> acceptable level, we can look at the retry idea.

I'd like to add that the kerberos/001_auth test also suffers from the port
conflict, but slightly differently. Look for example at [1]:
krb5kdc.log contains:
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network...
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket

on 127.0.0.1.55853
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for
127.0.0.1.55853)
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network

As far as I can see, the port for kdc is chosen by
PostgreSQL::Test::Kerberos, via
PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP
port availability (with can_bind()), but not for UDP, so this increases
the probability of the conflict for this test (a similar failure: [2]).
Although we can also find a failure with TCP: [3]

(It's not clear to me, what processes can use UDP ports while testing,
but maybe those buildfarm animals are running on the same logical
machine simultaneously?)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19

Best regards,
Alexander

Re: ssl tests fail due to TCP port conflict

From

Andrew Dunstan

Date:

08 July 2024, 19:40:37

On 2024-07-08 Mo 8:00 AM, Alexander Lakhin wrote:
> Hello,
>
> 07.06.2024 17:25, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> I still think my patch to force TCP mode for the SSL test makes 
>>> sense as
>>> well.
>> +1 to both things.  If that doesn't get the failure rate down to an
>> acceptable level, we can look at the retry idea.


I have push patches for both of those (i.e. start SSL test nodes in TCP 
mode and change the range of ports we allocate server ports from)

I didn't see this email until after I had pushed them.


>
> I'd like to add that the kerberos/001_auth test also suffers from the 
> port
> conflict, but slightly differently. Look for example at [1]:
> krb5kdc.log contains:
> Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): 
> setting up network...
> Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): 
> Address already in use - Cannot bind server socket on 127.0.0.1.55853
> Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): 
> Failed setting up a UDP socket (for 127.0.0.1.55853)
> Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): 
> Address already in use - Error setting up network
>
> As far as I can see, the port for kdc is chosen by
> PostgreSQL::Test::Kerberos, via
> PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP
> port availability (with can_bind()), but not for UDP, so this increases
> the probability of the conflict for this test (a similar failure: [2]).
> Although we can also find a failure with TCP: [3]
>
> (It's not clear to me, what processes can use UDP ports while testing,
> but maybe those buildfarm animals are running on the same logical
> machine simultaneously?)
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15
> [2] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07
> [3] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19
>
>

Let's see if this persists now we are testing for free ports in a 
different range, not the range usually used for ephemeral ports.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com