Re: ssl tests fail due to TCP port conflict - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: ssl tests fail due to TCP port conflict
Msg-id: c78b5fe5-0eb3-4117-9f81-a0725527bad5@dunslane.net
In response to: Re: ssl tests fail due to TCP port conflict (Alexander Lakhin <exclusion@gmail.com>)
List: pgsql-hackers

On 2024-06-05 We 16:00, Alexander Lakhin wrote:
> Hello Andrew,
>
> 05.06.2024 21:10, Andrew Dunstan wrote:
>>
>> I think I see what's going on here. It looks like it's because we 
>> start the server in unix socket mode, and then switch to using TCP as 
>> well.
>>
>> Can you try your test with this patch applied and see if the problem 
>> persists? If we start in TCP mode the framework should test for a 
>> port clash.
>>
>
> It seems that the failure rate decreased (I guess the patch rules out the
> case with two servers choosing the same port), but I still got:
>
> 16/53 postgresql:ssl / ssl/001_ssltests_36         OK 15.25s   205 subtests passed
> 17/53 postgresql:ssl / ssl/001_ssltests_30         ERROR 3.17s (exit status 255 or signal 127 SIGinvalid)
>
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  starting PostgreSQL 17beta1 on x86_64-linux, compiled by gcc-13.2.1, 64-bit
> 2024-06-05 19:40:37.395 UTC [414110] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use
> 2024-06-05 19:40:37.395 UTC [414110] HINT:  Is another postmaster already running on port 50072? If not, wait a few seconds and retry.
>
> `grep '\b50072\b' -r testrun/` yields:
> testrun/ssl/001_ssltests_34/log/001_ssltests_34_primary.log:2024-06-05 19:40:37.392 UTC [414111] [unknown] LOG:  connection received: host=localhost port=50072
> (a psql case)
>
> That is, psql from the test instance 001_ssltests_34 opened a connection to
> its test server using client port 50072, which made it impossible for the
> server from the test instance 001_ssltests_30 to use that port.
>
>
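
For reference, the collision described above (a client's kernel-assigned source port blocking a later server bind on the same address) can be reproduced in isolation. This is a minimal Python sketch, not code from the test framework; the socket names are purely illustrative stand-ins for the two test instances and psql.

```python
import socket

HOST = "127.0.0.1"

# Stand-in for the 001_ssltests_34 server: listen on any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, 0))
server.listen(1)

# Stand-in for psql: the kernel picks an ephemeral source port on connect().
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client_port = client.getsockname()[1]

# Stand-in for the 001_ssltests_30 server trying to bind that same port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind((HOST, client_port))
    bind_failed = False
except OSError as exc:
    # Expected path: the port is held by the client side of the connection.
    bind_failed = True
    print(f"could not bind port {client_port}: {exc}")
finally:
    second.close()
    client.close()
    server.close()
```

The point is only that the port a server was assigned can be taken by an outgoing client connection, not just by another listening server.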

After sleeping on it, I still think the patch would be a good thing. 
Your torture test might still show some failures, but the buildfarm 
isn't running anything like that, and the patch might be enough to 
eliminate, or at least substantially reduce, buildfarm failures by 
shrinking to almost zero the window in which a competing script might 
grab the port. The biggest problem with the current script is apparently 
that we delay using the TCP port by starting the server in Unix socket 
mode, and only switch to using TCP when we restart. If changing that 
doesn't fix the problem we'll have to rethink. If this isn't the cause, 
though, I would expect to have seen similar failures from other test 
suites.
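
To make the window concrete, here is a rough Python sketch (not the Perl test framework, and every name in it is hypothetical): a port is probed as free and released, a competing process binds it before the real server gets around to using TCP, and the late bind fails.

```python
import socket

HOST = "127.0.0.1"

def probe_free_port():
    """Ask the kernel for a free port, then release it again.
    Nothing holds the port after this returns, so a race window opens."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((HOST, 0))
        return s.getsockname()[1]

def can_bind(port):
    """Return True if a fresh socket can bind the given port right now."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((HOST, port))
        return True
    except OSError:
        return False

# Step 1: the test script picks a port but does not listen on it yet
# (comparable to starting the server in Unix socket mode first).
port = probe_free_port()

# Step 2: during the delay, a competing script grabs the same port.
competitor = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
competitor.bind((HOST, port))
competitor.listen(1)

# Step 3: the original script finally tries to use TCP on that port.
late_bind_ok = can_bind(port)
competitor.close()

print(f"port {port}: late bind succeeded = {late_bind_ok}")
```

The shorter the gap between steps 1 and 3, the less chance a competitor has to get in at step 2, which is all the patch is trying to achieve.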


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com



