Thread: Supporting TCP_SYNCNT in libpq

Supporting TCP_SYNCNT in libpq

From
Francesco Canovai
Date:
This patch introduces support for a `tcp_syn_count` parameter in
libpq, allowing control over the number of SYN retransmissions when
initiating a connection.
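
For readers unfamiliar with the socket option named in the thread title, here is a minimal sketch of how a SYN retransmission limit can be applied to a socket on Linux. The helper names `set_syn_count` and `get_syn_count` are hypothetical and are not taken from the patch; only `setsockopt`, `getsockopt` and the Linux-specific `TCP_SYNCNT` option are real.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the patch): cap the number of SYN
 * retransmissions on a socket that has not called connect() yet.
 * TCP_SYNCNT is Linux-specific. Returns 0 on success, -1 on failure
 * with errno set by setsockopt().
 */
static int
set_syn_count(int fd, int syn_count)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_SYNCNT,
					  &syn_count, sizeof(syn_count));
}

/* Hypothetical helper: read the limit back; returns -1 on failure. */
static int
get_syn_count(int fd)
{
	int			value = 0;
	socklen_t	len = sizeof(value);

	if (getsockopt(fd, IPPROTO_TCP, TCP_SYNCNT, &value, &len) < 0)
		return -1;
	return value;
}
```

The option has to be set before `connect()` is called, since it only influences the handshake.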

The primary goal is to prevent the walreceiver from getting stuck
resending SYNs for an extended period, up to the limit set by
`net.ipv4.tcp_syn_retries` (6 retries by default, about 127 seconds),
in the event of network disruptions.
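
The 127-second figure can be reproduced with a small calculation. The sketch below assumes an initial retransmission timeout of 1 second that doubles after every unanswered SYN, and ignores the kernel's upper bound on the timeout, so it is only meaningful for small retry counts. The function name `syn_give_up_seconds` is made up for illustration.

```c
/*
 * Seconds until connect() gives up when no SYN is ever answered.
 * Assumption: the first timeout is 1 second and doubles each time;
 * the kernel's cap on the timeout is not modelled here.
 */
static unsigned int
syn_give_up_seconds(unsigned int retries)
{
	unsigned int total = 0;
	unsigned int rto = 1;

	/* the initial SYN and each retransmission wait one timeout period */
	for (unsigned int i = 0; i <= retries; i++)
	{
		total += rto;
		rto *= 2;
	}
	return total;
}
```

With 6 retries the waits are 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds; with 2 retries the total drops to 7 seconds.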

A specific scenario where this can occur is during a failover in Kubernetes:

* The primary node fails, a standby is promoted, and other standbys
attempt to reconnect to the service representing the new primary.
* The `primary_conninfo` of the standby points to a service, usually
managed via iptables rules.
* If the walreceiver's initial SYN is dropped due to outdated rules,
the connection may remain stranded until the system timeout is
reached.
* As a result, a second standby may reattach after a couple of
minutes. In the case of synchronous replication, this can block the
writes from the application.

In this scenario, `tcp_user_timeout` could close a connection that is
still retrying SYNs (the documentation does not suggest this, but it
works in practice). However, that parameter affects the entire
connection, not just its establishment. `connect_timeout` doesn't work
with `PQconnectPoll`, so it won't keep the walreceiver from waiting
out the full system timeout.

Thank you,
Francesco

Re: Supporting TCP_SYNCNT in libpq

From
Gabriele Bartolini
Date:
Ciao Francesco,

On Mon, 17 Mar 2025 at 09:19, Francesco Canovai <francesco.canovai@enterprisedb.com> wrote:
> This patch introduces support for a `tcp_syn_count` parameter in
> libpq, allowing control over the number of SYN retransmissions when
> initiating a connection.
>
> The primary goal is to prevent the walreceiver from getting stuck
> resending SYNs for an extended period, up to the limit set by
> `net.ipv4.tcp_syn_retries` (6 retries by default, about 127 seconds),
> in the event of network disruptions.

Thanks for bringing this up and providing this straightforward patch. Configuring this TCP setting on the WAL receiver side will give us more precise control over connections, specifically over replication behaviour. This is especially important in Kubernetes environments with operators like CloudNativePG, where modifying this setting at a lower level may not be feasible due to separation of duties and permission constraints.

Ciao,
Gabriele
--
Gabriele Bartolini
VP, Chief Architect, Kubernetes

Re: Supporting TCP_SYNCNT in libpq

From
Andres Freund
Date:
Hi,

On 2025-03-13 09:37:37 +0100, Francesco Canovai wrote:
> In this scenario, `tcp_user_timeout` could close a connection that is
> still retrying SYNs (the documentation does not suggest this, but it
> works in practice). However, that parameter affects the entire
> connection, not just its establishment. `connect_timeout` doesn't work
> with `PQconnectPoll`, so it won't keep the walreceiver from waiting
> out the full system timeout.

Why not implement timeout support for PQconnectPoll?

Greetings,

Andres Freund



Re: Supporting TCP_SYNCNT in libpq

From
Peter Eisentraut
Date:
On 18.03.25 21:18, Andres Freund wrote:
> On 2025-03-13 09:37:37 +0100, Francesco Canovai wrote:
>> In this scenario, `tcp_user_timeout` could close a connection that is
>> still retrying SYNs (the documentation does not suggest this, but it
>> works in practice). However, that parameter affects the entire
>> connection, not just its establishment. `connect_timeout` doesn't work
>> with `PQconnectPoll`, so it won't keep the walreceiver from waiting
>> out the full system timeout.
> 
> Why not implement timeout support for PQconnectPoll?

Yes, that seems better.  It is currently documented that this is 
intentionally not supported (see just above [0]), but we should find a 
way to solve that.


[0]: https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQSOCKETPOLL
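
To make the idea of a timeout around a polled connection attempt concrete, here is a rough sketch of the general technique using plain POSIX sockets rather than libpq: start a non-blocking connect, then wait for completion while tracking a caller-side deadline. The names `now_ms` and `connect_with_deadline` are hypothetical. A libpq version would presumably drive `PQconnectStart`, `PQconnectPoll` and `PQsocket` in the same loop shape, but that mapping is an assumption and not something proposed in this thread.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>		/* struct sockaddr_in, for callers */
#include <poll.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Current time in milliseconds on the monotonic clock. */
static long long
now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: start a non-blocking connect and wait for it to
 * finish, but never longer than timeout_ms. Returns the connected
 * socket, or -1 on failure or when the deadline passes.
 */
static int
connect_with_deadline(const struct sockaddr *addr, socklen_t addrlen,
					  int timeout_ms)
{
	long long	deadline = now_ms() + timeout_ms;
	int			fd = socket(addr->sa_family, SOCK_STREAM, 0);
	int			flags;

	if (fd < 0)
		return -1;

	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
	{
		close(fd);
		return -1;
	}

	if (connect(fd, addr, addrlen) == 0)
		return fd;				/* connected immediately */
	if (errno != EINPROGRESS)
	{
		close(fd);
		return -1;
	}

	for (;;)
	{
		struct pollfd pfd = {.fd = fd, .events = POLLOUT};
		long long	remaining = deadline - now_ms();
		int			rc;

		if (remaining <= 0)
		{
			close(fd);			/* deadline reached: give up */
			return -1;
		}

		rc = poll(&pfd, 1, (int) remaining);
		if (rc < 0 && errno != EINTR)
		{
			close(fd);
			return -1;
		}
		if (rc > 0)
		{
			int			err = 0;
			socklen_t	len = sizeof(err);

			/* the handshake finished; SO_ERROR tells us how */
			if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 ||
				err != 0)
			{
				close(fd);
				return -1;
			}
			return fd;
		}
		/* rc == 0 or EINTR: loop and re-check the deadline */
	}
}
```

Unlike a SYN retransmission limit, a deadline of this kind bounds the wall-clock time of the attempt directly, regardless of how the kernel schedules retransmissions.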