On Thu, 2026-03-05 at 14:59 +0000, Evgeny Kuzin wrote:
> We run PostgreSQL clusters with streaming replication. After a failover, the old primary
> becomes a standby and vice versa. The challenge is: how do clients find the new primary?
>
> Current options:
> 1. Update DNS on every failover - operationally complex, TTL delays, requires automation

Your proposal would also suffer from TTL delays in the case of a cluster reconfiguration.

> 2. Consul/etcd - adds operational complexity and another failure domain
> 3. Multiple hosts in connection string - requires application changes when cluster
> topology changes (e.g., adding a new standby)
>
> The proposed approach:
> * Single A-record (db.internal) pointing to all cluster member IPs
> * Clients connect with
> host=db.internal target_session_attrs=read-write
> * libpq tries each IP until it finds the primary
>
> IIUC this is how JDBC's targetServerType=primary works - it iterates through all resolved
> addresses. The "useless connection attempts" are actually the feature: it's probing to
> find the right server, same as when you specify multiple hosts explicitly.
> The only difference from host=pg1,pg2,pg3 is that DNS provides the list instead of the
> connection string. From libpq's perspective, why should it matter where the address list came from?

I see the point of your proposal.

One example of what Tom worries about is "localhost" resolving to both "127.0.0.1" and "::1",
a very common case. With the proposed change, any connection attempt to "localhost" that fails
would now take twice as long to fail. Also, if the problem is authentication, the server would
perform two authentication attempts. That is a clear regression that may affect many people.
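
To make the concern concrete, here is a toy model in Python. It is not libpq
code and does not open any connections; probe() and its try_all_addresses flag
are hypothetical names I made up for illustration. It assumes the simplified
picture described above: without the change, a failed attempt ends the try for
that host name, and with the change every resolved address is tried in turn.

```python
from typing import Callable, List, Optional, Tuple


def probe(
    addresses: List[str],
    try_connect: Callable[[str], bool],
    try_all_addresses: bool,
) -> Tuple[Optional[str], int]:
    """Toy model of connecting to one host name that resolved to several addresses.

    Returns (address that accepted the connection or None, number of attempts).
    try_connect is a stand-in for a full connection attempt, including
    authentication; it is an assumption of this sketch, not a libpq function.
    """
    attempts = 0
    for addr in addresses:
        attempts += 1
        if try_connect(addr):
            return addr, attempts
        if not try_all_addresses:
            # Simplified "old" policy: give up on this host after one failure.
            break
    return None, attempts


# "localhost" commonly resolves to both of these addresses.
localhost = ["127.0.0.1", "::1"]

# A failure that is the same on every address, e.g. a wrong password.
def always_fails(addr: str) -> bool:
    return False

one_try = probe(localhost, always_fails, try_all_addresses=False)
all_tries = probe(localhost, always_fails, try_all_addresses=True)
print(one_try, all_tries)
```

In this model the failing case costs one attempt under the first policy and two
under the second, so the time to fail and the number of authentication attempts
the server sees both double for a two-address "localhost".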
The question is whether the overall benefits of your proposal (which certainly makes sense
in a setup like you describe) would be worth a performance and resource usage regression like
the one I described above. Or can you see a way to modify your approach so that that problem
can be avoided?
Yours,
Laurenz Albe