Re: Patch: Implement failover on libpq connect level. - Mailing list pgsql-hackers

From Victor Wagner
Subject Re: Patch: Implement failover on libpq connect level.
Msg-id 20151015094321.GB15844@wagner.pp.ru
In response to Re: Patch: Implement failover on libpq connect level.  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On 2015.10.15 at 17:13:46 +0800, Craig Ringer wrote:

> On 14 October 2015 at 18:41, Victor Wagner <vitus@wagner.pp.ru> wrote:
> 
> > 5. Added new parameter readonly=boolean. If this parameter is false (the
> > default) upon successful connection check is performed that backend is
> > not in the recovery state. If so, connection is not considered usable
> > and next host is tried.
> 
> What constitutes "failed" as far as this is concerned?

"Failed" means that "select pg_is_in_recovery()" returns true.
The only reason for adding this parameter is to state whether or not
we can use a connection to a warm-backup slave.
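
A rough sketch of that selection logic, for illustration only. This is not
the code from the patch: pick_host and the probe callback are hypothetical
names, and probe stands in for "connect to the host and run
select pg_is_in_recovery()".

```python
from typing import Callable, Iterable, Optional

# probe(host) is a hypothetical callback standing in for
# "connect to host and run SELECT pg_is_in_recovery()".
# It returns True/False for the query result, or None if the
# connection attempt itself failed.
Probe = Callable[[str], Optional[bool]]


def pick_host(hosts: Iterable[str], probe: Probe,
              readonly: bool = False) -> Optional[str]:
    """Return the first usable host, or None if no host qualifies."""
    for host in hosts:
        in_recovery = probe(host)
        if in_recovery is None:
            continue  # connection failed, try the next host
        if in_recovery and not readonly:
            continue  # backend is in recovery, but a read-write one is needed
        return host
    return None
```

With readonly left at its default of false, a host that is in recovery is
skipped just like an unreachable one; with readonly=true it is accepted.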

> Like the PgJDBC approach I wonder how much this'll handle in practice
> and how it'll go with less clear-cut failures like disk-full on a
> replica that's a member of the failover pool, long delays before
> no-route-to-host errors, dns problems, etc.

Really, I don't think that disk-full and similar node errors are
the client library's problem. It is the cluster management software that
should check for these errors and shut down faulty nodes, or at least
stop them from listening for connections.

As for long timeouts on network failures, they can be handled by
the existing connect_timeout parameter.

Maybe it is worth the effort to rewrite the handling of this parameter. Now it
is only used in the blocking functions which call connectDBComplete
(PQconnectdb[Params], PQreset and PQsetdbLogin). Maybe it should be hidden inside
the PQconnectPoll state machine, so event-driven applications could get its
handling for free.
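
To make the idea concrete, here is a minimal sketch of a poll function that
enforces the deadline itself. It does not call libpq: DeadlinePoller and its
methods are hypothetical, and the status strings only mirror the names of
libpq's polling statuses.

```python
import time
from typing import Callable

# Status names mirror libpq's PostgresPollingStatusType values;
# nothing here calls libpq, this is only an illustration.
POLLING_READING, POLLING_WRITING, POLLING_OK, POLLING_FAILED = (
    "reading", "writing", "ok", "failed")


class DeadlinePoller:
    """Wrap a poll function so that the timeout is checked inside it."""

    def __init__(self, poll: Callable[[], str], timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self._poll = poll
        self._clock = clock
        self._deadline = clock() + timeout

    def poll(self) -> str:
        # The deadline check lives in the state machine itself, so the
        # caller does not need separate timeout bookkeeping.
        if self._clock() >= self._deadline:
            return POLLING_FAILED
        return self._poll()

    def remaining(self) -> float:
        """Seconds left; usable as the timeout for select()/poll()."""
        return max(0.0, self._deadline - self._clock())
```

The event loop still has to wake up when the deadline passes, which is why
the sketch exposes remaining() for use as the wait timeout.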

Really, my initial proposal included hostorder=parallel specifically for
this situation. But I've decided not to implement it right now, because
it requires changes in the libpq public API.
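
Just to show what "parallel" would mean here, a sketch using threads. It is
only an illustration of the idea, not a proposal for how libpq should do it:
first_reachable and try_connect are hypothetical names, and try_connect
stands in for a single connection attempt to one host.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence


def first_reachable(hosts: Sequence[str],
                    try_connect: Callable[[str], bool]) -> Optional[str]:
    """Try all hosts concurrently; return the first one that connects."""
    if not hosts:
        return None
    pool = ThreadPoolExecutor(max_workers=len(hosts))
    try:
        futures = {pool.submit(try_connect, h): h for h in hosts}
        # as_completed yields attempts in the order they finish, so a
        # slow or dead host does not delay a fast one.
        for fut in as_completed(futures):
            if fut.exception() is None and fut.result():
                return futures[fut]
        return None
    finally:
        # Do not wait for slow attempts that are still in flight.
        pool.shutdown(wait=False)
```

A host that hangs until its timeout then costs nothing as long as some other
host answers quickly.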

> Had any chance to simulate network failures?

Testing failover code is a really tricky thing. All the testing I've done so far
was in the mode "Shut down one of the nodes and see what happens".
It is not too hard to simulate network failures using iptables, but
it requires considerable planning of the test scenario.


--                                   Victor Wagner <vitus@wagner.pp.ru>


