Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur - Mailing list pgsql-hackers

From Tels
Subject Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Date
Msg-id e97450e4ecfad32cae2da2858b01c225.squirrel@sm.webmail.pair.com
In response to Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
List pgsql-hackers
On Thu, May 18, 2017 10:24 pm, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael
>> Paquier
>> On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > Because why?
>>
>> Because it is critical to let the user know as well *why* an error
>> happened. Imagine that this feature is used with multiple nodes, all
>> primaries. If a DB admin busted the credentials in one of them then all
>> the load would be redirected on the other nodes, without knowing what is
>> actually causing the error. Then the node where the credentials have been
>> changed would just run idle, and the application would be unaware of
>> that.
>
> In that case, the DBA can know the authentication errors in the server log
> of the idle instance.
>
> I'm sorry to repeat myself, but libpq connection failover is the feature
> for HA.  So I believe what to prioritize is HA.

I'm in agreement here. The feature sounds very useful to me for HA, but
HA means it needs to work as reliably as possible, not just "often enough"
:)

If one goes to the length of running multiple instances, there is surely
also monitoring in place to see if they are healthy or under load/stress.

The beauty of having libpq connect to multiple hosts until one works is
that you can handle temporary unavailability (e.g. one instance is
restarted for patching) and general failure (one instance goes down due to
whatever error) in one place, without having to implement this logic
in every app (database user connector).
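
To make the "one place" concrete: since PG10, libpq accepts a
comma-separated list of hosts (and ports) in the connection string and
tries them in order. A minimal sketch of building such a string follows.
The helper name, the host names and the database name are made up for
illustration, and it assumes none of the values need quoting.

```python
def multi_host_conninfo(hosts, port=5432, dbname="postgres"):
    """Build a keyword/value conninfo string listing several hosts.

    Hypothetical helper: libpq (PG10 and later) tries the hosts in the
    order given. Values are assumed to contain no spaces or quotes.
    """
    host_list = ",".join(hosts)
    # One port entry per host, in the same order as the host list.
    port_list = ",".join(str(port) for _ in hosts)
    return f"host={host_list} port={port_list} dbname={dbname}"


# Placeholder host names, not taken from the thread.
conninfo = multi_host_conninfo(
    ["a.example.com", "b.example.com", "c.example.com"]
)
print(conninfo)
```

Any libpq-based client should be able to take a string of this shape, so
the application code itself needs no retry loop over hosts.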


Imagine, for instance, that you have 3 hosts A, B and C.

There are two scenarios here: Everyone tries "A,B,C", or everyone tries
random permutations like "A,C,B" or "B,C,A".

In the first scenario, almost all connections would go to A until it no
longer accepts connections, then they spill over to B.

In the second one, each host gets 1/3 of all connections equally.

Now imagine that B is down, either for a brief period or permanently.

If libpq stops when it fails to connect to B, then the scenarios play out
like this:

1: Almost all connections run on A, but a random subset breaks when
spillover to B does not happen. Host C is idle.

2: 2/3 of all connections just work, 1/3 just breaks. Both A and C have a
higher load than usual.

If libpq skips B and continues, then we have instead:

1: Almost all connections run on A, but a random subset spills over to C
after skipping B.

2: All connections run on A or C, B is always skipped if it appears before
A or C.
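
For the random-permutation scenario (2), the two behaviours can be
checked with a small enumeration. This is only a toy model with made-up
function names, not libpq code: it assumes every one of the six host
orders is equally likely and that B fails every connection attempt.

```python
from collections import Counter
from itertools import permutations


def connect(order, down, skip_failed):
    """Return the host a client lands on, or None if the attempt fails."""
    for host in order:
        if host not in down:
            return host
        if not skip_failed:
            return None  # give up at the first host that fails
    return None  # every host in the list failed


def distribution(orders, down, skip_failed):
    """Count how many of the given host orders end up on each host."""
    return Counter(connect(order, down, skip_failed) for order in orders)


orders = list(permutations(("A", "B", "C")))  # 6 equally likely orders

stop = distribution(orders, {"B"}, skip_failed=False)
skip = distribution(orders, {"B"}, skip_failed=True)

print("stop at B:", dict(stop))
print("skip B:   ", dict(skip))
```

In this model, stopping at B loses the two orders that start with B (1/3
of the clients), while skipping B leaves no failed attempts and splits the
clients evenly between A and C.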

The admin would see on the monitoring that B is offline (briefly or
permanently) and would need to correct it.

From the user's perspective, the second variant (skipping B) is smooth,
while the first (stopping at B) breaks randomly. A "database user" would
not really want to know that B is down or why, it would just expect to get
a working DB connection.

That's my 0.02 € anyway.

Tels
