Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur - Mailing list pgsql-hackers

From Tels
Subject Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur
Date
Msg-id 3124955edc7b555878be4e2a98b6ef66.squirrel@sm.webmail.pair.com
In response to Re: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Moin,

On Wed, May 17, 2017 12:34 pm, Robert Haas wrote:
> On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> What do you think of the following cases?  Don't you want to connect to
>> other servers?
>>
>> * The DBA shuts down the database.  The server takes a long time to do
>> checkpointing.  During the shutdown checkpoint, libpq tries to connect
>> to the server and receive an error "the database system is shutting
>> down."
>>
>> * The former primary failed and now is trying to start as a standby,
>> catching up by applying WAL.  During the recovery, libpq tries to
>> connect to the server and receive an error "the database system is
>> performing recovery."
>>
>> * The database server crashed due to a bug.  Unfortunately, the server
>> takes unexpectedly long time to shut down because it takes many seconds
>> to write the stats file (as you remember, Tom-san experienced 57 seconds
>> to write the stats file during regression tests.)  During the stats file
>> write, libpq tries to connect to the server and receive an error "the
>> database system is shutting down."
>>
>> These are equivalent to server failure.  I believe we should prioritize
>> rescuing errors during operation over detecting configuration errors.
>
> Yeah, you have a point.  I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.
>
> Anybody else want to weigh in with an opinion here?

Hm, to me the feature needs to be reliable (for certain values of
reliable) to be useful.

Consider that you have X hosts (redundancy), and a lot of applications
that want a stable connection to the one that (still) works, whichever
that is.

You can then either:

1. make one primary, the other standby(s) and play DNS tricks or similar
to make it appear that there is only one working host, and have all apps
connect to the "one host" (and reconnect to it upon failure)

2. let each app try each host until it finds a working one; if the
connection breaks, retry with the next host

3. or use libpq and let it try the hosts for you.
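
For illustration, option 2 could look roughly like the sketch below in
application code. The names connect_first_working and connect are
hypothetical; connect stands in for whatever driver call the app uses, and
the sketch assumes that any exception means "try the next host".

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Conn = TypeVar("Conn")


def connect_first_working(
    hosts: Iterable[str],
    connect: Callable[[str], Conn],
) -> Tuple[str, Conn]:
    """Try each host in order; return (host, connection) for the first success.

    `connect` is a hypothetical stand-in for the application's driver call.
    It must raise an exception when the host cannot be used.
    """
    errors: List[str] = []
    for host in hosts:
        try:
            return host, connect(host)
        except Exception as exc:  # assumption: every failure means "try next host"
            errors.append(f"{host}: {exc}")
    # No host accepted the connection; report what each one said.
    raise ConnectionError("no working host found: " + "; ".join(errors))
```

Every app would have to carry (and test) a loop like this itself, which is
the duplication option 3 is meant to avoid.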

However, if I understand it correctly, #3 only works reliably in certain
cases (e.g. host down), but not if the host is "sort of down". In that
case each app would again need code to retry different hosts until it
finds a working one, instead of letting libpq do the work.

That makes it hard to deploy #3 in practice, as you might just as easily
code up #1 or #2 and call it a day.

All the best,

Tels


