Thread: [HACKERS] [doc fix] PG10: wroing description on connect_timeout when multiplehosts are specified

Hello, Robert

I found a wrong sentence here in the doc.  I'm sorry, this is what I asked you to add...

https://www.postgresql.org/docs/devel/static/libpq-connect.html#libpq-paramkeywords

connect_timeout
Maximum wait for connection, in seconds (write as a decimal integer string). Zero or not specified means wait
indefinitely.It is not recommended to use a timeout of less than 2 seconds. This timeout applies separately to each
connectionattempt. For example, if you specify two hosts and both of them are unreachable, and connect_timeout is 5,
thetotal time spent waiting for a connection might be up to 10 seconds.
 


The program behavior is that libpq times out after connect_timeout seconds regardless of how many hosts are specified.
Iconfirmed it like this:
 

$ export PGOPTIONS="-c post_auth_delay=30"
$ psql -d "dbname=postgres connect_timeout=5" -h localhost,localhost -p 5432,5433
(psql erros out after 5 seconds)

Could you fix the doc with something like this?

"This timeout applies across all the connection attempts. For example, if you specify two hosts and both of them are
unreachable,and connect_timeout is 5, the total time spent waiting for a connection is up to 5 seconds."
 

Should we also change the minimum "2 seconds" part to be longer, according to the number of hosts?

Regards
Takayuki Tsunakawa




From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tsunakawa,
> Takayuki
> I found a wrong sentence here in the doc.  I'm sorry, this is what I asked
> you to add...
> 
> https://www.postgresql.org/docs/devel/static/libpq-connect.html#libpq-
> paramkeywords
> 
> connect_timeout
> Maximum wait for connection, in seconds (write as a decimal integer string).
> Zero or not specified means wait indefinitely. It is not recommended to
> use a timeout of less than 2 seconds. This timeout applies separately to
> each connection attempt. For example, if you specify two hosts and both
> of them are unreachable, and connect_timeout is 5, the total time spent
> waiting for a connection might be up to 10 seconds.
> 
> 
> The program behavior is that libpq times out after connect_timeout seconds
> regardless of how many hosts are specified.  I confirmed it like this:
> 
> $ export PGOPTIONS="-c post_auth_delay=30"
> $ psql -d "dbname=postgres connect_timeout=5" -h localhost,localhost -p
> 5432,5433
> (psql erros out after 5 seconds)
> 
> Could you fix the doc with something like this?
> 
> "This timeout applies across all the connection attempts. For example, if
> you specify two hosts and both of them are unreachable, and connect_timeout
> is 5, the total time spent waiting for a connection is up to 5 seconds."
> 
> Should we also change the minimum "2 seconds" part to be longer, according
> to the number of hosts?

Instead, I think we should fix the program to match the documented behavior.  Otherwise, if the first database machine
isdown, libpq might wait for about 2 hours (depending on the OS's TCP keepalive setting), during which it tims out
afterconnect_timeout and does not attempt to connect to other hosts.
 

I'll add this item in the PostgreSQL 10 Open Items.

Regards
Takayuki Tsunakawa




From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tsunakawa,> Takayuki
> Instead, I think we should fix the program to match the documented behavior.
> Otherwise, if the first database machine is down, libpq might wait for about
> 2 hours (depending on the OS's TCP keepalive setting), during which it tims
> out after connect_timeout and does not attempt to connect to other hosts.
> 
> I'll add this item in the PostgreSQL 10 Open Items.

Please use the attached patch to fix the problem.  I confirmed the success as follows:

$ add "post_auth_delay = 10" in postgresql.conf on host1
$ start database servers on host1 and host2
$ psql -h host1,host2 -p 5432,5433 -d "dbname=postgres connect_timeout=3"
(psql connected to host2 after 3 seconds, which I checked with \conninfo)

Regards
Takayuki Tsunakawa


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Fri, May 12, 2017 at 08:54:13AM +0000, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tsunakawa,
> > Takayuki
> > I found a wrong sentence here in the doc.  I'm sorry, this is what I asked
> > you to add...
> > 
> > https://www.postgresql.org/docs/devel/static/libpq-connect.html#libpq-
> > paramkeywords
> > 
> > connect_timeout
> > Maximum wait for connection, in seconds (write as a decimal integer string).
> > Zero or not specified means wait indefinitely. It is not recommended to
> > use a timeout of less than 2 seconds. This timeout applies separately to
> > each connection attempt. For example, if you specify two hosts and both
> > of them are unreachable, and connect_timeout is 5, the total time spent
> > waiting for a connection might be up to 10 seconds.
> > 
> > 
> > The program behavior is that libpq times out after connect_timeout seconds
> > regardless of how many hosts are specified.  I confirmed it like this:
> > 
> > $ export PGOPTIONS="-c post_auth_delay=30"
> > $ psql -d "dbname=postgres connect_timeout=5" -h localhost,localhost -p
> > 5432,5433
> > (psql erros out after 5 seconds)
> > 
> > Could you fix the doc with something like this?
> > 
> > "This timeout applies across all the connection attempts. For example, if
> > you specify two hosts and both of them are unreachable, and connect_timeout
> > is 5, the total time spent waiting for a connection is up to 5 seconds."
> > 
> > Should we also change the minimum "2 seconds" part to be longer, according
> > to the number of hosts?
> 
> Instead, I think we should fix the program to match the documented behavior.  Otherwise, if the first database
machineis down, libpq might wait for about 2 hours (depending on the OS's TCP keepalive setting), during which it tims
outafter connect_timeout and does not attempt to connect to other hosts.
 
> 
> I'll add this item in the PostgreSQL 10 Open Items.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



On Sun, May 14, 2017 at 11:45 PM, Noah Misch <noah@leadboat.com> wrote:
>> I'll add this item in the PostgreSQL 10 Open Items.
>
> [Action required within three days.  This is a generic notification.]

I think there is a good argument that the existing behavior is as per
the documentation, but I think we may want to change it anyway.  What
the documentation is saying - or at least what I believe I intended
for it to say - is that connect_timeout is restarted for each new
host, so you could end up waiting longer than connect_timeout - but
not forever - if you specify multiple hosts.  And I believe that
statement to be correct. Takayuki Tsunakawa is saying something
different. He's saying that when connect_timeout expires, we should
try the next host instead of giving up. That may or may not be a good
idea, but it doesn't contradict the passage from the documentation
which he quoted.  That passage from the documentation doesn't say
anything at all about what happens when connect_timeout expires.  It
only talks about how much time might pass before that happens.

Takayuki Tsunakawa raised a very similar issue in another thread
related to another open item, namely
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6F5659%40G01JPEXMBYT05
in which he argued that libpq ought to try then next host after a
connection failure regardless of the reason for the connection
failure.  Tom, Michael Paquier, and I all disagreed; none of us
believe that this feature was intended to retry the connection to a
different host after an arbitrary error reported by the remote server.
This thread is essentially the same issue, except here the question
isn't what should happen after we connect to a server and it returns
an error, but rather what happens when we time out waiting to connect
to a server.  When that happens, should we give up, or try the next
server?

Despite the chorus of support for the opposite conclusion on the other
thread, I'm inclined to think that it would be best to change the
behavior here as per the proposed patch.  The point of being able to
specify multiple hosts is to be able to have multiple database servers
(or perhaps, multiple ways to access the same database server) and use
whichever one of those servers is currently up.  I think that when the
server fails with a complaint like "I've never heard of the database
to which you want to connect" that's not a case of the server being
down, but some other kind of trouble that the administrator really
ought to fix; thus it's best to stop and report the error.  But if
connect_timeout expires, that sounds a whole lot like the server being
down.  It sounds morally equivalent to socket() or connect() failing
outright, which *would* trigger advancing to the next host.

So I'm inclined to accept the patch, but as a definitional change
rather than a bug fix.  However, I'd like to hear some other opinions.
I'll wait until Friday for such opinions to arrive, and then update on
next steps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> Takayuki Tsunakawa raised a very similar issue in another thread
> related to another open item, namely
> https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6F5659%40G01JPEXMBYT05
> in which he argued that libpq ought to try then next host after a
> connection failure regardless of the reason for the connection
> failure.  Tom, Michael Paquier, and I all disagreed; none of us
> believe that this feature was intended to retry the connection to a
> different host after an arbitrary error reported by the remote server.
> This thread is essentially the same issue, except here the question
> isn't what should happen after we connect to a server and it returns
> an error, but rather what happens when we time out waiting to connect
> to a server.  When that happens, should we give up, or try the next
> server?

FWIW, I think the position most of us were taking is that this feature
is meant to retry transport-level connection failures, not cases where
we successfully make a connection to a server and then it rejects our
login attempt.  I would classify a timeout as a transport-level failure
as long as it occurs before we got any server response --- if it happens
during the authentication protocol, that's less clear.  But it might not
be very practical to distinguish those two cases.

In short, +1 for retrying on timeout during connection, and I'm okay with
retrying a timeout during authentication if it's not practical to treat
that differently.
        regards, tom lane



On Tue, May 16, 2017 at 3:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I think the position most of us were taking is that this feature
> is meant to retry transport-level connection failures, not cases where
> we successfully make a connection to a server and then it rejects our
> login attempt.  I would classify a timeout as a transport-level failure
> as long as it occurs before we got any server response --- if it happens
> during the authentication protocol, that's less clear.  But it might not
> be very practical to distinguish those two cases.
>
> In short, +1 for retrying on timeout during connection, and I'm okay with
> retrying a timeout during authentication if it's not practical to treat
> that differently.

Sensible argument here. It could happen that a server is unresponsive,
for example in split-brains (?).

I have been playing a bit with the patch.

+ *
+ * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
it timed out. */
pqWait is internal to libpq, so we are free to set up what we want
here. Still I think that we should be consistent with what
pqSocketCheck returns:
* >0 means that the socket is readable/writable, counting things.
* 0 is for timeout.
* -1 on failure.

+    int            ret = 0;
+    int            timeout = 0;
The declaration of "ret" should be internal in the for(;;) loop.

+           /* Attempt connection to the next host, starting the
connect_timeout timer */
+           pqDropConnection(conn, true);
+           conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
+           conn->status = CONNECTION_NEEDED;
+           finish_time = time(NULL) + timeout;
+       }
I think that it would be safer to not set finish_time if
conn->connect_timeout is NULL. I agree that your code works because
pqWaitTimed() will never complain on timeout reached if finish_time is
-1. That's for robustness sake.

The docs say that for connect_timeout:     <para>      Maximum wait for connection, in seconds (write as a decimal
integer     string). Zero or not specified means wait indefinitely.  It is not      recommended to use a timeout of
lessthan 2 seconds.      This timeout applies separately to each connection attempt.      For example, if you specify
twohosts and both of them are unreachable,      and <literal>connect_timeout</> is 5, the total time spent waiting for
a     connection might be up to 10 seconds.     </para>
 

It seems to me that this implies that if a timeout occurs on the first
connection, the counter is reset, which is what this patch is doing.
So we are all set.
-- 
Michael



Hello Robert, Tom, Michael,

Thanks a lot for checking my patch.  Sorry, let me check Michael's review comments and reply tomorrow.

Regards
Takayuki Tsunakawa


On Mon, May 15, 2017 at 9:59 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> + *
> + * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
> it timed out.
>   */
> pqWait is internal to libpq, so we are free to set up what we want
> here. Still I think that we should be consistent with what
> pqSocketCheck returns:
> * >0 means that the socket is readable/writable, counting things.
> * 0 is for timeout.
> * -1 on failure.

That would imply a lot more change, though.  The way that the patch
currently does it, none of the other callers of pqWait() or
pqWaitTimed() need to be adjusted.  So I prefer the way that Tsunakawa
Takayuki currently has this over your proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, May 17, 2017 at 12:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 15, 2017 at 9:59 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> + *
>> + * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
>> it timed out.
>>   */
>> pqWait is internal to libpq, so we are free to set up what we want
>> here. Still I think that we should be consistent with what
>> pqSocketCheck returns:
>> * >0 means that the socket is readable/writable, counting things.
>> * 0 is for timeout.
>> * -1 on failure.
>
> That would imply a lot more change, though.  The way that the patch
> currently does it, none of the other callers of pqWait() or
> pqWaitTimed() need to be adjusted.  So I prefer the way that Tsunakawa
> Takayuki currently has this over your proposal.

Consistency in APIs matters, but I won't fight hard in favor of this
item either. In short I am fine to discard this comment.
-- 
Michael



From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
> pqWait is internal to libpq, so we are free to set up what we want here.
> Still I think that we should be consistent with what pqSocketCheck returns:

Please let this what it is now for the same reason Robert mentioned.

> +    int            ret = 0;
> +    int            timeout = 0;
> The declaration of "ret" should be internal in the for(;;) loop.

Done.


> +           /* Attempt connection to the next host, starting the
> connect_timeout timer */
> +           pqDropConnection(conn, true);
> +           conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
> +           conn->status = CONNECTION_NEEDED;
> +           finish_time = time(NULL) + timeout;
> +       }
> I think that it would be safer to not set finish_time if
> conn->connect_timeout is NULL. I agree that your code works because
> pqWaitTimed() will never complain on timeout reached if finish_time is -1.
> That's for robustness sake.

Done, but I'm not sure how this contributes to the robustness.  I guess you were concerned just in case pqWaitTimed()
returned0 (timeout) even when it should not.
 

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Wed, May 17, 2017 at 11:49 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
>> pqWait is internal to libpq, so we are free to set up what we want here.
>> Still I think that we should be consistent with what pqSocketCheck returns:
>
> Please let this what it is now for the same reason Robert mentioned.
>
>> +    int            ret = 0;
>> +    int            timeout = 0;
>> The declaration of "ret" should be internal in the for(;;) loop.
>
> Done.
>
>> +           /* Attempt connection to the next host, starting the
>> connect_timeout timer */
>> +           pqDropConnection(conn, true);
>> +           conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
>> +           conn->status = CONNECTION_NEEDED;
>> +           finish_time = time(NULL) + timeout;
>> +       }
>> I think that it would be safer to not set finish_time if
>> conn->connect_timeout is NULL. I agree that your code works because
>> pqWaitTimed() will never complain on timeout reached if finish_time is -1.
>> That's for robustness sake.
>
> Done, but I'm not sure how this contributes to the robustness.  I guess you were concerned just in case pqWaitTimed()
returned0 (timeout) even when it should not.
 

Thanks for the updated patch. This looks good to me.
-- 
Michael



On Tue, May 16, 2017 at 11:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Thanks for the updated patch. This looks good to me.

Committed.  I also added a slight tweak to the wording of the documentation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Committed.  I also added a slight tweak to the wording of the documentation.

Thank you, the doc looks clearer.

Regards
Takayuki Tsunakawa