Re: Possible fix for occasional failures on castoroides etc - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Possible fix for occasional failures on castoroides etc
Date
Msg-id 29006.1399137932@sss.pgh.pa.us
In response to Re: Possible fix for occasional failures on castoroides etc  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Possible fix for occasional failures on castoroides etc  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Possible fix for occasional failures on castoroides etc  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> problems) that this doesn't seem to have fixed the issue. One further
> thing to try would be to check whether TCP connections have the same
> problem.

I did some googling on this, and found out that people have seen identical
behavior on Solaris with MySQL and other products, so at least we're not
alone.  Googling also reminded me that we could have a look at the source
(duh), which is still available from hg.openindiana.org.  I poked around
a bit and more or less confirmed the theory mentioned here:
https://www.varnish-cache.org/trac/ticket/865
That is, Solaris' unix-sockets code will generate ECONNREFUSED if it
finds that the socket is not connected and not waiting for a connection
*and* there is no saved error code.  One example is:

    if (so->so_error != 0)
        return (sogeterr(so, B_TRUE));
    /*
     * Under normal circumstances, so_error should contain an error
     * in case the connect failed. However, it is possible for another
     * thread to come in a consume the error, so generate a sensible
     * error in that case.
     */
    if ((so->so_state & SS_ISCONNECTED) == 0)
        return (ECONNREFUSED);

Now, I can't imagine where the "other thread" hypothesized in this comment
could be, so what I'm thinking is that maybe there's a bug somewhere that
drops the connection attempt without setting any error in so_error; or
maybe there's a race condition that releases the waiting client before
so_error is set.  But that still leaves the question of why the connection
attempt is getting dropped at all.
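
To make the two hypotheses concrete, here is a toy user-space model of the
check quoted above. It is only a sketch, not Solaris code: the model_* names
and MODEL_SS_ISCONNECTED are made up for illustration, and it assumes that
sogeterr(so, B_TRUE) returns the saved error and clears it. The point is
that any path that leaves the socket unconnected with so_error still zero
comes out as ECONNREFUSED, whatever the real cause was.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's SS_ISCONNECTED state bit. */
#define MODEL_SS_ISCONNECTED 0x1

/* Hypothetical, stripped-down socket: only the two fields the check reads. */
struct model_sock
{
    int so_error;   /* saved error code, 0 if none */
    int so_state;   /* state bits */
};

/* Assumed behaviour of sogeterr(so, B_TRUE): return the error and clear it. */
int
model_geterr(struct model_sock *so)
{
    int err = so->so_error;

    so->so_error = 0;
    return err;
}

/* Same decision order as the quoted excerpt. */
int
model_connect_status(struct model_sock *so)
{
    assert(so != NULL);

    if (so->so_error != 0)
        return model_geterr(so);

    /* No saved error and not connected: report a generic refusal. */
    if ((so->so_state & MODEL_SS_ISCONNECTED) == 0)
        return ECONNREFUSED;

    return 0;
}
```

In this model, a socket that failed with ETIMEDOUT reports ETIMEDOUT on the
first call and ECONNREFUSED on the second, because the first call consumed
the error. A dropped connection attempt that never set so_error would look
like the second call from the start.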

BTW, I also found no less an authority than W. Richard Stevens saying that
my theory that this could happen from accept queue overflow was wrong, at
least in a sane implementation:

https://groups.google.com/forum/#!topic/comp.unix.solaris/e8QxFyXxr84

: >        - there are too many outstanding connections that haven't
: >          been accepted yet (perhaps you can up the second parameter
: >          to listen)
:
: No.  When the pending connection queue is filled, TCP ignores an
: arriving SYN, it does not respond with an RST.  This is a soft error
: (a busy server) and by ignoring it, TCP forces the client to retransmit
: the SYN, hopefully finding a less busy server at some time in the future.
: For additional details and an example, check out pp. 257-260 of my "TCP/IP
: Illustrated" (Addison-Wesley, 1994).
: 
:         Rich Stevens

Unfortunately, it seems the Solaris implementors didn't read Stevens,
because it looks to me like they *do* return ECONNREFUSED on accept queue
overflow.  Still, it's hard to see how that would be the issue if we're
still seeing this failure with only five clients.
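
For anyone who wants to check how a given platform reports accept queue
overflow on a Unix socket, a rough probe might look like the following. This
is an untested-on-Solaris sketch: probe_backlog, struct probe_tally and
PROBE_MAX_ATTEMPTS are names invented here, and the socket path is whatever
the caller passes in. It listens with a small backlog, never calls accept(),
makes a series of non-blocking connect() calls, and tallies the results.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define PROBE_MAX_ATTEMPTS 64   /* arbitrary cap for this sketch */

/* Tally of connect() outcomes. */
struct probe_tally
{
    int connected;      /* connect() returned 0 */
    int refused;        /* errno was ECONNREFUSED */
    int would_block;    /* errno was EAGAIN or EINPROGRESS */
    int other;          /* anything else, including socket() failure */
};

/*
 * Listen on a Unix socket at "path" with the given backlog, never accept,
 * and attempt "attempts" non-blocking connects.  Returns 0 if the probe
 * ran, -1 if the arguments were bad or the listener could not be set up.
 */
int
probe_backlog(const char *path, int backlog, int attempts,
              struct probe_tally *t)
{
    struct sockaddr_un addr;
    int         fds[PROBE_MAX_ATTEMPTS];
    int         lfd, i, flags;

    assert(path != NULL && t != NULL);
    memset(t, 0, sizeof(*t));
    if (attempts < 0 || attempts > PROBE_MAX_ATTEMPTS ||
        strlen(path) >= sizeof(addr.sun_path))
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    unlink(path);               /* remove a stale socket file, if any */

    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    if (bind(lfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(lfd, backlog) < 0)
    {
        close(lfd);
        return -1;
    }

    for (i = 0; i < attempts; i++)
    {
        fds[i] = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fds[i] < 0)
        {
            t->other++;
            continue;
        }
        /* Non-blocking, so a full queue cannot hang the probe. */
        flags = fcntl(fds[i], F_GETFL, 0);
        if (flags >= 0)
            fcntl(fds[i], F_SETFL, flags | O_NONBLOCK);

        if (connect(fds[i], (struct sockaddr *) &addr, sizeof(addr)) == 0)
            t->connected++;
        else if (errno == ECONNREFUSED)
            t->refused++;
        else if (errno == EAGAIN || errno == EINPROGRESS)
            t->would_block++;
        else
            t->other++;
    }

    /* Clean up: client sockets, listener, and the socket file. */
    for (i = 0; i < attempts; i++)
        if (fds[i] >= 0)
            close(fds[i]);
    close(lfd);
    unlink(path);
    return 0;
}
```

Calling, say, probe_backlog("/tmp/probe.sock", 2, 8, &t) and printing the
four counters shows which errno the platform hands back once the queue is
full. A nonzero "refused" count with so few clients would be consistent with
the reading of the Solaris source above, but the numbers will differ from
one kernel to the next, so I would not read much into any single run.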

There are just not that many references to ECONNREFUSED in the portions of
the Solaris source tree that look like they could be related to Unix
sockets, so it's hard to come up with more theories than this.
        regards, tom lane