Thread: Possible fix for occasional failures on castoroides etc

Possible fix for occasional failures on castoroides etc

From

Tom Lane

Date:

16 September 2012, 16:04:22

It's annoying that the buildfarm animals running on older versions of
Solaris randomly fail with "Connection refused" errors, such as in
today's example:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52

I believe what's probably happening there is that the kernel has a small
hard-wired limit on the length of the postmaster's accept queue, and you
get this failure if too many connection attempts arrive faster than the
postmaster can service them.  If that theory is correct, we could
probably prevent these failures by reducing the number of tests run in
parallel, which could be done by adding say
    MAX_CONNECTIONS=5
to the environment in which the regression tests run.  I'm not sure
though if that's "build_env" or some other setting for the buildfarm
script --- Andrew?
        regards, tom lane

Re: Possible fix for occasional failures on castoroides etc

From

Andrew Dunstan

Date:

16 September 2012, 16:45:07

On 09/16/2012 12:04 PM, Tom Lane wrote:
> It's annoying that the buildfarm animals running on older versions of
> Solaris randomly fail with "Connection refused" errors, such as in
> today's example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
>
> I believe what's probably happening there is that the kernel has a small
> hard-wired limit on the length of the postmaster's accept queue, and you
> get this failure if too many connection attempts arrive faster than the
> postmaster can service them.  If that theory is correct, we could
> probably prevent these failures by reducing the number of tests run in
> parallel, which could be done by adding say
>     MAX_CONNECTIONS=5
> to the environment in which the regression tests run.  I'm not sure
> though if that's "build_env" or some other setting for the buildfarm
> script --- Andrew?
>
>             


Yes, in the build_env section of the config file.

It's in the distributed sample config file, commented out.

cheers

andrew

Re: Possible fix for occasional failures on castoroides etc

From

Dave Page

Date:

17 September 2012, 12:23:05

On Sun, Sep 16, 2012 at 12:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 09/16/2012 12:04 PM, Tom Lane wrote:
>>
>> It's annoying that the buildfarm animals running on older versions of
>> Solaris randomly fail with "Connection refused" errors, such as in
>> today's example:
>>
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
>>
>> I believe what's probably happening there is that the kernel has a small
>> hard-wired limit on the length of the postmaster's accept queue, and you
>> get this failure if too many connection attempts arrive faster than the
>> postmaster can service them.  If that theory is correct, we could
>> probably prevent these failures by reducing the number of tests run in
>> parallel, which could be done by adding say
>>         MAX_CONNECTIONS=5
>> to the environment in which the regression tests run.  I'm not sure
>> though if that's "build_env" or some other setting for the buildfarm
>> script --- Andrew?
>>
>>
>
>
>
> Yes, in the build_env section of the config file.
>
> It's in the distributed sample config file, commented out.

I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Possible fix for occasional failures on castoroides etc

From

Andres Freund

Date:

03 May 2014, 15:10:09

On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> On Sun, Sep 16, 2012 at 12:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> > On 09/16/2012 12:04 PM, Tom Lane wrote:
> >>
> >> It's annoying that the buildfarm animals running on older versions of
> >> Solaris randomly fail with "Connection refused" errors, such as in
> >> today's example:
> >>
> >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
> >>
> >> I believe what's probably happening there is that the kernel has a small
> >> hard-wired limit on the length of the postmaster's accept queue, and you
> >> get this failure if too many connection attempts arrive faster than the
> >> postmaster can service them.  If that theory is correct, we could
> >> probably prevent these failures by reducing the number of tests run in
> >> parallel, which could be done by adding say
> >>         MAX_CONNECTIONS=5
> >> to the environment in which the regression tests run.  I'm not sure
> >> though if that's "build_env" or some other setting for the buildfarm
> >> script --- Andrew?
> >>
> >>
> >
> >
> >
> > Yes, in the build_env section of the config file.
> >
> > It's in the distributed sample config file, commented out.
> 
> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

I've just noticed (while checking whether backporting 4c8aa8b5aea caused
problems) that this doesn't seem to have fixed the issue. One further
thing to try would be to try whether tcp connections don't have the same
problem.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Possible fix for occasional failures on castoroides etc

From

Tom Lane

Date:

03 May 2014, 17:25:43

Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> problems) that this doesn't seem to have fixed the issue. One further
> thing to try would be to try whether tcp connections don't have the same
> problem.

I did some googling on this, and found out that people have seen identical
behavior on Solaris with mysql and other products, so at least we're not
alone.  Googling also reminded me that we could have a look at the source
(duh), which is still available from hg.openindiana.org.  I poked around
a bit and more or less confirmed the theory mentioned here:
https://www.varnish-cache.org/trac/ticket/865
That is, Solaris' unix-sockets code will generate ECONNREFUSED if it
finds that the socket is not connected and not waiting for a connection
*and* there is no saved error code.  One example is:
    if (so->so_error != 0)
        return (sogeterr(so, B_TRUE));
    /*
     * Under normal circumstances, so_error should contain an error
     * in case the connect failed. However, it is possible for another
     * thread to come in a consume the error, so generate a sensible
     * error in that case.
     */
    if ((so->so_state & SS_ISCONNECTED) == 0)
        return (ECONNREFUSED);

Now, I can't imagine where the "other thread" hypothesized in this comment
could be, so what I'm thinking is that maybe there's a bug somewhere that
drops the connection attempt without setting any error in so_error; or
maybe there's a race condition that releases the waiting client before
so_error is set.  But that still leaves the question of why the connection
attempt is getting dropped at all.

BTW, I also found no less an authority than W. Richard Stevens saying that
my theory that this could happen from accept queue overflow was wrong, at
least in a sane implementation:

https://groups.google.com/forum/#!topic/comp.unix.solaris/e8QxFyXxr84

: >        - there are too many outstanding connections that haven't
: >          been accepted yet (perhaps you can up the second parameter
: >          to listen)
:
: No.  When the pending connection queue is filled, TCP ignores an
: arriving SYN, it does not respond with an RST.  This is a soft error
: (a busy server) and by ignoring it, TCP forces the client to retransmit
: the SYN, hopefully finding a less busy server at some time in the future.
: For additional details and an example, check out pp. 257-260 of my "TCP/IP
: Illustrated" (Addison-Wesley, 1994).
: 
:         Rich Stevens

Unfortunately, it seems the Solaris implementors didn't read Stevens,
because it looks to me like they *do* return ECONNREFUSED on accept queue
overflow.  Still, it's hard to see how that would be the issue if we're
still seeing this failure with only five clients.

There are just not that many references to ECONNREFUSED in the portions of
the Solaris source tree that look like they could be related to Unix
sockets, so it's hard to come up with more theories than this.
        regards, tom lane
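
The accept-queue theory discussed above can be probed with a small harness that fires a burst of concurrent connection attempts at a Unix-domain socket and counts how many come back with ECONNREFUSED. This is only a rough sketch, not something from the thread: the function name burst_connect and its parameters are made up for illustration, and the result depends entirely on the kernel (many systems simply make the client wait when the queue is full, so zero refusals there says nothing about Solaris).

```python
import errno
import os
import socket
import tempfile
import threading


def burst_connect(clients=20, backlog=5):
    """Fire `clients` concurrent connects at a Unix socket and count outcomes.

    Hypothetical diagnostic helper: `backlog` is passed to listen(), and the
    returned dict counts connects that succeeded, were refused, or failed
    with some other error.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "s")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(backlog)
    server.settimeout(0.2)  # lets the accept loop notice the stop flag
    stop = threading.Event()

    def accept_loop():
        # Accept each connection and close it at once, like a trivial server.
        while not stop.is_set():
            try:
                conn, _ = server.accept()
            except socket.timeout:
                continue
            conn.close()

    counts = {"connected": 0, "refused": 0, "other": 0}
    lock = threading.Lock()

    def client():
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            outcome = "connected"
        except OSError as exc:
            outcome = "refused" if exc.errno == errno.ECONNREFUSED else "other"
        finally:
            sock.close()
        with lock:
            counts[outcome] += 1

    acceptor = threading.Thread(target=accept_loop)
    acceptor.start()
    workers = [threading.Thread(target=client) for _ in range(clients)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    stop.set()
    acceptor.join()
    server.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return counts
```

Running burst_connect(clients=100, backlog=5) on an affected machine and seeing a nonzero "refused" count would support the overflow theory; seeing refusals even with a handful of clients would point elsewhere.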

Re: Possible fix for occasional failures on castoroides etc

From

Tom Lane

Date:

03 May 2014, 18:59:38

I wrote:
> Unfortunately, it seems the Solaris implementors didn't read Stevens,
> because it looks to me like they *do* return ECONNREFUSED on accept queue
> overflow.  Still, it's hard to see how that would be the issue if we're
> still seeing this failure with only five clients.

Also, after further inspection of the source code, it appears to me that
the kernel's limit on accept queue length is hard-wired at 4096 in
Solaris.  So there's basically no way that we're hitting that limit in the
regression tests, and the MAX_CONNECTIONS configuration is irrelevant.

We seem to be left with the race condition theory.  In that connection,
this comment in /usr/src/uts/common/io/tl.c is interesting:
 *    The T_CONN_CON is generated when processing the T_CONN_REQ i.e. before
 *    a T_CONN_RES is received from the acceptor. This means that a socket
 *    connect will complete before the peer has called accept.

I'm not sure that explains anything of value, but it's probably unlike any
other implementation, which makes it perhaps relevant.  It implies that
this is totally unrelated to any server-side behavior; so if it's possible
for us to work around it at all, we'd have to do so client-side.
        regards, tom lane
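
To illustrate what a client-side workaround of the kind mentioned above might look like, here is a minimal sketch that retries a Unix-domain socket connect only when the error is ECONNREFUSED. It is an assumption-laden illustration, not anything proposed in the thread and not how libpq or psql behave: the name connect_with_retry and the attempts/delay parameters are hypothetical.

```python
import errno
import socket
import time


def connect_with_retry(path, attempts=5, delay=0.05):
    """Connect to a Unix-domain socket, retrying only on ECONNREFUSED.

    Hypothetical helper: any other error (for example a missing socket
    file) is raised immediately, since retrying would not help.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.ECONNREFUSED:
                raise
            last_error = exc
            if attempt < attempts - 1:
                # Back off a little longer after each refused attempt.
                time.sleep(delay * (attempt + 1))
    raise last_error
```

The obvious cost of such a retry is that a genuinely absent server takes longer to report, which is why the sketch keeps the attempt count and delay small.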

Re: Possible fix for occasional failures on castoroides etc

From

Andres Freund

Date:

03 May 2014, 19:29:36

On 2014-05-03 13:25:32 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> >> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
> 
> > I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> > problems) that this doesn't seem to have fixed the issue. One further
> > thing to try would be to try whether tcp connections don't have the same
> > problem.
> 
> I did some googling on this, and found out that people have seen identical
> behavior on Solaris with mysql and other products, so at least we're not
> alone.

Yea, I found a couple of reports of that as well.

>  Googling also reminded me that we could have a look at the source
> (duh), which is still available from hg.openindiana.org.

I didn't get that far ;)

I think we should try whether the problem disappears if tcp connections
are used. That ought to be much more heavily used in the real
world. Thus less likely to be buggy.

While it's not documented as such, passing --host=localhost to
pg_regress seems to have the desired effect. Dave, could you make your
animal specify that?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Possible fix for occasional failures on castoroides etc

From

Dave Page

Date:

06 May 2014, 08:36:23

On Sat, May 3, 2014 at 8:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-05-03 13:25:32 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>> > On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>> >> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
>>
>> > I've just noticed (while checking whether backporting 4c8aa8b5aea caused
>> > problems) that this doesn't seem to have fixed the issue. One further
>> > thing to try would be to try whether tcp connections don't have the same
>> > problem.
>>
>> I did some googling on this, and found out that people have seen identical
>> behavior on Solaris with mysql and other products, so at least we're not
>> alone.
>
> Yea, I found a couple of reports of that as well.
>
>>  Googling also reminded me that we could have a look at the source
>> (duh), which is still available from hg.openindiana.org.
>
> I didn't get that far ;)
>
> I think we should try whether the problem disappears if tcp connections
> are used. That ought to be much more heavily used in the real
> world. Thus less likely to be buggy.
>
> While it's not documented as such, passing --host=localhost to
> pg_regress seems to have the desired effect. Dave, could you make your
> animal specify that?

I've added:

EXTRA_REGRESS_OPTS => '--host=localhost',

to the build_env setting for both animals.


-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Possible fix for occasional failures on castoroides etc

From

Tom Lane

Date:

18 May 2014, 05:35:19

Dave Page <dpage@pgadmin.org> writes:
> On Sat, May 3, 2014 at 8:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

>> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
>> problems) that this doesn't seem to have fixed the issue. One further
>> thing to try would be to try whether tcp connections don't have the same
>> problem.

> I've added:
> EXTRA_REGRESS_OPTS => '--host=localhost',
> to the build_env setting for both animals.

According to
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-05-16%2014%3A27%3A58
this did not fix the problem; however, the failure is

! psql: could not connect to server: Connection refused
!     Is the server running locally and accepting
!     connections on Unix domain socket "/tmp/.s.PGSQL.57345"?

which shows that this configuration change did not actually have the
desired effect of forcing the regression tests to be run across TCP.
I'm too tired to check into what *would* force that.
        regards, tom lane
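
The check made above, reading the failure text to see which transport the client actually used, can be automated when scanning many logs. The following is a small sketch under the assumption that the messages follow the libpq wording quoted in this thread (and the matching "TCP/IP connections on port" wording for TCP); the function name transport_of is made up for illustration.

```python
import re

# Patterns for the two connection-failure hints libpq prints.
UNIX_RE = re.compile(r'Unix domain socket "([^"]+)"')
TCP_RE = re.compile(r"TCP/IP connections on port (\d+)")


def transport_of(message):
    """Return ("unix", path), ("tcp", port) or ("unknown", None)."""
    match = UNIX_RE.search(message)
    if match:
        return ("unix", match.group(1))
    match = TCP_RE.search(message)
    if match:
        return ("tcp", int(match.group(1)))
    return ("unknown", None)
```

Applied to the psql error quoted above, this reports the Unix-domain socket path, which is the sign that --host=localhost never reached pg_regress.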

Re: Possible fix for occasional failures on castoroides etc

From

Andres Freund

Date:

18 May 2014, 05:55:58

On 2014-05-18 01:35:04 -0400, Tom Lane wrote:
> Dave Page <dpage@pgadmin.org> writes:
> > On Sat, May 3, 2014 at 8:29 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> >>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
> 
> >> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> >> problems) that this doesn't seem to have fixed the issue. One further
> >> thing to try would be to try whether tcp connections don't have the same
> >> problem.
> 
> > I've added:
> > EXTRA_REGRESS_OPTS => '--host=localhost',
> > to the build_env setting for both animals.
> 
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-05-16%2014%3A27%3A58
> this did not fix the problem; however, the failure is
> 
> ! psql: could not connect to server: Connection refused
> !     Is the server running locally and accepting
> !     connections on Unix domain socket "/tmp/.s.PGSQL.57345"?
> 
> which shows that this configuration change did not actually have the
> desired effect of forcing the regression tests to be run across TCP.
> I'm too tired to check into what *would* force that.

I think that's just because EXTRA_REGRESS_OPTS is fairly new
(19fa6161dd6ba85b6c88b3476d165745dd5192d9). No idea if there's a nice
way to pass options to the pg_regress invocations of buildfarm animals.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services