Re: Possible fix for occasional failures on castoroides etc - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Possible fix for occasional failures on castoroides etc
Date
Msg-id 31133.1399143568@sss.pgh.pa.us
In response to Re: Possible fix for occasional failures on castoroides etc  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> Unfortunately, it seems the Solaris implementors didn't read Stevens,
> because it looks to me like they *do* return ECONNREFUSED on accept queue
> overflow.  Still, it's hard to see how that would be the issue if we're
> still seeing this failure with only five clients.

Also, after further inspection of the source code, it appears to me that
the kernel's limit on accept queue length is hard-wired at 4096 in
Solaris.  So there's basically no way that we're hitting that limit in the
regression tests, and the MAX_CONNECTIONS configuration is irrelevant.

We seem to be left with the race condition theory.  In that connection,
this comment in /usr/src/uts/common/io/tl.c is interesting:
 *    The T_CONN_CON is generated when processing the T_CONN_REQ i.e. before
 *    a T_CONN_RES is received from the acceptor. This means that a socket
 *    connect will complete before the peer has called accept.
 

I'm not sure that explains anything of value, but it's probably unlike any
other implementation, which makes it perhaps relevant.  It implies that
this is totally unrelated to any server-side behavior; so if it's possible
for us to work around it at all, we'd have to do so client-side.
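
For illustration only, here is a minimal sketch of what a client-side
workaround could look like, assuming the spurious failure shows up as
ECONNREFUSED from connect() on a Unix-domain socket and that a short
delay is enough for it to clear.  The function name connect_with_retry,
its parameters, and the retry policy are all hypothetical; this is not
existing libpq code and not a tested fix.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for nanosleep() under strict -std modes */
#endif

#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpq): connect to the Unix-domain
 * socket at "path", retrying only when connect() fails with ECONNREFUSED.
 *
 * Returns a connected descriptor on success, or -1 with errno set.
 * Any error other than ECONNREFUSED is reported immediately.
 */
int
connect_with_retry(const char *path, int max_attempts, long delay_ms)
{
    struct sockaddr_un addr;
    int         attempt;

    if (path == NULL || max_attempts < 1 || delay_ms < 0 ||
        strlen(path) >= sizeof(addr.sun_path))
    {
        errno = EINVAL;
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);    /* length was checked above */

    for (attempt = 1; attempt <= max_attempts; attempt++)
    {
        int         fd = socket(AF_UNIX, SOCK_STREAM, 0);
        int         saved_errno;

        if (fd < 0)
            return -1;

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
            return fd;              /* connected */

        /* close() may clobber errno, so preserve connect()'s value */
        saved_errno = errno;
        close(fd);
        errno = saved_errno;

        if (saved_errno != ECONNREFUSED)
            return -1;              /* a different failure: do not retry */

        if (attempt < max_attempts)
        {
            struct timespec ts;

            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);   /* brief pause before the next try */
        }
    }

    errno = ECONNREFUSED;           /* every attempt was refused */
    return -1;
}
```

A caller might try something like connect_with_retry(path, 5, 100),
though those numbers are arbitrary.  The obvious cost is that a server
that really is down takes longer to be reported as such, and a retry
loop would hide the underlying race rather than explain it.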
        regards, tom lane


