Thread: AW: Re: SOMAXCONN (was Re: Solaris source code)

AW: Re: SOMAXCONN (was Re: Solaris source code)

From
Zeugswetter Andreas SB
Date:
> When the system is too heavily loaded (however measured), any further 
> login attempts will fail.  What I suggested is, instead of the 
> postmaster accept()ing the connection, why not leave the connection 
> attempt in the queue until we can afford a back end to handle it?  

Because the clients would time out ?

> Then, the argument to listen() will determine how many attempts can 
> be in the queue before the network stack itself rejects them without 
> the postmaster involved.

You cannot change the argument to listen() at runtime, or are you suggesting
to close and reopen the socket when maxbackends is reached ? I think 
that would be nonsense.

I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in 
accepting more than your total allowed connections concurrently.

Andreas


Re: AW: Re: SOMAXCONN (was Re: Solaris source code)

From
mlw
Date:
Zeugswetter Andreas SB wrote:
> 
> > When the system is too heavily loaded (however measured), any further
> > login attempts will fail.  What I suggested is, instead of the
> > postmaster accept()ing the connection, why not leave the connection
> > attempt in the queue until we can afford a back end to handle it?
> 
> Because the clients would time out ?
> 
> > Then, the argument to listen() will determine how many attempts can
> > be in the queue before the network stack itself rejects them without
> > the postmaster involved.
> 
> You cannot change the argument to listen() at runtime, or are you suggesting
> to close and reopen the socket when maxbackends is reached ? I think
> that would be nonsense.
> 
> I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in
> accepting more than your total allowed connections concurrently.
> 
> Andreas

I have been following this thread and I am confused about why the queue argument to
listen() has anything to do with max backends. All the parameter to listen() does
is specify how long the list of sockets open and waiting for connection can be.
It has nothing to do with the number of back-end sockets which are open.

If you have a limit of 128 back end connections, and you have 127 of them open,
a listen with queue size of 128 will still allow 128 sockets to wait for
connection before turning others away.

It should be a parameter based on the timeout of a socket connection vs. our
ability to answer connection requests within that period of time.
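
For illustration, a bare-bones sketch in generic C (not anything from the
PostgreSQL tree): the backlog argument to listen() only bounds the kernel's
queue of not-yet-accepted connections, and says nothing about how many
sockets have already been accepted.

/* Minimal sketch, generic C (not PostgreSQL source): the backlog passed
 * to listen() only bounds connections the kernel holds *before* accept();
 * it places no limit on how many sockets have already been accepted. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5432);

    bind(listener, (struct sockaddr *) &addr, sizeof(addr));
    listen(listener, 128);          /* queue of *pending* connections only */

    for (;;)
    {
        /* each accept() removes one entry from the pending queue; the
         * total number of accepted sockets is unrelated to the 128 */
        int conn = accept(listener, NULL, NULL);

        if (conn < 0)
            continue;
        /* ... hand the connection off to a back end, fork(), etc. ... */
        close(conn);
    }
}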

There are two ways to think about this. Either you make this parameter tunable
to give a proper estimate of the usability of the system, i.e. tailor the
listen queue parameter to reject sockets when some number of sockets are
waiting, or you say no one should ever be denied: accept everyone and let them
time out if we are not fast enough.

This debate could go on; why not make it a parameter in the config file that
defaults to some system variable, i.e. SOMAXCONN?

BTW: on linux, the backlog queue parameter is silently truncated to 128 anyway.


Re: Re: SOMAXCONN

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jul 13, 2001 at 10:36:13AM +0200, Zeugswetter Andreas SB wrote:
> 
> > When the system is too heavily loaded (however measured), any further 
> > login attempts will fail.  What I suggested is, instead of the 
> > postmaster accept()ing the connection, why not leave the connection 
> > attempt in the queue until we can afford a back end to handle it?  
> 
> Because the clients would time out ?

It takes a long time for half-open connections to time out, by default.
Probably most clients would time out, themselves, first, if PG took too
long to get to them.  That would be a Good Thing.

Once the SOMAXCONN threshold is reached (which would only happen when 
the system is very heavily loaded, because when it's not then nothing 
stays in the queue for long), new connection attempts would fail 
immediately, another Good Thing.  When the system is very heavily 
loaded, we don't want to spare attention for clients we can't serve.

> > Then, the argument to listen() will determine how many attempts can 
> > be in the queue before the network stack itself rejects them without 
> > the postmaster involved.
> 
> You cannot change the argument to listen() at runtime, or are you suggesting
> to close and reopen the socket when maxbackends is reached ? I think 
> that would be nonsense.

Of course that would not work, and indeed nobody suggested it.

If postmaster behaved a little differently, not accept()ing when
the system is too heavily loaded, then it would be reasonable to
call listen() (once!) with PG_SOMAXCONN set to (e.g.) N=20.  

When the system is not too heavily loaded, the postmaster accept()s
the connection attempts from the queue very quickly, and the number
of half-open connections never builds up to N.  (This is how PG has
been running already, under light load -- except that on Solaris with 
Unix sockets N has been too small.)

When the system *is* heavily loaded, the first N attempts would be 
queued, and then the OS would automatically reject the rest.  This 
is better than accept()ing any number of attempts and then refusing 
to authenticate.  The N half-open connections in the queue would be 
picked up by postmaster as existing back ends drop off, or time out 
and give up if that happens too slowly.  
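
To make that concrete, here is a rough sketch of the idea with made-up names
(server_loop, handle_client) -- not the actual postmaster code.  The point is
only that, while at the limit, the parent leaves connection attempts sitting
in the kernel's queue rather than accept()ing them and then turning them away.

/* Sketch of the proposed behavior, hypothetical names throughout. */
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

extern void handle_client(int sock);   /* hypothetical child-side entry point */

void
server_loop(int listen_sock, int max_children)
{
    int current_children = 0;

    for (;;)
    {
        /* reap any children that have exited */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            current_children--;

        if (current_children >= max_children)
        {
            sleep(1);                 /* at the limit: leave attempts queued */
            continue;
        }

        int conn = accept(listen_sock, NULL, NULL);

        if (conn < 0)
            continue;

        if (fork() == 0)
        {
            handle_client(conn);      /* child handles the connection */
            _exit(0);
        }
        current_children++;
        close(conn);                  /* parent keeps only the listen socket */
    }
}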

> I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no
> use in accepting more than your total allowed connections concurrently.

That might not have the effect you imagine when many short-lived
connections are being made.  In some cases it would mean rejecting clients
that could have been served after a very short delay.

Nathan Myers
ncm@zembu.com


Re: SOMAXCONN (was Re: Solaris source code)

From
ncm@zembu.com (Nathan Myers)
Date:
On Fri, Jul 13, 2001 at 07:53:02AM -0400, mlw wrote:
> Zeugswetter Andreas SB wrote:
> > I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in
> > accepting more than your total allowed connections concurrently.
> 
> I have been following this thread and I am confused why the queue
> argument to listen() has anything to do with Max backends. All the
> parameter to listen does is specify how long a list of sockets open
> and waiting for connection can be. It has nothing to do with the
> number of back end sockets which are open.

Correct.

> If you have a limit of 128 back end connections, and you have 127
> of them open, a listen with queue size of 128 will still allow 128
> sockets to wait for connection before turning others away.

Correct.

> It should be a parameter based on the time out of a socket connection
> vs the ability to answer connection requests within that period of
> time.

It's not really meaningful at all, at present.

> There are two ways to think about this. Either you make this parameter
> tunable to give a proper estimate of the usability of the system, i.e.
> tailor the listen queue parameter to reject sockets when some number
> of sockets are waiting, or you say no one should ever be denied,
> accept everyone and let them time out if we are not fast enough.
>
> This debate could go on, why not make it a parameter in the config
> file that defaults to some system variable, i.e. SOMAXCONN.

With postmaster's current behavior there is no benefit in setting
the listen() argument to anything less than 1000.  With a small
change in postmaster behavior, a tunable system variable becomes
useful.

But using SOMAXCONN blindly is always wrong; that is often 5, which
is demonstrably too small.

> BTW: on linux, the backlog queue parameter is silently truncated to
> 128 anyway.

The 128 limit is common, applied on BSD and Solaris as well.
It will probably increase in future releases.

Nathan Myers
ncm@zembu.com


Re: SOMAXCONN (was Re: Solaris source code)

From
mlw
Date:
Nathan Myers wrote:
> > There are two ways to think about this. Either you make this parameter
> > tunable to give a proper estimate of the usability of the system, i.e.
> > tailor the listen queue parameter to reject sockets when some number
> > of sockets are waiting, or you say no one should ever be denied,
> > accept everyone and let them time out if we are not fast enough.
> >
> > This debate could go on, why not make it a parameter in the config
> > file that defaults to some system variable, i.e. SOMAXCONN.
> 
> With postmaster's current behavior there is no benefit in setting
> the listen() argument to anything less than 1000.  With a small
> change in postmaster behavior, a tunable system variable becomes
> useful.
> 
> But using SOMAXCONN blindly is always wrong; that is often 5, which
> is demonstrably too small.

It is rumored that many BSD versions are limited to 5.
> 
> > BTW: on linux, the backlog queue parameter is silently truncated to
> > 128 anyway.
> 
> The 128 limit is common, applied on BSD and Solaris as well.
> It will probably increase in future releases.

The point I am trying to make is that the parameter passed to listen() is OS
dependent, both in what it means and in its defaults. Trying to tie this to
MaxBackends is not the right thought process. It has nothing at all to do
with MaxBackends.

Passing listen(5) would probably be sufficient for Postgres. Will there ever be
5 sockets in the listen() queue prior to accept()? Probably not.  SOMAXCONN
is a system limit; a listen() value greater than this is probably
silently adjusted down to the defined SOMAXCONN.

Making it a parameter that defaults to SOMAXCONN allows the maximum number of
pending connections the system can handle, while still allowing the DBA to fine
tune connection behavior on heavily loaded systems.


Re: Re: SOMAXCONN (was Re: Solaris source code)

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> Nathan Myers wrote:
>> But using SOMAXCONN blindly is always wrong; that is often 5, which
>> is demonstrably too small.

> It is rumored that many BSD versions are limited to 5.

BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
to have such a small limit in the kernel.  The real step forward that
we have made in this discussion is to realize that we cannot trust
<sys/socket.h> to tell us what the kernel limit actually is.

> Passing listen(5) would probably be sufficient for Postgres.

It demonstrably is not sufficient.  Set it that way in pqcomm.c
and run the parallel regression tests.  Watch them fail.
        regards, tom lane


Re: SOMAXCONN (was Re: Solaris source code)

From
mlw
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > Nathan Myers wrote:
> >> But using SOMAXCONN blindly is always wrong; that is often 5, which
> >> is demonstrably too small.
> 
> > It is rumored that many BSD versions are limited to 5.
> 
> BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
> to have such a small limit in the kernel.  The real step forward that
> we have made in this discussion is to realize that we cannot trust
> <sys/socket.h> to tell us what the kernel limit actually is.
> 
> > Passing listen(5) would probably be sufficient for Postgres.
> 
> It demonstrably is not sufficient.  Set it that way in pqcomm.c
> and run the parallel regression tests.  Watch them fail.
>

That's interesting; I would not have guessed that. I have written a number of
server applications that can handle, literally, over a thousand
connections/operations a second, each with only a listen(5). (I do have it as a
configuration parameter, but have never seen a time when I had to change
it.)

I figured the closest one could come to an expert in all things socket related
would have to be the Apache web server source. They have a different take on
the listen() parameter:

>>>>> from httpd.h >>>>>>>>>>>

/* The maximum length of the queue of pending connections, as defined
 * by listen(2). Under some systems, it should be increased if you
 * are experiencing a heavy TCP SYN flood attack.
 *
 * It defaults to 511 instead of 512 because some systems store it
 * as an 8-bit datatype; 512 truncated to 8-bits is 0, while 511 is
 * 255 when truncated.
 */

#ifndef DEFAULT_LISTENBACKLOG
#define DEFAULT_LISTENBACKLOG 511
#endif

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

I have not found any other location in which DEFAULT_LISTENBACKLOG is defined,
but it is a configuration parameter, and here is what the Apache docs claim:

>>>>>>>>>>>> http://httpd.apache.org/docs/mod/core.html >>>>>>>>>>>>

ListenBacklog directive

Syntax: ListenBacklog backlog
Default: ListenBacklog 511
Context: server config
Status: Core
Compatibility: ListenBacklog is only available in Apache versions after 1.2.0. 

The maximum length of the queue of pending connections. Generally no tuning is
needed or desired, however on some systems it is desirable to increase this
when under a TCP SYN flood attack. See the backlog parameter to the listen(2)
system call. 

This will often be limited to a smaller number by the operating system. This
varies from OS to OS. Also note that many OSes do not use exactly what is
specified as the backlog, but use a number based on (but normally larger than)
what is set.
<<<<<<<<<<<<<<<<<<<<<<<

Anyway, why not just do what Apache does: set it to some extreme default
which, even when truncated, is still pretty big, and allow the end user
to change this value in postgresql.conf?


Re: SOMAXCONN (was Re: Solaris source code)

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> Tom Lane wrote:
>>> Passing listen(5) would probably be sufficient for Postgres.
>> 
>> It demonstrably is not sufficient.  Set it that way in pqcomm.c
>> and run the parallel regression tests.  Watch them fail.

> That's interesting; I would not have guessed that. I have written a number of
> server applications that can handle, literally, over a thousand
> connections/operations a second, each with only a listen(5).

The problem should be considerably reduced in latest sources, since
as of a week or three ago, the top postmaster process' outer loop is
basically just accept() and fork() --- client authentication is now
handled after the fork, instead of before.  Still, we now know that
(a) SOMAXCONN is a lie on many systems, and (b) values as small as 5
are pushing our luck, even though it might not fail so easily anymore.

The state of affairs in current sources is that the listen queue
parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN
is a constant defined in config.h --- it's 10000, hence a non-factor,
by default, but could be reduced if you have a kernel that doesn't
cope well with large listen-queue requests.  We probably won't know
if there are any such systems until we get some field experience with
the new code, but we could have "configure" select a platform-dependent
value if we find such problems.
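
In other words, roughly this (a paraphrase of the logic, not a copy of
pqcomm.c; the function name here is invented):

#ifndef PG_SOMAXCONN
#define PG_SOMAXCONN 10000      /* effectively unlimited by default; lower it
                                 * in config.h if your kernel misbehaves on
                                 * large backlog requests */
#endif

static int
compute_listen_backlog(int max_backends)
{
    int backlog = max_backends * 2;

    if (backlog > PG_SOMAXCONN)
        backlog = PG_SOMAXCONN;
    return backlog;             /* value handed to listen() */
}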

I believe that this is fine and doesn't need any further tweaking,
pending field experience.  What's still open for discussion is Nathan's
thought that the postmaster ought to stop issuing accept() calls once
it has so many children that it will refuse to fork any more.  I was
initially against that, but on further reflection I think it might be
a good idea after all, because of another recent change related to the
authenticate-after-fork change.  Since the top postmaster doesn't really
know which children have become working backends and which are still
engaged in authentication dialogs, it cannot enforce the MaxBackends
limit directly.  Instead, MaxBackends is checked when the child process
is done with authentication and is trying to join the PROC pool in
shared memory.  The postmaster will spawn up to 2 * MaxBackends child
processes before refusing to spawn more --- this allows there to be
up to MaxBackends children engaged in auth dialog but not yet working
backends.  (It's reasonable to allow extra children since some may fail
the auth dialog, or an extant backend may have quit by the time they
finish auth dialog.  Whether 2*MaxBackends is the best choice is
debatable, but that's what we're using at the moment.)
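
So, roughly, two separate checks (illustrative names only, not the real code):

#include <stdbool.h>

/* 1. Checked in the postmaster before it will fork another child: */
static bool
may_fork_child(int n_children, int max_backends)
{
    return n_children < 2 * max_backends;
}

/* 2. Checked by the child after authentication, when it tries to claim
 *    a PROC slot in shared memory; failing here is what produces the
 *    "too many clients" error the client actually sees: */
static bool
may_join_proc_pool(int n_active_backends, int max_backends)
{
    return n_active_backends < max_backends;
}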

Furthermore, we intend to install a pretty tight timeout on the overall
time spent in auth phase (a few seconds I imagine, although we haven't
yet discussed that number either).

Given this setup, if the postmaster has reached its max-children limit
then it can be quite certain that at least some of those children will
quit within approximately the auth timeout interval.  Therefore, not
accept()ing is a state that will probably *not* persist for long enough
to cause the new clients to time out.  By not accept()ing at a time when
we wouldn't fork, we can convert the behavior clients see at peak load
from quick rejection into a short delay before authentication dialog.

Of course, if you are at MaxBackends working backends, then the new
client is still going to get a "too many clients" error; all we have
accomplished with the change is to expend a fork() and an authentication
cycle before issuing the error.  So if the intent is to reduce overall
system load then this isn't necessarily an improvement.

IIRC, the rationale for using 2*MaxBackends as the maximum child count
was to make it unlikely that the postmaster would refuse to fork; given
a short auth timeout it's unlikely that as many as MaxBackends clients
will be engaged in auth dialog at any instant.  So unless we tighten
that max child count considerably, holding off accept() at max child
count is unlikely to change the behavior under any but worst-case
scenarios anyway.  And in a worst-case scenario, shedding load by
rejecting connections quickly is probably just what you want to do.

So, having thought that through, I'm still of the opinion that holding
off accept is of little or no benefit to us.  But it's not as simple
as it looks at first glance.  Anyone have a different take on what the
behavior is likely to be?
        regards, tom lane


Re: Re: SOMAXCONN (was Re: Solaris source code)

From
ncm@zembu.com (Nathan Myers)
Date:
On Sat, Jul 14, 2001 at 11:38:51AM -0400, Tom Lane wrote:
> 
> The state of affairs in current sources is that the listen queue
> parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN
> is a constant defined in config.h --- it's 10000, hence a non-factor,
> by default, but could be reduced if you have a kernel that doesn't
> cope well with large listen-queue requests.  We probably won't know
> if there are any such systems until we get some field experience with
> the new code, but we could have "configure" select a platform-dependent
> value if we find such problems.

Considering the Apache comment about some systems truncating instead
of limiting... 10000&0xff is 16.  Maybe 10239 would be a better choice, 
or 16383.  
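
A quick check of that arithmetic, assuming an OS that simply drops the
high-order bits when it stores the backlog in an 8-bit field:

#include <stdio.h>

int main(void)
{
    int candidates[] = { 10000, 10239, 16383, 511 };
    int i;

    for (i = 0; i < 4; i++)
        printf("%5d truncated to 8 bits -> %3d\n",
               candidates[i], candidates[i] & 0xff);
    return 0;
}

/* prints:
 * 10000 truncated to 8 bits ->  16
 * 10239 truncated to 8 bits -> 255
 * 16383 truncated to 8 bits -> 255
 *   511 truncated to 8 bits -> 255
 */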

> So, having thought that through, I'm still of the opinion that holding
> off accept is of little or no benefit to us.  But it's not as simple
> as it looks at first glance.  Anyone have a different take on what the
> behavior is likely to be?

After doing some more reading, I find that most OSes do not reject
connect requests that would exceed the specified backlog; instead,
they ignore the connection request and assume the client will retry 
later.  Therefore, it appears we cannot use a small backlog to shed load 
unless we assume that clients will time out quickly by themselves.

OTOH, maybe it's reasonable to assume that clients will time out,
and that in the normal case authentication happens quickly.

Then we can use a small listen() backlog, and never accept() if we
have more than MaxBackends back ends.  The OS will keep a small queue
corresponding to our small backlog, and the clients will do our load 
shedding for us.

Nathan Myers
ncm@zembu.com


Re: Re: SOMAXCONN (was Re: Solaris source code)

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> Considering the Apache comment about some systems truncating instead
> of limiting... 10000&0xff is 16.  Maybe 10239 would be a better choice, 
> or 16383.  

Hmm.  If the Apache comment is real, then that would not help on those
systems.  Remember that the actual listen request is going to be
2*MaxBackends in practically all cases.  The only thing that would save
you from getting an unexpectedly small backlog parameter in such a case
is to set PG_SOMAXCONN to 255.

Perhaps we should just do that and not worry about whether the Apache
info is accurate or not.  But I'd kind of like to see chapter and verse,
ie, at least one specific system that demonstrably fails to perform the
clamp-to-255 for itself, before we lobotomize the code that way.  ISTM a
conformant implementation of listen() would limit the given value to 255
before storing it into an 8-bit field, not just lose high order bits.


> After doing some more reading, I find that most OSes do not reject
> connect requests that would exceed the specified backlog; instead,
> they ignore the connection request and assume the client will retry 
> later.  Therefore, it appears cannot use a small backlog to shed load 
> unless we assume that clients will time out quickly by themselves.

Hm.  newgate is a machine on my local net that's not currently up.

$ time psql -h newgate postgres
psql: could not connect to server: Connection timed out
        Is the server running on host newgate and accepting
        TCP/IP connections on port 5432?

real    1m13.33s
user    0m0.02s
sys     0m0.01s
$

That's on HPUX 10.20.  On an old Linux distro, the same timeout
seems to be about 21 seconds, which is still pretty long by some
standards.  Do the TCP specs recommend anything particular about
no-response-to-SYN timeouts?
        regards, tom lane