Thread: AW: Re: SOMAXCONN (was Re: Solaris source code)
> When the system is too heavily loaded (however measured), any further
> login attempts will fail.  What I suggested is, instead of the
> postmaster accept()ing the connection, why not leave the connection
> attempt in the queue until we can afford a back end to handle it?

Because the clients would time out ?

> Then, the argument to listen() will determine how many attempts can
> be in the queue before the network stack itself rejects them without
> the postmaster involved.

You cannot change the argument to listen() at runtime, or are you
suggesting to close and reopen the socket when maxbackends is reached ?
I think that would be nonsense.

I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no
use in accepting more than your total allowed connections concurrently.

Andreas
Zeugswetter Andreas SB wrote:
> > When the system is too heavily loaded (however measured), any further
> > login attempts will fail.  What I suggested is, instead of the
> > postmaster accept()ing the connection, why not leave the connection
> > attempt in the queue until we can afford a back end to handle it?
>
> Because the clients would time out ?
>
> > Then, the argument to listen() will determine how many attempts can
> > be in the queue before the network stack itself rejects them without
> > the postmaster involved.
>
> You cannot change the argument to listen() at runtime, or are you
> suggesting to close and reopen the socket when maxbackends is reached ?
> I think that would be nonsense.
>
> I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no
> use in accepting more than your total allowed connections concurrently.
>
> Andreas

I have been following this thread, and I am confused about why the queue
argument to listen() has anything to do with MaxBackends. All the
parameter to listen() does is specify how long the list of sockets open
and waiting for connection can be. It has nothing to do with the number
of back-end sockets which are open.

If you have a limit of 128 back-end connections, and you have 127 of
them open, a listen with a queue size of 128 will still allow 128
sockets to wait for connection before turning others away.

It should be a parameter based on the timeout of a socket connection
vs. the ability to answer connection requests within that period of
time. There are two ways to think about this. Either you make this
parameter tunable to give a proper estimate of the usability of the
system, i.e. tailor the listen queue parameter to reject sockets when
some number of sockets are waiting, or you say no one should ever be
denied, accept everyone, and let them time out if we are not fast
enough.

This debate could go on; why not make it a parameter in the config file
that defaults to some system variable, i.e. SOMAXCONN.
BTW: on Linux, the backlog queue parameter is silently truncated to 128 anyway.
On Fri, Jul 13, 2001 at 10:36:13AM +0200, Zeugswetter Andreas SB wrote:
> > When the system is too heavily loaded (however measured), any further
> > login attempts will fail.  What I suggested is, instead of the
> > postmaster accept()ing the connection, why not leave the connection
> > attempt in the queue until we can afford a back end to handle it?
>
> Because the clients would time out ?

It takes a long time for half-open connections to time out, by default.
Probably most clients would time out, themselves, first, if PG took too
long to get to them. That would be a Good Thing. Once the SOMAXCONN
threshold is reached (which would only happen when the system is very
heavily loaded, because when it's not then nothing stays in the queue
for long), new connection attempts would fail immediately, another Good
Thing. When the system is very heavily loaded, we don't want to spare
attention for clients we can't serve.

> > Then, the argument to listen() will determine how many attempts can
> > be in the queue before the network stack itself rejects them without
> > the postmaster involved.
>
> You cannot change the argument to listen() at runtime, or are you
> suggesting to close and reopen the socket when maxbackends is reached ?
> I think that would be nonsense.

Of course that would not work, and indeed nobody suggested it. If
postmaster behaved a little differently, not accept()ing when the system
is too heavily loaded, then it would be reasonable to call listen()
(once!) with PG_SOMAXCONN set to (e.g.) N=20.

Where the system is not too heavily loaded, the postmaster accept()s the
connection attempts from the queue very quickly, and the number of
half-open connections never builds up to N. (This is how PG has been
running already, under light load -- except that on Solaris with Unix
sockets N has been too small.) When the system *is* heavily loaded, the
first N attempts would be queued, and then the OS would automatically
reject the rest.
This is better than accept()ing any number of attempts and then refusing
to authenticate. The N half-open connections in the queue would be
picked up by postmaster as existing back ends drop off, or time out and
give up if that happens too slowly.

> I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no
> use in accepting more than your total allowed connections concurrently.

That might not have the effect you imagine, where many short-lived
connections are being made. In some cases it would mean that clients are
rejected that could have been served after a very short delay.

Nathan Myers
ncm@zembu.com
On Fri, Jul 13, 2001 at 07:53:02AM -0400, mlw wrote:
> Zeugswetter Andreas SB wrote:
> > I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is
> > no use in accepting more than your total allowed connections
> > concurrently.
>
> I have been following this thread and I am confused why the queue
> argument to listen() has anything to do with Max backends. All the
> parameter to listen does is specify how long a list of sockets open
> and waiting for connection can be. It has nothing to do with the
> number of back end sockets which are open.

Correct.

> If you have a limit of 128 back end connections, and you have 127
> of them open, a listen with queue size of 128 will still allow 128
> sockets to wait for connection before turning others away.

Correct.

> It should be a parameter based on the time out of a socket connection
> vs the ability to answer connection requests within that period of
> time.

It's not really meaningful at all, at present.

> There are two ways to think about this. Either you make this parameter
> tunable to give a proper estimate of the usability of the system, i.e.
> tailor the listen queue parameter to reject sockets when some number
> of sockets are waiting, or you say no one should ever be denied,
> accept everyone and let them time out if we are not fast enough.
>
> This debate could go on, why not make it a parameter in the config
> file that defaults to some system variable, i.e. SOMAXCONN.

With postmaster's current behavior there is no benefit in setting the
listen() argument to anything less than 1000. With a small change in
postmaster behavior, a tunable system variable becomes useful.

But using SOMAXCONN blindly is always wrong; that is often 5, which is
demonstrably too small.

> BTW: on linux, the backlog queue parameter is silently truncated to
> 128 anyway.

The 128 limit is common, applied on BSD and Solaris as well. It will
probably increase in future releases.

Nathan Myers
ncm@zembu.com
Nathan Myers wrote:
> > There are two ways to think about this. Either you make this
> > parameter tunable to give a proper estimate of the usability of the
> > system, i.e. tailor the listen queue parameter to reject sockets
> > when some number of sockets are waiting, or you say no one should
> > ever be denied, accept everyone and let them time out if we are not
> > fast enough.
> >
> > This debate could go on, why not make it a parameter in the config
> > file that defaults to some system variable, i.e. SOMAXCONN.
>
> With postmaster's current behavior there is no benefit in setting
> the listen() argument to anything less than 1000. With a small
> change in postmaster behavior, a tunable system variable becomes
> useful.
>
> But using SOMAXCONN blindly is always wrong; that is often 5, which
> is demonstrably too small.

It is rumored that many BSD versions are limited to 5.

> > BTW: on linux, the backlog queue parameter is silently truncated to
> > 128 anyway.
>
> The 128 limit is common, applied on BSD and Solaris as well.
> It will probably increase in future releases.

The point I am trying to make is that the parameter passed to listen()
is OS dependent, in both what it means and its defaults. Trying to tie
this to maxbackends is not the right thought process. It has nothing to
do, at all, with maxbackends.

Passing listen(5) would probably be sufficient for Postgres. Will there
ever be 5 sockets in the listen() queue prior to accept()? Probably not.
SOMAXCONN is a system limit; setting a listen() value greater than this
is probably silently adjusted down to the defined SOMAXCONN.

By making it a parameter, and defaulting to SOMAXCONN, this allows the
maximum number of connections a system can handle, while still allowing
the DBA to fine-tune connection behavior on high-load systems.
mlw <markw@mohawksoft.com> writes:
> Nathan Myers wrote:
>> But using SOMAXCONN blindly is always wrong; that is often 5, which
>> is demonstrably too small.

> It is rumored that many BSD versions are limited to 5.

BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
to have such a small limit in the kernel. The real step forward that we
have made in this discussion is to realize that we cannot trust
<sys/socket.h> to tell us what the kernel limit actually is.

> Passing listen(5) would probably be sufficient for Postgres.

It demonstrably is not sufficient. Set it that way in pqcomm.c and run
the parallel regression tests. Watch them fail.

			regards, tom lane
Tom Lane wrote:
> mlw <markw@mohawksoft.com> writes:
> > Nathan Myers wrote:
> >> But using SOMAXCONN blindly is always wrong; that is often 5, which
> >> is demonstrably too small.
>
> > It is rumored that many BSD versions are limited to 5.
>
> BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
> to have such a small limit in the kernel. The real step forward that
> we have made in this discussion is to realize that we cannot trust
> <sys/socket.h> to tell us what the kernel limit actually is.
>
> > Passing listen(5) would probably be sufficient for Postgres.
>
> It demonstrably is not sufficient. Set it that way in pqcomm.c
> and run the parallel regression tests. Watch them fail.

That's interesting; I would not have guessed that. I have written a
number of server applications which can handle, literally, over a
thousand connection/operations a second, with only a listen(5). (I do
have it as a configuration parameter, but have never seen a time when I
had to change it.)

I figured the closest one could come to an expert in all things socket
related would have to be the Apache web server source. They have a
different take on the listen() parameter:

>>>>> from httpd.h >>>>>>>>>>>
/* The maximum length of the queue of pending connections, as defined
 * by listen(2).  Under some systems, it should be increased if you
 * are experiencing a heavy TCP SYN flood attack.
 *
 * It defaults to 511 instead of 512 because some systems store it
 * as an 8-bit datatype; 512 truncated to 8-bits is 0, while 511 is
 * 255 when truncated.
 */

#ifndef DEFAULT_LISTENBACKLOG
#define DEFAULT_LISTENBACKLOG 511
#endif
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

I have not found any other location in which DEFAULT_LISTENBACKLOG is
defined, but it is a configuration parameter, and here is what the
Apache docs claim:

>>>>>>>>>>>> http://httpd.apache.org/docs/mod/core.html >>>>>>>>>>>>
ListenBacklog directive

Syntax: ListenBacklog backlog
Default: ListenBacklog 511
Context: server config
Status: Core
Compatibility: ListenBacklog is only available in Apache versions
after 1.2.0.

The maximum length of the queue of pending connections. Generally no
tuning is needed or desired, however on some systems it is desirable
to increase this when under a TCP SYN flood attack. See the backlog
parameter to the listen(2) system call.

This will often be limited to a smaller number by the operating system.
This varies from OS to OS. Also note that many OSes do not use exactly
what is specified as the backlog, but use a number based on (but
normally larger than) what is set.
<<<<<<<<<<<<<<<<<<<<<<<

Anyway, why not just do what Apache does: set it to some extreme
default, which even when truncated is still pretty big, and allow the
end user to change this value in postgresql.conf.
mlw <markw@mohawksoft.com> writes:
> Tom Lane wrote:
>>> Passing listen(5) would probably be sufficient for Postgres.
>>
>> It demonstrably is not sufficient.  Set it that way in pqcomm.c
>> and run the parallel regression tests.  Watch them fail.

> That's interesting; I would not have guessed that. I have written a
> number of server applications which can handle, literally, over a
> thousand connection/operations a second, with only a listen(5).

The problem should be considerably reduced in latest sources, since as
of a week or three ago, the top postmaster process' outer loop is
basically just accept() and fork() --- client authentication is now
handled after the fork, instead of before. Still, we now know that
(a) SOMAXCONN is a lie on many systems, and (b) values as small as 5
are pushing our luck, even though it might not fail so easily anymore.

The state of affairs in current sources is that the listen queue
parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN is
a constant defined in config.h --- it's 10000, hence a non-factor, by
default, but could be reduced if you have a kernel that doesn't cope
well with large listen-queue requests. We probably won't know if there
are any such systems until we get some field experience with the new
code, but we could have "configure" select a platform-dependent value
if we find such problems. I believe that this is fine and doesn't need
any further tweaking, pending field experience.

What's still open for discussion is Nathan's thought that the
postmaster ought to stop issuing accept() calls once it has so many
children that it will refuse to fork any more. I was initially against
that, but on further reflection I think it might be a good idea after
all, because of another recent change related to the
authenticate-after-fork change.

Since the top postmaster doesn't really know which children have become
working backends and which are still engaged in authentication dialogs,
it cannot enforce the MaxBackends limit directly. Instead, MaxBackends
is checked when the child process is done with authentication and is
trying to join the PROC pool in shared memory. The postmaster will
spawn up to 2 * MaxBackends child processes before refusing to spawn
more --- this allows there to be up to MaxBackends children engaged in
auth dialog but not yet working backends. (It's reasonable to allow
extra children since some may fail the auth dialog, or an extant
backend may have quit by the time they finish auth dialog. Whether
2*MaxBackends is the best choice is debatable, but that's what we're
using at the moment.)

Furthermore, we intend to install a pretty tight timeout on the overall
time spent in the auth phase (a few seconds I imagine, although we
haven't yet discussed that number either). Given this setup, if the
postmaster has reached its max-children limit then it can be quite
certain that at least some of those children will quit within
approximately the auth timeout interval. Therefore, not accept()ing is
a state that will probably *not* persist for long enough to cause the
new clients to time out. By not accept()ing at a time when we wouldn't
fork, we can convert the behavior clients see at peak load from quick
rejection into a short delay before the authentication dialog.

Of course, if you are at MaxBackends working backends, then the new
client is still going to get a "too many clients" error; all we have
accomplished with the change is to expend a fork() and an
authentication cycle before issuing the error. So if the intent is to
reduce overall system load then this isn't necessarily an improvement.

IIRC, the rationale for using 2*MaxBackends as the maximum child count
was to make it unlikely that the postmaster would refuse to fork; given
a short auth timeout it's unlikely that as many as MaxBackends clients
will be engaged in auth dialog at any instant. So unless we tighten
that max child count considerably, holding off accept() at max child
count is unlikely to change the behavior under any but worst-case
scenarios anyway. And in a worst-case scenario, shedding load by
rejecting connections quickly is probably just what you want to do.

So, having thought that through, I'm still of the opinion that holding
off accept is of little or no benefit to us. But it's not as simple as
it looks at first glance. Anyone have a different take on what the
behavior is likely to be?

			regards, tom lane
On Sat, Jul 14, 2001 at 11:38:51AM -0400, Tom Lane wrote:
> The state of affairs in current sources is that the listen queue
> parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN
> is a constant defined in config.h --- it's 10000, hence a non-factor,
> by default, but could be reduced if you have a kernel that doesn't
> cope well with large listen-queue requests. We probably won't know
> if there are any such systems until we get some field experience with
> the new code, but we could have "configure" select a
> platform-dependent value if we find such problems.

Considering the Apache comment about some systems truncating instead of
limiting... 10000 & 0xff is 16. Maybe 10239 would be a better choice,
or 16383.

> So, having thought that through, I'm still of the opinion that holding
> off accept is of little or no benefit to us. But it's not as simple
> as it looks at first glance. Anyone have a different take on what the
> behavior is likely to be?

After doing some more reading, I find that most OSes do not reject
connect requests that would exceed the specified backlog; instead, they
ignore the connection request and assume the client will retry later.
Therefore, it appears we cannot use a small backlog to shed load unless
we assume that clients will time out quickly by themselves.

OTOH, maybe it's reasonable to assume that clients will time out, and
that in the normal case authentication happens quickly. Then we can use
a small listen() backlog, and never accept() if we have more than
MaxBackends back ends. The OS will keep a small queue corresponding to
our small backlog, and the clients will do our load shedding for us.

Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes:
> Considering the Apache comment about some systems truncating instead
> of limiting... 10000 & 0xff is 16. Maybe 10239 would be a better
> choice, or 16383.

Hmm. If the Apache comment is real, then that would not help on those
systems. Remember that the actual listen request is going to be
2*MaxBackends in practically all cases. The only thing that would save
you from getting an unexpectedly small backlog parameter in such a case
is to set PG_SOMAXCONN to 255. Perhaps we should just do that and not
worry about whether the Apache info is accurate or not.

But I'd kind of like to see chapter and verse, ie, at least one
specific system that demonstrably fails to perform the clamp-to-255 for
itself, before we lobotomize the code that way. ISTM a conformant
implementation of listen() would limit the given value to 255 before
storing it into an 8-bit field, not just lose high-order bits.

> After doing some more reading, I find that most OSes do not reject
> connect requests that would exceed the specified backlog; instead,
> they ignore the connection request and assume the client will retry
> later. Therefore, it appears we cannot use a small backlog to shed
> load unless we assume that clients will time out quickly by
> themselves.

Hm. newgate is a machine on my local net that's not currently up.

$ time psql -h newgate postgres
psql: could not connect to server: Connection timed out
	Is the server running on host newgate and accepting
	TCP/IP connections on port 5432?

real    1m13.33s
user    0m0.02s
sys     0m0.01s
$

That's on HPUX 10.20. On an old Linux distro, the same timeout seems to
be about 21 seconds, which is still pretty long by some standards. Do
the TCP specs recommend anything particular about no-response-to-SYN
timeouts?

			regards, tom lane