Thread: Cygwin PostgreSQL Regression Test Problems (Revisited)
On Mon, Jan 15, 2001 at 11:37:55PM -0500, Jason Tishler wrote:
> 2. I am unable to successfully run the regression tests on a NT 4.0 SP5
> machine with only 64 MB of physical memory and about 175 MB of swap space.
> Other than lacking RAM and swap space, this machine is the "same" as other
> NT/2000 machines which can successfully run the regression tests.
>
> The tests usually hang during the "parallel group (18 tests)" test
> right after numerology. By "hang," I mean that the original postmaster
> is still running, but there are no postmaster children, and there are
> some number of psql processes hanging around. Using NT's TaskManager,
> I can see that the machine is running out of memory. I have even seen
> the "Windows is running low on virtual memory" dialog a few times.
> Should I expect this behavior from such a lame machine?

I previously reported the above problem with the parallel version of the regression test (i.e., make check) on a machine with limited memory. Unfortunately, I am seeing similar problems on a machine with 192 MB of physical memory and about 208 MB of swap space. So, now I feel that my initial conclusion that limited memory was the root cause is erroneous.

My current WAG is that there is a race condition in Cygwin that is causing some back-end postgres processes to abort. This in turn causes the associated front-end psql processes to hang, which in turn causes the regression test to hang.

What is the best way to "catch" this problem? What is the best set of options to pass to postmaster that will in turn be passed to the back-end postgres processes to hopefully shed some light on this situation? Can I get the individual back-end postgres processes to log to separate files? There is so much going on during a parallel regression test that it's hard to figure out what is really happening.

Any help would be greatly appreciated.
Thanks, Jason -- Jason Tishler Director, Software Engineering Phone: +1 (732) 264-8770 x235 Dot Hill Systems Corp. Fax: +1 (732) 264-8798 82 Bethany Road, Suite 7 Email: Jason.Tishler@dothill.com Hazlet, NJ 07730 USA WWW: http://www.dothill.com
Jason Tishler <Jason.Tishler@dothill.com> writes: > I previously reported the above problem with the parallel version of > the regression test (i.e., make check) on a machine with limited memory. > Unfortunately, I am seeing similar problems on a machine with 192 MB of > physical memory and about 208 MB of swap space. So, now I feel that my > initial conclusion that limited memory was the root cause is erroneous. Not necessarily. 18 parallel tests imply 54 concurrent processes (a shell, a psql, and a backend per test). Depending on whether Windoze is any good about sharing sharable pages across processes, it's not hard at all to believe that each process might chew up a few meg of memory and/or swap. You don't have a whole lot of headroom there if so. Try modifying the parallel_schedule file to break the largest set of concurrent tests down into two sets of nine tests. Considering that we've seen people run into maxuprc limits on some Unix versions, I wonder whether we ought to just do that across-the-board. > What is the best way to "catch" this problem? What are the best set of > options to pass to postmaster that will be in turn passed to the back-end > postgres processes to hopefully shed some light on this situation? I'd use -d1 which should be enough to see backends starting and exiting. Any more will clutter the log with individual queries, which probably would be more detail than you really want... regards, tom lane
Tom,

On Wed, Mar 28, 2001 at 01:57:33PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I previously reported the above problem with the parallel version of
> > the regression test (i.e., make check) on a machine with limited memory.
> > Unfortunately, I am seeing similar problems on a machine with 192 MB of
> > physical memory and about 208 MB of swap space. So, now I feel that my
> > initial conclusion that limited memory was the root cause is erroneous.
>
> Not necessarily. 18 parallel tests imply 54 concurrent processes
> (a shell, a psql, and a backend per test). Depending on whether Windoze
> is any good about sharing sharable pages across processes, it's not hard
> at all to believe that each process might chew up a few meg of memory
> and/or swap. You don't have a whole lot of headroom there if so.

I just increased the swap space (i.e., pagefile.sys) to 384 MB and I still get hangs. Watching memory usage via the NT Task Manager, Windows tells me that the memory usage during the regression test is <= 80 MB, which is significantly less than my physical memory.

I wonder if I'm running up against some Cygwin limitation. On the cygwin-developers list, there was a recent discussion that indicated that a Cygwin process can only have a maximum of 64 children. Maybe there is a limit like that which is causing backends to abort?

> Try modifying the parallel_schedule file to break the largest set of
> concurrent tests down into two sets of nine tests.

I'm sure that will work (at least most of the time) since I only get one or two psql processes to hang for any given run. But "fixing" the problem this way just doesn't feel right to me.

> Considering that we've seen people run into maxuprc limits on some Unix
> versions, I wonder whether we ought to just do that across-the-board.

Of course, this solution is much better. :,)

> > What is the best way to "catch" this problem? What are the best set of
> > options to pass to postmaster that will be in turn passed to the back-end
> > postgres processes to hopefully shed some light on this situation?
>
> I'd use -d1 which should be enough to see backends starting and exiting.
> Any more will clutter the log with individual queries, which probably
> would be more detail than you really want...

I've done the above and it seems to indicate that all backends exited with a status of 0. So, I still don't know why some backends "aborted."

Any other suggestions? For example, is there some way to specify an individual log file for each backend?

Thanks,
Jason
Jason Tishler <Jason.Tishler@dothill.com> writes: > I've done the above and it seems to indicate that all backends exited > with a status of 0. So, I still don't know why some backends "aborted." Hm. So what exactly is the failure mode? Do the psql processes report any errors? Have they produced (any/all of) the expected output? regards, tom lane
Tom,

On Wed, Mar 28, 2001 at 04:40:30PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I've done the above and it seems to indicate that all backends exited
> > with a status of 0. So, I still don't know why some backends "aborted."
>
> Hm. So what exactly is the failure mode? Do the psql processes report
> any errors? Have they produced (any/all of) the expected output?

The failure mode is always something like the following:

The regression test proceeds normally until one of the larger parallel groups is running. Then it will hang after output such as:

    parallel group (18 tests): point lseg box path circle date polygon time abstime inet interval reltime type_sanity oidjoins opr_sanity timestamp...

If I do a ps, I will see the postmaster process and one or more psql processes. The corresponding postgres processes are no longer running. (Were they ever running?) The NT Task Manager shows essentially 100% idle.

I usually kill the psql processes with the following command:

    kill $(ps | fgrep psql | awk '{print $1}')

Then the regression test will continue with output like the following:

    ...Signal 15
    Signal 15
    comments tinterval
    point ... ok
    lseg ... ok
    box ... ok
    path ... ok
    polygon ... ok
    circle ... ok
    date ... ok
    time ... ok
    timestamp ... ok
    interval ... ok
    abstime ... ok
    reltime ... ok
    tinterval ... FAILED
    inet ... ok
    comments ... FAILED
    oidjoins ... ok
    type_sanity ... ok
    opr_sanity ... ok
    test geometry ... ok
    ..

I believe that the "failures" above correspond to the psql processes that I killed.

Sometimes the regression test will run to completion without any more hangs. Sometimes it will hang at one or more large parallel groups. If I continue to kill the psql processes as above, the regression test will eventually complete (with more "failures").

I tried another experiment: killing a postgres backend to see if the psql process notices the backend dying. It does, but I was only able to kill the postgres backend with kill -9. Otherwise, postgres ignored the signal. So, I don't know if my experiment was valid. If a backend exits normally while a psql is connected, will the psql process notice this event?

Any other suggestions? Or should I just run the serial_schedule and stop banging my head?

Thanks,
Jason
Jason Tishler <Jason.Tishler@dothill.com> writes:
> Then the regression test will continue with output like the following:
> ...Signal 15
> Signal 15
> comments tinterval
> point ... ok
> lseg ... ok
> box ... ok
> path ... ok
> polygon ... ok
> circle ... ok
> date ... ok
> time ... ok
> timestamp ... ok
> interval ... ok
> abstime ... ok
> reltime ... ok
> tinterval ... FAILED
> inet ... ok
> comments ... FAILED
> oidjoins ... ok
> type_sanity ... ok
> opr_sanity ... ok
> test geometry ... ok
> ..

This doesn't tell us much. What shows up in the output files of the failed tests --- what are the *diffs*, not just the summary display?

regards, tom lane
Tom Lane wrote:
>
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > Then the regression test will continue with output like the following:
>
> > ...Signal 15
> > Signal 15
> > comments tinterval
> > point ... ok
> > lseg ... ok
> > box ... ok
> > path ... ok
> > polygon ... ok
> > circle ... ok
> > date ... ok
> > time ... ok
> > timestamp ... ok
> > interval ... ok
> > abstime ... ok
> > reltime ... ok
> > tinterval ... FAILED
> > inet ... ok
> > comments ... FAILED
> > oidjoins ... ok
> > type_sanity ... ok
> > opr_sanity ... ok
> > test geometry ... ok
> > ..
>
> This doesn't tell us much. What shows up in the output files of the
> failed tests --- what are the *diffs*, not just the summary display?
>

Hmmm, the *diffs* provide little information. psql hangs in PQsetdbLogin(), at the select() in the first pqWait() call in connectDBComplete().

regards,
Hiroshi Inoue
Jason,

On Wed, 28 Mar 2001 13:36:45 -0500 Jason Tishler <Jason.Tishler@dothill.com> wrote:
> On Mon, Jan 15, 2001 at 11:37:55PM -0500, Jason Tishler wrote:
> > The tests usually hang during the "parallel group (18 tests)" test
> > right after numerology. By "hang," I mean that the original postmaster
> > is still running, but there are no postmaster children, and there are
> > some number of psql processes hanging around. Using NT's TaskManager,
> > I can see that the machine is running out of memory. I have even seen
> > the "Windows is running low on virtual memory" dialog a few times.
> > Should I expect this behavior from such a lame machine?
> I previously reported the above problem with the parallel version of
> the regression test (i.e., make check) on a machine with limited memory.
> Unfortunately, I am seeing similar problems on a machine with 192 MB of
> physical memory and about 208 MB of swap space. So, now I feel that my
> initial conclusion that limited memory was the root cause is erroneous.

I can't reproduce it. The parallel regression test works perfectly and returns "All 76 tests passed." There's no hang.

Environment:
    PIII-550, 256MB RAM PC
    NT4.0 + SP6
    PostgreSQL 7.1Beta6
    cygipc 1.08 + my 2 patches
    Cygwin1.dll 010215 snapshot

--
Yutaka tanida <yutaka@hi-net.zaq.ne.jp>
Tom,

On Wed, Mar 28, 2001 at 06:06:22PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> So no queries get executed at all before the backend exits. Given that
> the backend seems to be exiting normally, one would suppose that the
> backend thinks it is seeing an EOF from the client. Is there anything
> about "unexpected EOF on client connection" in the postmaster log?

I grep-ed for EOF in postmaster.log but came up empty. Did I need to run with debugging turned on to see this error message? I was running *without* debugging turned on.

> Another possibility is that the failing psqls are never managing to
> connect in the first place. Can you attach to one of the stuck psqls
> with gdb and get a backtrace to see where it is?

I get the following backtrace for one of the hung psql processes:

    (gdb) bt
    #0 0x77f682cb in ?? ()
    #1 0x77f1cd76 in ?? ()
    #2 0x6103deee in _size_of_stack_reserve__ ()
    #3 0x6103d84e in _size_of_stack_reserve__ ()
    #4 0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258)
       at fe-misc.c:738
    #5 0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
    #6 0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0,
       pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0)
       at fe-connect.c:524
    #7 0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178

On Thu, Mar 29, 2001 at 03:20:59PM +0900, Hiroshi Inoue wrote:
> psql hangs at PQsetdbLogin()(select() in the
> first pqWait() in connectDBComplete()).

Note that my hang seems identical to the one reported by Hiroshi Inoue.

Thanks,
Jason
Jason Tishler <Jason.Tishler@dothill.com> writes: > I get the following backtrace for one of the hung psql processes: > (gdb) bt > #0 0x77f682cb in ?? () > #1 0x77f1cd76 in ?? () > #2 0x6103deee in _size_of_stack_reserve__ () > #3 0x6103d84e in _size_of_stack_reserve__ () > #4 0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258) > at fe-misc.c:738 > #5 0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103 > #6 0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0, > pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0) > at fe-connect.c:524 > #7 0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178 It would be helpful to see the contents of the conn object ("f 5" then "p *conn" should do it). If Hiroshi is correct that this is the *first* call to pqWait in connectDBComplete, then I think we are looking at a kernel bug (or more likely a cygwin bug). psql has opened a TCP connection socket and is now waiting for the socket to show as write-ready before it will send a connection request. If select() never reports the socket as write-ready, you have a hang ... and it's not possible to blame the hang on anything else but the kernel. Both ends of the connection are on the same machine, so there's no network problem or anything like that. There is not anything else that we should need to do at the application level before we should be allowed to send data. regards, tom lane
Jason Tishler <Jason.Tishler@dothill.com> writes: > status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE, Oh-ho, that's interesting! If you look at fe-connect.c you'll see that CONNECTION_STARTED must indicate that connect() returned EINPROGRESS rather than a success indication. The socket is supposed to go write-ready when the connection is finished --- for example HPUX's connect man page sez [EINPROGRESS] Nonblocking I/O is enabled using O_NONBLOCK, O_NDELAY, or FIOSNBIO, and the connection cannot be completed immediately. This is not a failure. Make the connect() call again a few seconds later. Alternatively, wait for completion by calling select() and selecting for write. But, evidently, it never is coming ready for write. BTW, I note that we are trying to use Unix sockets here. Does the bug still appear if you force pg_regress to use TCP connections? regards, tom lane
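The nonblocking connect pattern described above can be sketched outside of PostgreSQL roughly as follows. This is only a minimal illustration, not libpq code: the function name nonblocking_connect and its return convention are made up for the example, and it assumes a POSIX sockets API and an IPv4 address. It starts a connect() on a nonblocking socket, and if EINPROGRESS comes back it waits in select() for the socket to become write-ready, then reads SO_ERROR to learn the outcome. The socket is also placed in the exception set, on the assumption that some socket layers may signal a failed nonblocking connect there instead of through write-readiness.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpq): try a nonblocking TCP connect
 * to ip:port. Returns 0 if the connection was established, a positive
 * errno value (e.g. ECONNREFUSED) if the connect failed, or -1 on a
 * local error or if nothing happened within timeout_sec seconds.
 */
int nonblocking_connect(const char *ip, unsigned short port, int timeout_sec)
{
    struct sockaddr_in addr;
    int fd, flags, result = 0;

    assert(ip != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch the socket to nonblocking mode before connecting. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        close(fd);
        return -1;
    }

    if (connect(fd, (struct sockaddr *) &addr, sizeof addr) < 0) {
        fd_set wfds, efds;
        struct timeval tv;
        socklen_t len = sizeof result;

        if (errno != EINPROGRESS) {
            /* Immediate failure: report the errno from connect(). */
            result = errno;
            close(fd);
            return result;
        }

        /* Wait for write-ready OR an exceptional condition. */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        FD_ZERO(&efds);
        FD_SET(fd, &efds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        if (select(fd + 1, NULL, &wfds, &efds, &tv) <= 0) {
            close(fd);          /* timeout or select() error */
            return -1;
        }

        /* SO_ERROR holds the final status of the pending connect. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &result, &len) < 0)
            result = errno;
    }

    close(fd);
    return result;
}
```

A test harness could call nonblocking_connect("127.0.0.1", port, 5) many times in parallel against a trivial listening server; a call that returns -1 on timeout while the server is listening would correspond to the hang seen here.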
Tom,

On Thu, Mar 29, 2001 at 10:43:49AM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > I get the following backtrace for one of the hung psql processes:
>
> > (gdb) bt
> > #0 0x77f682cb in ?? ()
> > #1 0x77f1cd76 in ?? ()
> > #2 0x6103deee in _size_of_stack_reserve__ ()
> > #3 0x6103d84e in _size_of_stack_reserve__ ()
> > #4 0x67989978 in pqWait (forRead=0, forWrite=1, conn=0xa010258)
> > at fe-misc.c:738
> > #5 0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
> > #6 0x67981fb1 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0,
> > pgtty=0x0, dbName=0x1a0260e8 "regression", login=0x0, pwd=0x0)
> > at fe-connect.c:524
> > #7 0x40e43f in main (argc=6, argv=0x1a021ad8) at startup.c:178
>
> It would be helpful to see the contents of the conn object ("f 5" then
> "p *conn" should do it).

I did as you suggested above and got the following:

    (gdb) f 5
    #5 0x6798287c in connectDBComplete (conn=0xa010258) at fe-connect.c:1103
    1103 if (pqWait(0, 1, conn))
    (gdb) p *conn
    $1 = {pghost = 0x0, pghostaddr = 0x0, pgport = 0xa016610 "65432",
    pgunixsocket = 0x0, pgtty = 0xa016620 "", pgoptions = 0xa016630 "",
    dbName = 0xa017170 "regression", pguser = 0xa017150 "jt",
    pgpass = 0xa017160 "", Pfdebug = 0x0,
    noticeHook = 0x67984e8c <defaultNoticeProcessor>, noticeArg = 0x0,
    status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE,
    notifyList = 0xa0103e0, sock = 3, laddr = {sa = {sa_family = 0,
    sa_data = '\000' <repeats 13 times>}, in = {sin_family = 0,
    sin_port = 0, sin_addr = {s_addr = 0},
    __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 0,
    sun_path = '\000' <repeats 107 times>}}, raddr = {sa = {sa_family = 1,
    sa_data = "/tmp/.s.PGSQL."}, in = {sin_family = 1, sin_port = 29743,
    sin_addr = {s_addr = 774860909}, __pad = "s.PGSQL."}, un = {
    sun_family = 1,
    sun_path = "/tmp/.s.PGSQL.65432", '\000' <repeats 88 times>}},
    raddr_len = 21, be_pid = 0, be_key = 0, salt = "\000", lobjfuncs = 0x0,
    inBuffer = 0xa0103f0 "", inBufSize = 16384, inStart = 0, inCursor = 0,
    inEnd = 0, nonblocking = 0, outBuffer = 0xa0143f8 "", outBufSize = 8192,
    outCount = 0, result = 0x0, curTuple = 0x0,
    setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, errorMessage = {
    data = 0xa016400 "", len = 0, maxlen = 256}, workBuffer = {
    data = 0xa016508 "", len = 0, maxlen = 256}, client_encoding = 0}

> If Hiroshi is correct that this is the *first* call to pqWait in
> connectDBComplete, then I think we are looking at a kernel bug (or more
> likely a cygwin bug). psql has opened a TCP connection socket and is
> now waiting for the socket to show as write-ready before it will send
> a connection request. If select() never reports the socket as
> write-ready, you have a hang ... and it's not possible to blame the hang
> on anything else but the kernel. Both ends of the connection are on the
> same machine, so there's no network problem or anything like that.
> There is not anything else that we should need to do at the application
> level before we should be allowed to send data.

Do the details reported above support your hypothesis? If so, can you assist me in formulating a minimal test case that I can take back to the Cygwin community?

Thanks,
Jason
Tom, On Thu, Mar 29, 2001 at 11:40:08AM -0500, Tom Lane wrote: > BTW, I note that we are trying to use Unix sockets here. Does the bug > still appear if you force pg_regress to use TCP connections? I'm not sure if you already know this, but Cygwin Unix sockets are actually implemented as TCP/IP sockets. Anyway, I forced TCP connections and got the same psql hang: (gdb) p *conn $1 = {pghost = 0xa016618 "localhost", pghostaddr = 0x0, pgport = 0xa016628 "65432", pgunixsocket = 0x0, pgtty = 0xa016638 "", pgoptions = 0xa016648 "", dbName = 0xa017188 "regression", pguser = 0xa017168 "jt", pgpass = 0xa017178 "", Pfdebug = 0x0, noticeHook = 0x67984e8c <defaultNoticeProcessor>, noticeArg = 0x0, status = CONNECTION_STARTED, asyncStatus = PGASYNC_IDLE, notifyList = 0xa0103e8, sock = 3, laddr = {sa = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, in = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 0, sun_path = '\000' <repeats 107 times>}}, raddr = {sa = {sa_family = 2, sa_data = "ÿ\230\177\000\000\001\000\000\000\000\000\000\000"}, in = { sin_family = 2, sin_port = 39167, sin_addr = {s_addr = 16777343}, __pad = "\000\000\000\000\000\000\000"}, un = {sun_family = 2, sun_path = "ÿ\230\177\000\000\001", '\000' <repeats 101 times>}}, raddr_len = 16, be_pid = 0, be_key = 0, salt = "\000", lobjfuncs = 0x0, inBuffer = 0xa0103f8 "", inBufSize = 16384, inStart = 0, inCursor = 0, inEnd = 0, nonblocking = 0, outBuffer = 0xa014400 "", outBufSize = 8192, outCount = 0, result = 0x0, curTuple = 0x0, setenv_state = SETENV_STATE_IDLE, next_eo = 0x0, errorMessage = { data = 0xa016408 "", len = 0, maxlen = 256}, workBuffer = { data = 0xa016510 "", len = 0, maxlen = 256}, client_encoding = 0} > But, evidently, it never is coming ready for write. Any ideas on how to demonstrate this to the Cygwin community without all of the PostgreSQL baggage. Sorry, but I'm not very experienced with sockets. 
Thanks, Jason
Tom,

On Thu, Mar 29, 2001 at 01:00:44PM -0500, Tom Lane wrote:
> Not sure why this guy only responded to me and not the list, but here's
> a lead you might want to follow up ...
>
> On Thu, 29 Mar 2001 10:49:16 -0700, Scott Ribe wrote:
> > On Thu, Mar 29, 2001, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >Oh-ho, that's interesting! If you look at fe-connect.c you'll see that
> > >CONNECTION_STARTED must indicate that connect() returned EINPROGRESS
> > >rather than a success indication. The socket is supposed to go
> > >write-ready when the connection is finished...
> >
> > Uhm, generally speaking I am not qualified to participate in this
> > discussion...
> >
> > BUT I am pretty sure that some time past while searching for some other
> > network-related info on the MS web site I came across a document
> > describing bugs (or unique MS "features") in non-blocking IO and
> > particularly discussed the EINPROGRESS return value.
> >
> > I don't know what I'm talking about, I could be wrong, but I think you
> > should search on the MS web site for nonblocking IO and EINPROGRESS and you
> > might find the exact info that you need to discuss with the Cygwin folks.

I quickly searched the MSDN and could not find anything explicitly mentioning problems with non-blocking I/O and EINPROGRESS. Nevertheless, in src/interfaces/libpq/fe-connect.c, I found the following comment:

    /* ----------
     * Since I have no idea whether this is a valid thing to do under Windows
     * before a connection is made, and since I have no way of testing it, I
     * leave the code looking as below. When someone decides that they want
     * non-blocking connections under Windows, they can define
     * WIN32_NON_BLOCKING_CONNECTIONS before compilation. If it works, then
     * this code can be cleaned up.

Cygwin is essentially Windows in this regard since Cygwin uses Windows sockets to implement POSIX sockets. My WAG is that if EINPROGRESS is returned during a connect attempt then the regression test hangs; otherwise, the regression test runs to completion.

So, I applied the attached patch so that non-blocking I/O is not enabled until after the connection has been established (just like with Win32 and SSL). I have the regression test running in a forever loop. So far it has succeeded 10 times without a hang. On this machine, I have never been able to get more than three in a row to succeed before.

I am going to run the regression tests all night. I will report back tomorrow to let the list know whether or not I got any hangs.

Would the PostgreSQL team be willing to accept this patch, at least until I determine whether or not I can get Cygwin "fixed"? I will post to the Cygwin list tomorrow (when/if they are back up).

BTW, Cygwin did not support non-blocking (socket) I/O until 1.1.5, which was released in the November 2000 time frame. So another WAG is that this problem started to occur then, but I don't really remember that well.

Thanks,
Jason
Attachment
Tom, On Thu, Mar 29, 2001 at 04:55:17PM -0500, Jason Tishler wrote: > Cygwin is essentially Windows in this regard since Cygwin uses Windows > sockets to implement Posix sockets. My WAG is that if EINPROGRESS is > returned during a connect attempt then the regression test hangs; > otherwise, the regression test runs to completion. > > So, I applied the attached patch so that non-blocking I/O is not enabled > until after the connection has been established (just like with Win32 > and SSL). I have the regression test running in a forever loop. So far > it has succeeded 10 times without a hang. On this machine, I have never > been able to get more than three in a row to succeed before. > > I am going to run the regression tests all night. I will report back > tomorrow to let the list know whether or not I got any hangs. The regression test forever loop ran all night without a hang -- 150+ successes in a row. So, I feel that it is safe to say that Cygwin (or Windows Sockets) has problems with nonblocking connects. > Would the PostgreSQL team be willing to accept this patch? Any feedback on my patch (reattached for convenience)? I would hate to see 7.1 go out the door with this issue. BTW, why is libpq's connection policy currently nonblocking for all platforms except (straight) Win32? Do people try to connect to multiple postmasters concurrently? If not, then what is the benefit over a blocking connect? > At least, > until I determine whether or not I can get Cygwin "fixed?" I will post > to the Cygwin list tomorrow (when/if they are back up). I will post to the Cygwin list regarding this problem. Just to make sure that I have my story straight: psql is hanging while trying a nonblocking connect to postmaster (not one of the backends). Correct? If anyone has any nonblocking socket client code with a corresponding server lying around please let me know. I would like to post this to the Cygwin list to facilitate their debugging. 
Thanks, Jason
Attachment
Tom,

On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> Any feedback on my patch (reattached for convenience)? I would hate to
> see 7.1 go out the door with this issue.

I believe that I have finally found the root cause of the psql hangs. IMO, Cygwin is functioning properly and the issue lies in libpq's pqWait() use of select().

The MSDN states the following for select():

    ..
    Summary:
    A socket will be identified in a particular set when select
    returns if:

    ..

    exceptfds:
    If processing a connect call (nonblocking), connection
    attempt failed.
    ..

In libpq's pqWait(), we have the following:

    if (select(conn->sock + 1, &input_mask, &output_mask,
               (fd_set *) NULL, (struct timeval *) NULL) < 0)

After reading the above code, I hypothesized that select() was hanging because the exceptfds argument was NULL. Sure enough, if I apply the attached (nasty, hacky) patch, then the regression test does *not* hang anymore -- even with nonblocking connects. Some tests still fail due to a connection refused condition, which is not unreasonable since postmaster is very busy.

IMO, pqWait() should be enhanced to check the exceptfds too -- at least for Cygwin. If it is not too late in the release cycle to consider such a change, then someone with much more intimate knowledge of libpq should use my patch only as a starting point and then do the right thing.

If the above enhancement is deemed too risky, then I implore the PostgreSQL team to accept my previous patch that just makes connects blocking for Cygwin. Note that with this patch applied, I did see some regression test failures due to a connection refused condition -- for the same reasons as above.

Thanks,
Jason
Attachment
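To make the exceptfds idea concrete, here is a rough sketch of a wait helper that always includes the socket in the exception set. It is not the attached patch and not the real pqWait(): the name wait_socket, the WAIT_* flags, and the timeout parameter are all invented for this illustration, and it assumes a POSIX select(). Like pqWait(), it retries when select() is interrupted by a signal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>   /* the descriptors waited on are sockets */
#include <unistd.h>

/* Hypothetical result flags for this sketch. */
#define WAIT_READ   1
#define WAIT_WRITE  2
#define WAIT_EXCEPT 4

/*
 * Hypothetical helper: wait until sock is readable and/or writable, as
 * requested, or has an exceptional condition pending. A negative
 * timeout_sec means wait forever. Returns a bitmask of WAIT_* flags,
 * 0 on timeout, or -1 on a select() error other than EINTR.
 */
int wait_socket(int sock, int for_read, int for_write, int timeout_sec)
{
    fd_set rfds, wfds, efds;
    struct timeval tv;
    int ready = 0;

    assert(sock >= 0 && sock < FD_SETSIZE);

    for (;;) {
        /* select() modifies the sets and possibly tv, so rebuild them. */
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        FD_ZERO(&efds);
        if (for_read)
            FD_SET(sock, &rfds);
        if (for_write)
            FD_SET(sock, &wfds);
        FD_SET(sock, &efds);    /* always watch for exceptions */
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        if (select(sock + 1, &rfds, &wfds, &efds,
                   timeout_sec < 0 ? NULL : &tv) >= 0)
            break;
        if (errno != EINTR)
            return -1;
    }

    if (FD_ISSET(sock, &rfds))
        ready |= WAIT_READ;
    if (FD_ISSET(sock, &wfds))
        ready |= WAIT_WRITE;
    if (FD_ISSET(sock, &efds))
        ready |= WAIT_EXCEPT;
    return ready;
}
```

A caller waiting for a nonblocking connect to finish would treat WAIT_EXCEPT as "the attempt failed" and stop waiting, rather than blocking forever for a write-ready indication that never arrives.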
> -----Original Message-----
> From: Jason Tishler
>
> Tom,
>
> On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> > Any feedback on my patch (reattached for convenience)? I would hate to
> > see 7.1 go out the door with this issue.
>
> I believe that I have finally found the root cause to the psql hangs.
> IMO, Cygwin is functioning properly and the issue lies in the libpq's
> pqWait() use of select().
>
> The MSDN states the following for select():
>
> ..
> Summary:
> A socket will be identified in a particular set when select
> returns if:
>
> ..
>
> exceptfds:
> If processing a connect call (nonblocking), connection
> attempt failed.
> ..
>

Oh, I found the same description yesterday, though I've had no time to test it. If your patch resolves the *hang*, it may be the right solution, at least for the cygwin port.

BTW I've never passed the parallel regression test without a hang or a refusal (with your previous patch applied) under my cygwin environment. I added one more connect() call after the refusal and passed all regression tests successfully. Hmm, it may be a preferable solution.

regards,
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Oh I found the same description yesterday though I've had no time > to test it. If your patch resolves *hang*, it may be the right solution > at least for cygwin port. It seems clear that it's a good idea for fe-misc.c to check the exceptfds bit as well as read/write ready --- I'm surprised we have not seen problems associated with that on other platforms. I think it should check exceptfds all the time, regardless of whether we are waiting to read or to write. I'm inclined to also accept Jason's change to do the connect() in blocking mode on Cygwin. If we do both of those things, have we resolved the issue on Cygwin, or is there still a problem? regards, tom lane
Tom,

On Sat, Mar 31, 2001 at 05:45:45PM -0500, Tom Lane wrote:
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Oh I found the same description yesterday though I've had no time
> > to test it. If your patch resolves *hang*, it may be the right solution
> > at least for cygwin port.
>
> It seems clear that it's a good idea for fe-misc.c to check the
> exceptfds bit as well as read/write ready --- I'm surprised we have not
> seen problems associated with that on other platforms. I think it
> should check exceptfds all the time, regardless of whether we are
> waiting to read or to write.

I'm glad that you agree. Please post to the list when the change is in CVS and I will test that this solves the Cygwin regression test (i.e., psql) hangs.

BTW, this will also solve the problem of Cygwin psql hanging when no postmaster is running, which I stumbled across when enabling Unix domain socket support. Previously, I thought that it was a Cygwin problem, but now I know that it is caused by the same pqWait() problem.

> I'm inclined to also accept Jason's change to do the connect() in
> blocking mode on Cygwin.

Actually, the blocking connect() change for Cygwin is obviated by the pqWait() fix. So, I no longer recommend making the blocking connect() change for Cygwin, unless you do so for other Unixes too.

> If we do both of those things, have we
> resolved the issue on Cygwin, or is there still a problem?

If you make both of these changes, then the pqWait() fix will never be triggered under Cygwin. When I tested my hacky patch to pqWait(), I had to back out my blocking connect() patch in order for the pqWait() changes to take effect. The regression test still did not hang, although I continued to have spurious failures due to connection refused conditions.

On Sat, Mar 31, 2001 at 10:15:08AM +0900, Hiroshi Inoue wrote:
> BTW I've never passed the pararell regression test without hang or
> refusal(with your previous patch appiled) under my cygwin environ-
> ment. I added one more connect() call after the refusal and passed
> all regression test successfully. Hmm it may be a more preferable
> solution.

I'm wondering whether it makes sense to add a simple connection retry policy as suggested above by Hiroshi. Otherwise, make check will report spurious failures due to connection refused conditions.

If it is considered too late in the release cycle for such a change, then I offer the following suggestions:

1. Change make check to use the serial_schedule or at least allow it to be easily selected via a make variable (e.g., make schedule=serial_schedule check).

2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c to a number that will "ensure" that the parallel_schedule version of the regression test does not generate connection refused conditions. Note that I'm not even sure this will really work on all (or any) platforms.

Thanks,
Jason
Tom, On Sat, Mar 31, 2001 at 10:07:22PM -0500, Jason Tishler wrote: > BTW, this will also solve the problem of Cygwin psql hanging when no > postmaster is running which I stumbled across when enabling Unix domain > socket support. Previously, I thought that it was a Cygwin problem but > now I know that it is caused by the same pqWait() problem. Oops, above I meant "an unconnected socket file (e.g., /tmp/.s.PGSQL.5432)", not "no postmaster is running". That's the problem with taking notes (which I rarely do but did in this case) -- you actually have to review the notes for them to be useful... :,) Jason
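[Editor's sketch] The pqWait() fix discussed above comes down to passing the socket in select()'s exceptfds set every time, whether the caller is waiting to read or to write. The following is a minimal illustration under stated assumptions, not the code committed to fe-misc.c: the function name wait_for_socket, its parameters, and its return convention are hypothetical. The motivation, as far as the Winsock documentation describes it, is that a failed non-blocking connect() is reported through exceptfds there, whereas many Unix stacks report it as writability.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real pqWait()): block until `sock` is
 * readable and/or writable, or has an exceptional condition pending.
 * Returns 1 if any watched condition fired, 0 on timeout, -1 on failure.
 */
int
wait_for_socket(int sock, int for_read, int for_write, int timeout_ms)
{
    fd_set          rfds, wfds, efds;
    struct timeval  tv;
    int             rc;

    for (;;)
    {
        /* select() modifies the sets and may modify tv, so rebuild them. */
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        FD_ZERO(&efds);
        if (for_read)
            FD_SET(sock, &rfds);
        if (for_write)
            FD_SET(sock, &wfds);
        FD_SET(sock, &efds);    /* always watch for exceptions */

        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        rc = select(sock + 1, &rfds, &wfds, &efds, &tv);
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (rc < 0)
            return -1;          /* genuine select() failure */
        return (rc > 0) ? 1 : 0;
    }
}
```

A caller that gets 1 back would still need to ask getsockopt(SO_ERROR) whether the wakeup means success or failure; the sketch only guarantees that the wait ends instead of hanging.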
> -----Original Message----- > From: Jason Tishler [mailto:Jason.Tishler@dothill.com] > > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c > to a number that will "ensure" that the parallel_schedule version of the > regression test does not generate connection refused conditions. Note > that I'm not even sure this will really work on all (or any) platforms. > Hmm, I changed the backlog parameter on trial but I wasn't able to see any improvements. regards, Hiroshi Inoue
Hiroshi, On Sun, Apr 01, 2001 at 11:45:04PM +0900, Hiroshi Inoue wrote: > > From: Jason Tishler [mailto:Jason.Tishler@dothill.com] > > > > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c > > to a number that will "ensure" that the parallel_schedule version of the > > regression test does not generate connection refused conditions. Note > > that I'm not even sure this will really work on all (or any) platforms. > > > > Hmm, I changed the backlog parameter on trial but I wasn't able > to see any improvements. That is what I kind of expected. Even if it worked, it would not have been a foolproof solution anyway. Thanks, Jason
Jason Tishler <Jason.Tishler@dothill.com> writes: >> It seems clear that it's a good idea for fe-misc.c to check the >> exceptfds bit as well as read/write ready --- I'm surprised we have not >> seen problems associated with that on other platforms. I think it >> should check exceptfds all the time, regardless of whether we are >> waiting to read or to write. > I'm glad that you agree. Please post to the list when the change is in > CVS and I will test that this solves the Cygwin regression test (i.e., > psql) hangs. Done as of yesterday; should be in this morning's snapshot. > Actually, the blocking connect() change for Cygwin is obviated by the > pqWait() fix. So, I am now no longer recommending making the blocking > connect() change for Cygwin. Unless, you do so for other Unixes too. I made both changes in the hope that the blocking connect change would suppress your problem with connection-refused failures. If it does not, then we may as well reverse out the fe-connect.c change. Let me know. >> If we do both of those things, have we >> resolved the issue on Cygwin, or is there still a problem? > I'm wondering whether it makes sense to add a simple connection retry > policy as suggested above by Hiroshi? I do not think it is appropriate for libpq to do that. For one thing, where would you stop --- why exactly two tries? > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c > to a number that will "ensure" that the parallel_schedule version of the > regression test does not generate connection refused conditions. Note > that I'm not even sure this will really work on all (or any) platforms. We already use SOMAXCONN which is supposed to be defined by the system as the maximum allowed queue depth. If Cygwin fails to define it, or defines it as something less than it should be, then we might consider installing a Cygwin-specific hack to redefine SOMAXCONN. However Hiroshi says later that he already tried this. 
I'm inclined to think that Cygwin simply has a problem with servicing concurrent connection requests, perhaps even before the alleged SOMAXCONN value is reached. regards, tom lane
Tom, On Sun, Apr 01, 2001 at 01:57:35PM -0400, Tom Lane wrote: > Jason Tishler <Jason.Tishler@dothill.com> writes: > > I'm glad that you agree. Please post to the list when the change is in > > CVS and I will test that this solves the Cygwin regression test (i.e., > > psql) hangs. > > Done as of yesterday; should be in this morning's snapshot. Thanks. > > Actually, the blocking connect() change for Cygwin is obviated by the > > pqWait() fix. So, I am now no longer recommending making the blocking > > connect() change for Cygwin. Unless, you do so for other Unixes too. > > I made both changes in the hope that the blocking connect change would > suppress your problem with connection-refused failures. If it does not, > then we may as well reverse out the fe-connect.c change. Let me know. With both changes or only the fe-connect.c one, psql does not hang and displays the following error message when the connection is refused: psql: connectDBStart() -- connect() failed: Connection refused Is the postmaster running locally and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'? With only the fe-misc.c change, psql does not hang and displays the following error message when the connection is refused: psql: PQconnectPoll() -- connect() failed: error 10061 Is the postmaster running locally and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'? In both cases there are no hangs, just the error messages are different. Unfortunately, for the non-blocking case the error message is cryptic. I tried tracking down error "10061" which comes from getsockopt(), but I was unsuccessful. Is there any way to improve the readability of this error message? Also, the blocking connect change did *not* fix the connection refused (spurious) regression test failures. So this change should probably be backed out. > > I'm wondering whether it makes sense to add a simple connection retry > > policy as suggested above by Hiroshi? 
> > I do not think it is appropriate for libpq to do that. When I made my suggestion above, I was concerned that maybe libpq was not the right layer to be implementing connection policies and that possibly psql was the better place. > For one thing, where would you stop --- why exactly two tries? This was another one of my concerns too. > > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c > > to a number that will "ensure" that the parallel_schedule version of the > > regression test does not generate connection refused conditions. Note > > that I'm not even sure this will really work on all (or any) platforms. > > We already use SOMAXCONN which is supposed to be defined by the system > as the maximum allowed queue depth. If Cygwin fails to define it, or > defines it as something less than it should be, then we might consider > installing a Cygwin-specific hack to redefine SOMAXCONN. Cygwin defines SOMAXCONN to be 5. However, winsock.h defines it to be 5 while winsock2.h defines it to be 0x7fffffff. So, I'm not sure what the real Cygwin (i.e., Windows) maximum is. > However Hiroshi says later that he already tried this. Even if it worked, this would have just pushed the problem instead of really fixing it. > I'm inclined to think > that Cygwin simply has a problem with servicing concurrent connection > requests, perhaps even before the alleged SOMAXCONN value is reached. You meant Windows. Right? :,) In summary, I feel that the fe-connect.c change should be backed out so that Cygwin will be consistent with other UNIXes. I also hope that the non-blocking connection failure message can be made more readable and that make check will not generate spurious failure messages under Cygwin on slow machines. Thanks, Jason
Jason Tishler <Jason.Tishler@dothill.com> writes: > In both cases there are no hangs, just the error messages are different. > Unfortunately, for the non-blocking case the error message is cryptic. > I tried tracking down error "10061" which comes from getsockopt(), but > I was unsuccessful. Is there any way to improve the readability of this > error message? I'm inclined to leave the blocking-connect change in there just to suppress that peculiar error message. No, I have no idea where it's coming from, either. >> However Hiroshi says later that he already tried [ raising SOMAXCONN ] > Even if it worked, this would have just pushed the problem instead of > really fixing it. If the problem were overflow of the accept queue, then raising the listen() parameter ought to fix it, assuming that Windows does actually allow larger values for the parameter. Given that we are only hearing this problem reported on Windows, I have a sneaking suspicion that the effective queue length limit is 1 on that platform no matter what we pass to listen(). Is there anyone we might ask about concurrent connection-request handling on Windows? regards, tom lane
Tom, On Mon, Apr 02, 2001 at 01:44:14PM -0400, Tom Lane wrote: > Jason Tishler <Jason.Tishler@dothill.com> writes: > > In both cases there are no hangs, just the error messages are different. > > Unfortunately, for the non-blocking case the error message is cryptic. > > I tried tracking down error "10061" which comes from getsockopt(), but > > I was unsuccessful. Is there any way to improve the readability of this > > error message? > > I'm inclined to leave the blocking-connect change in there just to > suppress that peculiar error message. No, I have no idea where it's > coming from, either. I just figured out what is error 10061 -- it is WSAECONNREFUSED, Winsock's version of ECONNREFUSED. I just submitted a patch to Cygwin that maps getsockopt optval's from the Winsock versions to their corresponding errno values. I just tried psql with an unconnected socket file and psql displayed: psql: PQconnectPoll() -- connect() failed: Connection refused Is the postmaster running locally and accepting connections on Unix socket '/tmp/.s.PGSQL.5432'? as desired. If interested, see the following for details: http://www.cygwin.com/ml/cygwin-patches/2001-q2/msg00003.html If my Cygwin patch is accepted, I'll let the list know. At that time, I think that the fe-connect.c change should be backed out. > >> However Hiroshi says later that he already tried [ raising SOMAXCONN ] > > > Even if it worked, this would have just pushed the problem instead of > > really fixing it. > > If the problem were overflow of the accept queue, then raising the > listen() parameter ought to fix it, assuming that Windows does actually > allow larger values for the parameter. Given that we are only hearing > this problem reported on Windows, I have a sneaking suspicion that the > effective queue length limit is 1 on that platform no matter what we > pass to listen(). Is there anyone we might ask about concurrent > connection-request handling on Windows? 
In digging some more through the MSDN, I found out the backlog limit on NT 4.0 Workstation and Server is 5 and 200, respectively. So, it would appear that NT is really using this parameter. If interested, see the following for more details: http://support.microsoft.com/support/kb/articles/Q127/1/44.asp When running the parallel_schedule, as many as 18 psql's are trying to connect to postmaster. Isn't it conceivable that more than 6 are trying to connect concurrently? Thanks, Jason
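[Editor's sketch] The "error 10061" message in this message arises because the value read back through getsockopt() was a raw Winsock code rather than an errno value, so the usual error-string lookup had nothing sensible to print. A translation layer of roughly the following shape is one way to picture the fix. This is an illustration only and is not the patch submitted to Cygwin: the function name winsock_to_errno and the choice of which codes to cover are assumptions. The numeric values are the standard Winsock ones (WSAEWOULDBLOCK, WSAEINPROGRESS, WSAECONNRESET, WSAETIMEDOUT, WSAECONNREFUSED).

```c
#include <errno.h>

/*
 * Hypothetical translation from a Winsock error number to the matching
 * POSIX errno value. Only a handful of codes are covered; anything
 * unrecognized is passed through unchanged so the caller can still
 * report the raw number.
 */
int
winsock_to_errno(int wsa_err)
{
    switch (wsa_err)
    {
        case 0:
            return 0;               /* no pending error */
        case 10035:                 /* WSAEWOULDBLOCK */
            return EWOULDBLOCK;
        case 10036:                 /* WSAEINPROGRESS */
            return EINPROGRESS;
        case 10054:                 /* WSAECONNRESET */
            return ECONNRESET;
        case 10060:                 /* WSAETIMEDOUT */
            return ETIMEDOUT;
        case 10061:                 /* WSAECONNREFUSED */
            return ECONNREFUSED;
        default:
            return wsa_err;         /* unknown: leave as-is */
    }
}
```

With such a mapping applied before the value reaches the application, strerror() on the result yields the platform's usual "Connection refused" text instead of a bare number.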
Jason Tishler <Jason.Tishler@dothill.com> writes: > I just figured out what is error 10061 -- it is WSAECONNREFUSED, Winsock's > version of ECONNREFUSED. I just submitted a patch to Cygwin that maps > getsockopt optval's from the Winsock versions to their corresponding > errno values. Ah so. Sounds good. > If my Cygwin patch is accepted, I'll let the list know. At that time, I > think that the fe-connect.c change should be backed out. My feeling is that we should leave it in place for 7.1 in any case. Once there's a shipping Cygwin version that maps the error number correctly, we can back out the patch so that Cygwin is treated more like other platforms. > In digging some more through the MSDN, I found out the backlog limit > on NT 4.0 Workstation and Server is 5 and 200, respectively. This page only talks about NT; what of other flavors of Windows? Cygwin runs on more than NT, doesn't it? Interesting point here: a copy of Postgres compiled on NT WS would presumably see SOMAXCONN = 5 in the system headers. If the executable is then moved to NT Server, it would fail to take advantage of the higher queue limit. Do we need to hardwire a hack to use the larger value always on Windows? > When running the parallel_schedule, as many as 18 psql's are trying to > connect to postmaster. Isn't it conceivable that more than 6 are trying > to connection concurrently? Yes (although that's still hypothesis, not the proven cause of failure). I still suspect there's something else going on here, anyway. SOMAXCONN is nominally 5 on quite a lot of Unixen, but we've only heard reports of transient "make check" connect failures on Windows. Why is Windows so much more prone to show this problem? regards, tom lane
I wrote: > I still suspect there's something else going on here, anyway. SOMAXCONN > is nominally 5 on quite a lot of Unixen, but we've only heard reports of > transient "make check" connect failures on Windows. Why is Windows so > much more prone to show this problem? Hm, maybe I need to take this back. Some poking around shows that SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the platforms I've tested most. As an experiment I reduced the listen() parameter to 5 on HPUX, and bingo: I get connection-refused failures in "make check". So it seems that Windows' behavior is not so out of line after all. We would probably see similar failures on BSD-derived systems, since BSD systems traditionally set SOMAXCONN to 5. (Any BSD partisans able to check this?) I do not think that we should change "make check" to avoid this issue. If you are on a platform that has a problem with supporting lots of parallel connection requests, it seems to me that you'd best know about that limitation, and "make check" is doing you a service by pointing out the problem. What I do think we should consider is whether to believe SOMAXCONN unconditionally, or to use a large value in the listen() call no matter what the system headers claim SOMAXCONN is. This would avoid sillinesses such as using an NT-Workstation limit on an NT-Server machine. The only risk I can see is that some platforms might reject as erroneous a listen() parameter that's more than they are prepared to support. The Unix man pages I have access to claim that a too-large listen() parameter will be clamped to the kernel's SOMAXCONN without raising an error, but does anyone have an idea whether that behavior is universal? In the longer term, we should think about whether we can reduce the postmaster's connection service delay. 
Someone recently suggested that the postmaster should fork a child immediately upon receiving a connection, and let the child work on the authentication process while the parent goes right back to accept(). I'm not sure if that would help "make check" very much, since it's presumably not running anything more complex than "trust" authentication anyway. But it should eliminate auth delays caused by SSL, malfunctioning ident daemons, and sundry other problems. regards, tom lane
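[Editor's sketch] The idea of not trusting the compile-time SOMAXCONN can be pictured as computing the listen() backlog from the expected client load and falling back on the kernel to clamp it. The sketch below is an assumption-laden illustration, not what pqcomm.c does: the function name pick_listen_backlog, the factor of two, and the ceiling of 1000 are all hypothetical, and it relies on the behavior described above (too-large values being clamped silently), which the message itself notes may not be universal.

```c
#include <sys/socket.h>

/* Hypothetical upper bound, chosen only to keep the request sane. */
#define ASSUMED_BACKLOG_CEILING 1000

/*
 * Hypothetical helper: choose a listen() backlog from the number of
 * clients expected to connect at about the same time, instead of using
 * the header's SOMAXCONN unconditionally.
 */
int
pick_listen_backlog(int expected_clients)
{
    int wanted = expected_clients * 2;   /* headroom for bursts */

    if (wanted < SOMAXCONN)
        wanted = SOMAXCONN;              /* never ask for less than the header says */
    if (wanted > ASSUMED_BACKLOG_CEILING)
        wanted = ASSUMED_BACKLOG_CEILING;
    return wanted;
}
```

For the 18-way parallel group mentioned earlier in the thread, this asks for at least 36 slots even where the headers claim SOMAXCONN is 5; whether the kernel honors the request is a separate question that only testing on each platform can answer.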
Tom, On Mon, Apr 02, 2001 at 03:50:55PM -0400, Tom Lane wrote: > Jason Tishler <Jason.Tishler@dothill.com> writes: > > If my Cygwin patch is accepted, I'll let the list know. At that time, I > > think that the fe-connect.c change should be backed out. > > My feeling is that we should leave it in place for 7.1 in any case. > Once there's a shipping Cygwin version that maps the error number > correctly, we can back out the patch so that Cygwin is treated more > like other platforms. OK, the above plan is reasonable. > > In digging some more through the MSDN, I found out the backlog limit > > on NT 4.0 Workstation and Server is 5 and 200, respectively. > > This page only talks about NT; what of other flavors of Windows? Cygwin > runs on more than NT, doesn't it? Yes, it runs on 2000, 9X/Me, and even XP. Unfortunately, I couldn't (easily) find the limits for these versions. My WAG is that 2000 and XP will be the same or similar to NT. I am not concerned about 9X/Me because I find them unusable for other reasons. > Interesting point here: a copy of Postgres compiled on NT WS would > presumably see SOMAXCONN = 5 in the system headers. If the executable > is then moved to NT Server, it would fail to take advantage of the > higher queue limit. Actually, even if compiled on NT Server, SOMAXCONN is set to 5 due to Cygwin's socket.h. > Do we need to hardwire a hack to use the larger > value always on Windows? Sounds like a good idea, but the effort only seems reasonable if we can conclude that Windows will really take advantage of it. > > When running the parallel_schedule, as many as 18 psql's are trying to > > connect to postmaster. Isn't it conceivable that more than 6 are trying > > to connect concurrently? > > Yes (although that's still hypothesis, not the proven cause of failure). > > I still suspect there's something else going on here, anyway. 
SOMAXCONN > is nominally 5 on quite a lot of Unixen, but we've only heard reports of > transient "make check" connect failures on Windows. Why is Windows so > much more prone to show this problem? I don't know! I've been banging my head to find out why and my head is starting to hurt... :,) Jason
> I wrote: > > I still suspect there's something else going on here, anyway. SOMAXCONN > > is nominally 5 on quite a lot of Unixen, but we've only heard reports of > > transient "make check" connect failures on Windows. Why is Windows so > > much more prone to show this problem? > > Hm, maybe I need to take this back. Some poking around shows that > SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the > platforms I've tested most. As an experiment I reduced the listen() > parameter to 5 on HPUX, and bingo: I get connection-refused failures > in "make check". So it seems that Windows' behavior is not so out of > line after all. We would probably see similar failures on BSD-derived > systems, since BSD systems traditionally set SOMAXCONN to 5. (Any > BSD partisans able to check this?) BSDi 4.01 has: /* * Maximum queue length specifiable by listen. * The kernel has a configurable limit; * the non-kernel value is the traditional one. */ #ifndef KERNEL #define SOMAXCONN 64 /* XXX, really run-time settable */ #else #ifndef _POSIX_SOURCE #define SOMAXCONN_DFLT 64 #endif #endif and sysctl has: net.socket.maxconn = 64 that can be easily changed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> In the longer term, we should think about whether we can reduce the > postmaster's connection service delay. Someone recently suggested > that the postmaster should fork a child immediately upon receiving > a connection, and let the child work on the authentication process > while the parent goes right back to accept(). I'm not sure if that > would help "make check" very much, since it's presumably not running > anything more complex than "trust" authentication anyway. But it > should eliminate auth delays caused by SSL, malfunctioning ident > daemons, and sundry other problems. I think the trust for SSL/ident would be a good idea. -- Bruce Momjian