Thread: buildfarm failures on smew and anole
The build is continuing to fail on smew and anole. The reason it's failing is because those machines are choosing max_connections = 10, which is not enough to run the regression tests. I think this is probably because of System V semaphore exhaustion. The machines are not choosing a small value for shared_buffers - they're still picking 128MB - so the problem is not the operating system's shared memory limit. But it might be that the operating system is short on some other resource that prevents starting up with a more normal value for max_connections. My best guess is System V semaphores; I think that one of the failed runs caused by the dynamic shared memory patch probably left a bunch of semaphores allocated, so the build will keep failing until those are manually cleaned up.

Can the owners of these buildfarm machines please check whether there are extra semaphores allocated and if so free them? Or at least reboot, to see if that unbreaks the build?

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/11/2013 03:33 PM, Robert Haas wrote:
> The build is continuing to fail on smew and anole. The reason it's
> failing is because those machines are choosing max_connections = 10,
> which is not enough to run the regression tests. I think this is
> probably because of System V semaphore exhaustion. The machines are
> not choosing a small value for shared_buffers - they're still picking
> 128MB - so the problem is not the operating system's shared memory
> limit. But it might be that the operating system is short on some
> other resource that prevents starting up with a more normal value for
> max_connections. My best guess is System V semaphores; I think that
> one of the failed runs caused by the dynamic shared memory patch
> probably left a bunch of semaphores allocated, so the build will keep
> failing until those are manually cleaned up.
>
> Can the owners of these buildfarm machines please check whether there
> are extra semaphores allocated and if so free them? Or at least
> reboot, to see if that unbreaks the build?
>

It is possible to set the buildfarm config

    build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

cheers

andrew
On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Can the owners of these buildfarm machines please check whether there
>> are extra semaphores allocated and if so free them? Or at least
>> reboot, to see if that unbreaks the build?
>
> It is possible to set the buildfarm config
>
> build_env=> {MAX_CONNECTIONS => 10 },
>
> and the tests will run with that constraint.
>
> Not sure if this would help.

Maybe I didn't explain that well. The problem is that the regression tests require at least 20 connections to run, and those two machines are currently auto-selecting 10 connections, so make check is failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-14 09:12:09 -0400, Robert Haas wrote:
> On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> >> Can the owners of these buildfarm machines please check whether there
> >> are extra semaphores allocated and if so free them? Or at least
> >> reboot, to see if that unbreaks the build?
> >
> > It is possible to set the buildfarm config
> >
> > build_env=> {MAX_CONNECTIONS => 10 },
> >
> > and the tests will run with that constraint.
> >
> > Not sure if this would help.
>
> Maybe I didn't explain that well. The problem is that the regression
> tests require at least 20 connections to run, and those two machines
> are currently auto-selecting 10 connections, so make check is failing.

I think pg_regress has support for spreading groups to fewer connections if max_connections is set appropriately. I guess that's what Andrew is referring to.

That said, I don't think that's the solution here. The machine clearly worked with more connections until recently.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/14/2013 09:12 AM, Robert Haas wrote:
> On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Can the owners of these buildfarm machines please check whether there
>>> are extra semaphores allocated and if so free them? Or at least
>>> reboot, to see if that unbreaks the build?
>> It is possible to set the buildfarm config
>>
>> build_env=> {MAX_CONNECTIONS => 10 },
>>
>> and the tests will run with that constraint.
>>
>> Not sure if this would help.
> Maybe I didn't explain that well. The problem is that the regression
> tests require at least 20 connections to run, and those two machines
> are currently auto-selecting 10 connections, so make check is failing.
>

Why do they need 20 connections? pg_regress has code in it to limit the degree of parallelism of tests, and has done for years, specifically to cater for buildfarm machines that are unable to handle the defaults. Using this option in the buildfarm client config triggers use of this feature.

cheers

andrew
On Mon, Oct 14, 2013 at 9:22 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Maybe I didn't explain that well. The problem is that the regression
>> tests require at least 20 connections to run, and those two machines
>> are currently auto-selecting 10 connections, so make check is failing.
>
> Why do they need 20 connections? pg_regress has code in it to limit the
> degree of parallelism of tests, and has done for years, specifically to
> cater for buildfarm machines that are unable to handle the defaults. Using
> this option in the buildfarm client config triggers use of this feature.

Hmm, I wasn't aware of that. I thought they needed 20 connections because parallel_schedule says:

# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.

If it's not supposed to matter how many connections are available, then that comment is misleading. But I think it does matter, at least in some situations, because otherwise these machines wouldn't be failing with "sorry, too many clients already".

Anyway, as Andres said, the machines were working fine until recently, so I think we just need to get them un-broken.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-14 09:28:04 -0400, Robert Haas wrote:
> # By convention, we put no more than twenty tests in any one parallel group;
> # this limits the number of connections needed to run the tests.
>
> If it's not supposed to matter how many connections are available,
> then that comment is misleading. But I think it does matter, at least
> in some situations, because otherwise these machines wouldn't be
> failing with "sorry, too many clients already".

Well, you need to explicitly pass --max-connections to pg_regress.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
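The effect of capping connections on a parallel group can be illustrated with a minimal sketch. This is Python, not pg_regress's C code; the function name `batch_parallel_group` and the test names are hypothetical, and pg_regress's real scheduling may differ in detail. The assumption is only that a group of up to twenty tests is run in smaller batches when a connection limit is given, and all at once when no limit is set.

```python
from typing import List


def batch_parallel_group(tests: List[str], max_connections: int) -> List[List[str]]:
    """Split one parallel group into batches of at most max_connections tests.

    A value of 0 is treated as "no limit": the whole group runs at once.
    (Hypothetical helper for illustration only.)
    """
    if max_connections < 0:
        raise ValueError("max_connections must be non-negative")
    if not tests:
        return []
    if max_connections == 0 or len(tests) <= max_connections:
        return [list(tests)]
    # Consecutive slices of at most max_connections tests each.
    return [tests[i:i + max_connections]
            for i in range(0, len(tests), max_connections)]


# A group of twenty tests, the conventional upper bound quoted above.
group = [f"test_{n:02d}" for n in range(1, 21)]
batches = batch_parallel_group(group, 10)
print([len(b) for b in batches])  # prints [10, 10]
```

With a limit of 10, the twenty-test group needs only ten simultaneous connections, at the cost of running in two rounds. Without an explicit limit, all twenty tests connect at once, which is where "sorry, too many clients already" would come from on a server started with max_connections = 10.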
Robert Haas <robertmhaas@gmail.com> writes:
> Anyway, as Andres said, the machines were working fine until recently,
> so I think we just need to get them un-broken.

I think you're talking past each other. What would be useful here is to find out *why* these machines are now failing, when they didn't before. There might or might not be anything useful to be done about it, but if we don't have that information, we can't tell.

regards, tom lane
On Mon, Oct 14, 2013 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Anyway, as Andres said, the machines were working fine until recently,
>> so I think we just need to get them un-broken.
>
> I think you're talking past each other. What would be useful here is
> to find out *why* these machines are now failing, when they didn't before.
> There might or might not be anything useful to be done about it, but if
> we don't have that information, we can't tell.

Well, my OP had a working theory which I think fits the facts, and some suggested troubleshooting steps. How about that for a start?

The real problem here is that neither of the buildfarm owners has responded to this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:
> Can the owners of these buildfarm machines please check whether there
> are extra semaphores allocated and if so free them? Or at least
> reboot, to see if that unbreaks the build?

I cleaned the semaphores on smew, but they came back. Whatever is crashing is leaving the semaphores lying around.
On Mon, Oct 14, 2013 at 4:29 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:
>> Can the owners of these buildfarm machines please check whether there
>> are extra semaphores allocated and if so free them? Or at least
>> reboot, to see if that unbreaks the build?
>
> I cleaned the semaphores on smew, but they came back. Whatever is
> crashing is leaving the semaphores lying around.

Ugh. When did you do that exactly? I thought I fixed the problem that was causing that days ago, and the last 4 days worth of runs all show the "too many clients" error.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
> > I cleaned the semaphores on smew, but they came back. Whatever is
> > crashing is leaving the semaphores lying around.
>
> Ugh. When did you do that exactly? I thought I fixed the problem
> that was causing that days ago, and the last 4 days worth of runs all
> show the "too many clients" error.

I did it a few times over the weekend. At least twice less than 4 days ago. There are currently no semaphores left around, so whatever happened in the last run cleaned it up.
On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
>> > I cleaned the semaphores on smew, but they came back. Whatever is
>> > crashing is leaving the semaphores lying around.
>>
>> Ugh. When did you do that exactly? I thought I fixed the problem
>> that was causing that days ago, and the last 4 days worth of runs all
>> show the "too many clients" error.
>
> I did it a few times over the weekend. At least twice less than 4 days
> ago. There are currently no semaphores left around, so whatever
> happened in the last run cleaned it up.

That seems to suggest I've introduced some bug. I'm at a loss as to what it is, though. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-16 08:39:10 -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> > On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
> >> > I cleaned the semaphores on smew, but they came back. Whatever is
> >> > crashing is leaving the semaphores lying around.
> >>
> >> Ugh. When did you do that exactly? I thought I fixed the problem
> >> that was causing that days ago, and the last 4 days worth of runs all
> >> show the "too many clients" error.
> >
> > I did it a few times over the weekend. At least twice less than 4 days
> > ago. There are currently no semaphores left around, so whatever
> > happened in the last run cleaned it up.
>
> That seems to suggest I've introduced some bug. I'm at a loss as to
> what it is, though. :-(

Ah. I see the issue. To reproduce do something like
# mkdir /tmp/empty
# mount --bind /tmp/empty /dev/shm/
and then run initdb.

The issue is that test_config_settings determines max_connections without disabling dynamic shared memory, which consequently chooses posix, which doesn't work. Setting it to none during the test makes it work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 16, 2013 at 8:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-16 08:39:10 -0400, Robert Haas wrote:
>> On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> > On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
>> >> > I cleaned the semaphores on smew, but they came back. Whatever is
>> >> > crashing is leaving the semaphores lying around.
>> >>
>> >> Ugh. When did you do that exactly? I thought I fixed the problem
>> >> that was causing that days ago, and the last 4 days worth of runs all
>> >> show the "too many clients" error.
>> >
>> > I did it a few times over the weekend. At least twice less than 4 days
>> > ago. There are currently no semaphores left around, so whatever
>> > happened in the last run cleaned it up.
>>
>> That seems to suggest I've introduced some bug. I'm at a loss as to
>> what it is, though. :-(
>
> Ah. I see the issue. To reproduce do something like
> # mkdir /tmp/empty
> # mount --bind /tmp/empty /dev/shm/
> and then run initdb.
>
> The issue is that test_config_settings determines max_connections
> without disabling dynamic shared memory which consequently chooses posix
> which doesn't work. Setting it to none during the test makes it work.

Gah. I fixed one instance of that problem in test_config_settings(), but missed the other. Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
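The failure mode diagnosed here can be modeled with a short sketch. This is an illustrative Python model, not initdb's actual C code: the candidate list `TRIAL_CONNS`, the function `choose_max_connections`, and the `server_starts` stand-in are all assumptions made for the example. The idea is that a probe tries candidate max_connections values from largest to smallest and keeps the first one with which a trial server starts. If every trial fails for an unrelated reason (here, an unusable posix dynamic shared memory type), the probe falls through to the smallest candidate.

```python
from typing import Callable, Sequence

# Hypothetical candidate list, largest first; the exact values are assumptions.
TRIAL_CONNS: Sequence[int] = (100, 50, 40, 30, 20, 10)


def choose_max_connections(server_starts: Callable[[int, str], bool],
                           dsm_type: str) -> int:
    """Return the first candidate that starts, else the smallest candidate."""
    for conns in TRIAL_CONNS:
        if server_starts(conns, dsm_type):
            return conns
    # Every trial failed: fall through to the last (smallest) value tried.
    return TRIAL_CONNS[-1]


def server_starts(conns: int, dsm_type: str) -> bool:
    """Stand-in for a trial startup on a host where POSIX shm is unusable."""
    if dsm_type == "posix":
        return False  # startup fails regardless of the connection count
    return conns <= 100


print(choose_max_connections(server_starts, "posix"))  # prints 10
print(choose_max_connections(server_starts, "none"))   # prints 100
```

In this model the low result says nothing about connection-related resources: the probe reports 10 simply because every trial failed. That matches the symptom in the thread, where shared_buffers was still chosen normally while max_connections bottomed out.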
On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
>> Gah. I fixed one instance of that problem in test_config_settings(),
>> but missed the other.
>
> Maybe it'd be better to default to none, just as max_connections
> defaults to 1 and shared_buffers to 16? As we write out the value in the
> config file, everything should still continue to work.

Hmm, possibly. But how would we document that? It seems strange to say that the default is none, but the actual setting probably won't be none on your system because we hack up postgresql.conf. shared_buffers pretty much just glosses over the distinction between "default" and "what you probably have configured", but I'm not sure that's actually great policy.

Trivial fix pushed, for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-16 09:44:32 -0400, Robert Haas wrote:
> On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
> >> Gah. I fixed one instance of that problem in test_config_settings(),
> >> but missed the other.
> >
> > Maybe it'd be better to default to none, just as max_connections
> > defaults to 1 and shared_buffers to 16? As we write out the value in the
> > config file, everything should still continue to work.
>
> Hmm, possibly. But how would we document that? It seems strange to
> say that the default is none, but the actual setting probably won't be
> none on your system because we hack up postgresql.conf.
> shared_buffers pretty much just glosses over the distinction between
> "default" and "what you probably have configured", but I'm not sure
> that's actually great policy.

I can't remember somebody actually being confused by that with s_b or max_connections. So maybe it's just ok not to document it. But yes, I can't come up with a succinct description of that behaviour either.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
> Gah. I fixed one instance of that problem in test_config_settings(),
> but missed the other.

Maybe it'd be better to default to none, just as max_connections defaults to 1 and shared_buffers to 16? As we write out the value in the config file, everything should still continue to work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services