Thread: buildfarm failures on smew and anole

buildfarm failures on smew and anole

From

Robert Haas

Date:

11 October 2013, 19:33:26

The build is continuing to fail on smew and anole.  The reason it's
failing is because those machines are choosing max_connections = 10,
which is not enough to run the regression tests.  I think this is
probably because of System V semaphore exhaustion.  The machines are
not choosing a small value for shared_buffers - they're still picking
128MB - so the problem is not the operating system's shared memory
limit.  But it might be that the operating system is short on some
other resource that prevents starting up with a more normal value for
max_connections.  My best guess is System V semaphores; I think that
one of the failed runs caused by the dynamic shared memory patch
probably left a bunch of semaphores allocated, so the build will keep
failing until those are manually cleaned up.

Can the owners of these buildfarm machines please check whether there
are extra semaphores allocated and if so free them?  Or at least
reboot, to see if that unbreaks the build?
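
For anyone checking a machine by hand, here is a rough sketch of how the leftover semaphore sets could be listed and freed.  It assumes the Linux-style column layout of "ipcs -s" (key, semid, owner, perms, nsems); other platforms print different columns, so the parsing would need adjusting there.  The function names and the idea of filtering by owner are illustrative assumptions, not part of any buildfarm tooling.

```python
import subprocess
from typing import List


def parse_semaphore_ids(ipcs_output: str, owner: str) -> List[int]:
    """Return the ids of semaphore sets owned by `owner`.

    Assumes Linux-style `ipcs -s` rows: key semid owner perms nsems.
    """
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Skip banners, the column header, and blank lines.
        if len(fields) == 5 and fields[0].startswith("0x") and fields[2] == owner:
            ids.append(int(fields[1]))
    return ids


def list_semaphores(owner: str) -> List[int]:
    """Run `ipcs -s` and return the semaphore set ids owned by `owner`."""
    result = subprocess.run(
        ["ipcs", "-s"], capture_output=True, text=True, check=True
    )
    return parse_semaphore_ids(result.stdout, owner)


def remove_semaphores(semids: List[int]) -> None:
    """Free each semaphore set with `ipcrm -s <semid>`."""
    for semid in semids:
        subprocess.run(["ipcrm", "-s", str(semid)], check=True)
```

Calling remove_semaphores(list_semaphores(user)) for the account that runs the buildfarm client would free everything that account owns, so it should only be done while no PostgreSQL server owned by that account is running.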

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Andrew Dunstan

Date:

11 October 2013, 20:03:14

On 10/11/2013 03:33 PM, Robert Haas wrote:
> The build is continuing to fail on smew and anole.  The reason it's
> failing is because those machines are choosing max_connections = 10,
> which is not enough to run the regression tests.  I think this is
> probably because of System V semaphore exhaustion.  The machines are
> not choosing a small value for shared_buffers - they're still picking
> 128MB - so the problem is not the operating system's shared memory
> limit.  But it might be that the operating system is short on some
> other resource that prevents starting up with a more normal value for
> max_connections.  My best guess is System V semaphores; I think that
> one of the failed runs caused by the dynamic shared memory patch
> probably left a bunch of semaphores allocated, so the build will keep
> failing until those are manually cleaned up.
>
> Can the owners of these buildfarm machines please check whether there
> are extra semaphores allocated and if so free them?  Or at least
> reboot, to see if that unbreaks the build?
>

It is possible to set the buildfarm config

    build_env=> {MAX_CONNECTIONS => 10 },

and the tests will run with that constraint.

Not sure if this would help.

cheers

andrew

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

14 October 2013, 13:19:19

On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Can the owners of these buildfarm machines please check whether there
>> are extra semaphores allocated and if so free them?  Or at least
>> reboot, to see if that unbreaks the build?
>
> It is possible to set the buildfarm config
>
>     build_env=> {MAX_CONNECTIONS => 10 },
>
> and the tests will run with that constraint.
>
> Not sure if this would help.

Maybe I didn't explain that well.  The problem is that the regression
tests require at least 20 connections to run, and those two machines
are currently auto-selecting 10 connections, so make check is failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Andres Freund

Date:

14 October 2013, 13:22:33

On 2013-10-14 09:12:09 -0400, Robert Haas wrote:
> On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> >> Can the owners of these buildfarm machines please check whether there
> >> are extra semaphores allocated and if so free them?  Or at least
> >> reboot, to see if that unbreaks the build?
> >
> > It is possible to set the buildfarm config
> >
> >     build_env=> {MAX_CONNECTIONS => 10 },
> >
> > and the tests will run with that constraint.
> >
> > Not sure if this would help.
> 
> Maybe I didn't explain that well.  The problem is that the regression
> tests require at least 20 connections to run, and those two machines
> are currently auto-selecting 10 connections, so make check is failing.

I think pg_regress has support for spreading groups to fewer connections
if max_connections is set appropriately. I guess that's what Andrew is
referring to.

That said, I don't think that's the solution here. The machine clearly
worked with more connections until recently.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: buildfarm failures on smew and anole

From

Andrew Dunstan

Date:

14 October 2013, 13:22:35

On 10/14/2013 09:12 AM, Robert Haas wrote:
> On Fri, Oct 11, 2013 at 4:03 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Can the owners of these buildfarm machines please check whether there
>>> are extra semaphores allocated and if so free them?  Or at least
>>> reboot, to see if that unbreaks the build?
>> It is possible to set the buildfarm config
>>
>>      build_env=> {MAX_CONNECTIONS => 10 },
>>
>> and the tests will run with that constraint.
>>
>> Not sure if this would help.
> Maybe I didn't explain that well.  The problem is that the regression
> tests require at least 20 connections to run, and those two machines
> are currently auto-selecting 10 connections, so make check is failing.
>

Why do they need 20 connections? pg_regress has code in it to limit the 
degree of parallelism of tests, and has done for years, specifically to 
cater for buildfarm machines that are unable to handle the defaults. 
Using this option in the buildfarm client config triggers use of this 
feature.
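
The limiting described here can be pictured roughly as follows.  This is a simplified sketch in Python, not pg_regress's actual C code: the function name and the treatment of a non-positive limit as "no limit" are assumptions made for illustration.  The idea is that a parallel group larger than the connection limit is run as consecutive batches, each batch finishing before the next one starts.

```python
from typing import List, Sequence


def split_group(tests: Sequence[str], max_connections: int) -> List[List[str]]:
    """Split one parallel group into batches of at most max_connections tests.

    A max_connections of zero or less is treated as "no limit" (an
    assumption of this sketch), so the whole group runs as one batch.
    """
    if max_connections <= 0 or len(tests) <= max_connections:
        return [list(tests)]
    return [
        list(tests[start:start + max_connections])
        for start in range(0, len(tests), max_connections)
    ]
```

With a limit of 10, a group of twenty tests becomes two batches of ten, so no more than ten test connections are open at once.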

cheers

andrew

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

14 October 2013, 13:28:09

On Mon, Oct 14, 2013 at 9:22 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Maybe I didn't explain that well.  The problem is that the regression
>> tests require at least 20 connections to run, and those two machines
>> are currently auto-selecting 10 connections, so make check is failing.
>
> Why do they need 20 connections? pg_regress has code in it to limit the
> degree of parallelism of tests, and has done for years, specifically to
> cater for buildfarm machines that are unable to handle the defaults. Using
> this option in the buildfarm client config triggers use of this feature.

Hmm, I wasn't aware of that.  I thought they needed 20 connections
because parallel_schedule says:

# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.

If it's not supposed to matter how many connections are available,
then that comment is misleading.  But I think it does matter, at least
in some situations, because otherwise these machines wouldn't be
failing with "sorry, too many clients already".

Anyway, as Andres said, the machines were working fine until recently,
so I think we just need to get them un-broken.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Andres Freund

Date:

14 October 2013, 13:29:56

On 2013-10-14 09:28:04 -0400, Robert Haas wrote:
> # By convention, we put no more than twenty tests in any one parallel group;
> # this limits the number of connections needed to run the tests.
> 
> If it's not supposed to matter how many connections are available,
> then that comment is misleading.  But I think it does matter, at least
> in some situations, because otherwise these machines wouldn't be
> failing with "sorry, too many clients already".

Well, you need to explicitly pass --max-connections to pg_regress.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: buildfarm failures on smew and anole

From

Tom Lane

Date:

14 October 2013, 17:33:55

Robert Haas <robertmhaas@gmail.com> writes:
> Anyway, as Andres said, the machines were working fine until recently,
> so I think we just need to get them un-broken.

I think you're talking past each other.  What would be useful here is
to find out *why* these machines are now failing, when they didn't before.
There might or might not be anything useful to be done about it, but if
we don't have that information, we can't tell.

			regards, tom lane

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

14 October 2013, 18:06:41

On Mon, Oct 14, 2013 at 1:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Anyway, as Andres said, the machines were working fine until recently,
>> so I think we just need to get them un-broken.
>
> I think you're talking past each other.  What would be useful here is
> to find out *why* these machines are now failing, when they didn't before.
> There might or might not be anything useful to be done about it, but if
> we don't have that information, we can't tell.

Well, my OP had a working theory which I think fits the facts, and
some suggested troubleshooting steps.  How about that for a start?

The real problem here is that neither of the buildfarm owners has
responded to this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Peter Eisentraut

Date:

14 October 2013, 20:30:02

On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:
> Can the owners of these buildfarm machines please check whether there
> are extra semaphores allocated and if so free them?  Or at least
> reboot, to see if that unbreaks the build? 

I cleaned the semaphores on smew, but they came back.  Whatever is
crashing is leaving the semaphores lying around.

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

14 October 2013, 22:14:32

On Mon, Oct 14, 2013 at 4:29 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Fri, 2013-10-11 at 15:33 -0400, Robert Haas wrote:
>> Can the owners of these buildfarm machines please check whether there
>> are extra semaphores allocated and if so free them?  Or at least
>> reboot, to see if that unbreaks the build?
>
> I cleaned the semaphores on smew, but they came back.  Whatever is
> crashing is leaving the semaphores lying around.

Ugh.  When did you do that exactly?  I thought I fixed the problem
that was causing that days ago, and the last 4 days worth of runs all
show the "too many clients" error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Peter Eisentraut

Date:

16 October 2013, 03:17:53

On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
> > I cleaned the semaphores on smew, but they came back.  Whatever is
> > crashing is leaving the semaphores lying around.
> 
> Ugh.  When did you do that exactly?  I thought I fixed the problem
> that was causing that days ago, and the last 4 days worth of runs all
> show the "too many clients" error.

I did it a few times over the weekend.  At least twice less than 4 days
ago.  There are currently no semaphores left around, so whatever
happened in the last run cleaned it up.

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

16 October 2013, 12:39:18

On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
>> > I cleaned the semaphores on smew, but they came back.  Whatever is
>> > crashing is leaving the semaphores lying around.
>>
>> Ugh.  When did you do that exactly?  I thought I fixed the problem
>> that was causing that days ago, and the last 4 days worth of runs all
>> show the "too many clients" error.
>
> I did it a few times over the weekend.  At least twice less than 4 days
> ago.  There are currently no semaphores left around, so whatever
> happened in the last run cleaned it up.

That seems to suggest I've introduced some bug.  I'm at a loss as to
what it is, though.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Andres Freund

Date:

16 October 2013, 12:54:32

On 2013-10-16 08:39:10 -0400, Robert Haas wrote:
> On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> > On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
> >> > I cleaned the semaphores on smew, but they came back.  Whatever is
> >> > crashing is leaving the semaphores lying around.
> >>
> >> Ugh.  When did you do that exactly?  I thought I fixed the problem
> >> that was causing that days ago, and the last 4 days worth of runs all
> >> show the "too many clients" error.
> >
> > I did it a few times over the weekend.  At least twice less than 4 days
> > ago.  There are currently no semaphores left around, so whatever
> > happened in the last run cleaned it up.
> 
> That seems to suggest I've introduced some bug.  I'm at a loss as to
> what it is, though.  :-(

Ah. I see the issue. To reproduce do something like
# mkdir /tmp/empty
# mount --bind /tmp/empty /dev/shm/
and then run initdb.

The issue is that test_config_settings determines max_connections
without disabling dynamic shared memory which consequently chooses posix
which doesn't work. Setting it to none during the test makes it work.
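
A rough model of why this ends in max_connections = 10, written in Python rather than initdb's C.  The candidate list, the fallback to the smallest candidate when every trial fails, and all names below are assumptions of this sketch; only the overall shape (try values from large to small, keep the first that starts) is taken from the description above.

```python
from typing import Callable, Sequence

# Assumed candidate values, tried from largest to smallest.
TRIAL_CONNS = (100, 50, 40, 30, 20, 10)


def make_probe(dsm_type: str, posix_shm_usable: bool) -> Callable[[int], bool]:
    """Build a stand-in for "start a trial server with max_connections = n"."""
    def probe(n: int) -> bool:
        # A trial server that uses POSIX dynamic shared memory cannot start
        # when /dev/shm is unusable, no matter how small n is.
        if dsm_type == "posix" and not posix_shm_usable:
            return False
        return True
    return probe


def pick_max_connections(candidates: Sequence[int],
                         probe: Callable[[int], bool]) -> int:
    """Return the first candidate the probe accepts, else the last one."""
    for n in candidates:
        if probe(n):
            return n
    # Assumed fallback: nothing worked, so settle for the smallest value.
    return candidates[-1]
```

In this model, a probe built with dsm_type "posix" on a machine with a broken /dev/shm rejects every candidate and the result is 10, while the same machine probed with "none" gets 100.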

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

16 October 2013, 13:35:50

On Wed, Oct 16, 2013 at 8:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-16 08:39:10 -0400, Robert Haas wrote:
>> On Tue, Oct 15, 2013 at 11:17 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> > On Mon, 2013-10-14 at 18:14 -0400, Robert Haas wrote:
>> >> > I cleaned the semaphores on smew, but they came back.  Whatever is
>> >> > crashing is leaving the semaphores lying around.
>> >>
>> >> Ugh.  When did you do that exactly?  I thought I fixed the problem
>> >> that was causing that days ago, and the last 4 days worth of runs all
>> >> show the "too many clients" error.
>> >
>> > I did it a few times over the weekend.  At least twice less than 4 days
>> > ago.  There are currently no semaphores left around, so whatever
>> > happened in the last run cleaned it up.
>>
>> That seems to suggest I've introduced some bug.  I'm at a loss as to
>> what it is, though.  :-(
>
> Ah. I see the issue. To reproduce do something like
> # mkdir /tmp/empty
> # mount --bind /tmp/empty /dev/shm/
> and then run initdb.
>
> The issue is that test_config_settings determines max_connections
> without disabling dynamic shared memory which consequently chooses posix
> which doesn't work. Setting it to none during the test makes it work.

Gah.  I fixed one instance of that problem in test_config_settings(),
but missed the other.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Robert Haas

Date:

16 October 2013, 13:44:40

On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
>> Gah.  I fixed one instance of that problem in test_config_settings(),
>> but missed the other.
>
> Maybe it'd be better to default to none, just as max_connections
> defaults to 1 and shared_buffers to 16? As we write out the value in the
> config file, everything should still continue to work.

Hmm, possibly.  But how would we document that?  It seems strange to
say that the default is none, but the actual setting probably won't be
none on your system because we hack up postgresql.conf.
shared_buffers pretty much just glosses over the distinction between
"default" and "what you probably have configured", but I'm not sure
that's actually great policy.

Trivial fix pushed, for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: buildfarm failures on smew and anole

From

Andres Freund

Date:

16 October 2013, 13:47:55

On 2013-10-16 09:44:32 -0400, Robert Haas wrote:
> On Wed, Oct 16, 2013 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
> >> Gah.  I fixed one instance of that problem in test_config_settings(),
> >> but missed the other.
> >
> > Maybe it'd be better to default to none, just as max_connections
> > defaults to 1 and shared_buffers to 16? As we write out the value in the
> > config file, everything should still continue to work.
> 
> Hmm, possibly.  But how would we document that?  It seems strange to
> say that the default is none, but the actual setting probably won't be
> none on your system because we hack up postgresql.conf.
> shared_buffers pretty much just glosses over the distinction between
> "default" and "what you probably have configured", but I'm not sure
> that's actually great policy.

I can't remember somebody actually being confused by that with s_b or
max_connections. So maybe it's just ok not to document it. But yes, I
can't come up with a succinct description of that behaviour either.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: buildfarm failures on smew and anole

From

Andres Freund

Date:

17 October 2013, 13:01:01

On 2013-10-16 09:35:46 -0400, Robert Haas wrote:
> Gah.  I fixed one instance of that problem in test_config_settings(),
> but missed the other.

Maybe it'd be better to default to none, just as max_connections
defaults to 1 and shared_buffers to 16? As we write out the value in the
config file, everything should still continue to work.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services