Thread: cuckoo is hung during regression test

cuckoo is hung during regression test

From
"Jim C. Nasby"
Date:
The 8.1 build for cuckoo is currently hung, with the *postmaster* taking
all the CPU it can get. The build started almost 5 hours ago.

The postmaster is stuck in the following loop, according to
ktrace/kdump:
  2023 postgres RET   write 59/0x3b
  2023 postgres CALL  close(0xffffffff)
  2023 postgres RET   close -1 errno 9 Bad file descriptor
  2023 postgres CALL  sigprocmask(0x3,0x2e6400,0)
  2023 postgres RET   sigprocmask 0
  2023 postgres CALL  select(0x8,0xbfffe194,0,0,0xbfffe16c)
  2023 postgres RET   select 1
  2023 postgres CALL  sigprocmask(0x3,0x2f0d38,0)
  2023 postgres RET   sigprocmask 0
  2023 postgres CALL  accept(0x7,0x200148c,0x200150c)
  2023 postgres RET   accept -1 errno 24 Too many open files
  2023 postgres CALL  write(0x2,0x2003928,0x3b)
  2023 postgres GIO   fd 2 wrote 59 bytes
       "LOG:  could not accept new connection: Too many open files
       "
  2023 postgres RET   write 59/0x3b
  2023 postgres CALL  close(0xffffffff)
  2023 postgres RET   close -1 errno 9 Bad file descriptor
  2023 postgres CALL  sigprocmask(0x3,0x2e6400,0)
  2023 postgres RET   sigprocmask 0
  2023 postgres CALL  select(0x8,0xbfffe194,0,0,0xbfffe16c)
  2023 postgres RET   select 1
  2023 postgres CALL  sigprocmask(0x3,0x2f0d38,0)
  2023 postgres RET   sigprocmask 0
  2023 postgres CALL  accept(0x7,0x200148c,0x200150c)
  2023 postgres RET   accept -1 errno 24 Too many open files
  2023 postgres CALL  write(0x2,0x200381c,0x3b)
  2023 postgres GIO   fd 2 wrote 59 bytes
       "LOG:  could not accept new connection: Too many open files
       "
  2023 postgres RET   write 59/0x3b
 

ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)

Is there other useful info to be had about this process, or should I just kill
it?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: cuckoo is hung during regression test

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> The postmaster is stuck in the following loop, according to
> ktrace/kdump:

>   2023 postgres CALL  select(0x8,0xbfffe194,0,0,0xbfffe16c)
>   2023 postgres RET   select 1
>   2023 postgres CALL  sigprocmask(0x3,0x2f0d38,0)
>   2023 postgres RET   sigprocmask 0
>   2023 postgres CALL  accept(0x7,0x200148c,0x200150c)
>   2023 postgres RET   accept -1 errno 24 Too many open files
>   2023 postgres CALL  write(0x2,0x2003928,0x3b)
>   2023 postgres GIO   fd 2 wrote 59 bytes
>        "LOG:  could not accept new connection: Too many open files
>        "
>   2023 postgres RET   write 59/0x3b
>   2023 postgres CALL  close(0xffffffff)
>   2023 postgres RET   close -1 errno 9 Bad file descriptor
>   2023 postgres CALL  sigprocmask(0x3,0x2e6400,0)
>   2023 postgres RET   sigprocmask 0
>   2023 postgres CALL  select(0x8,0xbfffe194,0,0,0xbfffe16c)
>   2023 postgres RET   select 1

Interesting.  So accept() fails because it can't allocate an FD, which
means that the select condition isn't cleared, so we keep retrying
forever.  I don't see what else we could do though.  Having the
postmaster abort on what might well be a transient condition doesn't
sound like a hot idea.  We could possibly sleep() a bit before retrying,
just to not suck 100% CPU, but that doesn't really *fix* anything ...

I've been meaning to bug you about increasing cuckoo's FD limit anyway;
it keeps failing in the regression tests.

> ulimit is set to 1224 open files, though I seem to keep bumping into that
> (anyone know what the system-level limit is, or how to change it?)

On my OS X machine, "ulimit -n unlimited" seems to set the limit to
10240 (or so a subsequent ulimit -a reports).  But you could probably
fix it using the buildfarm parameter that cuts the number of concurrent
regression test runs.
        regards, tom lane
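
For anyone wanting to check the descriptor limit from inside a process rather than from the shell, here is a minimal sketch in Python (chosen only for brevity; nothing in this thread uses it). It reads the soft and hard RLIMIT_NOFILE values, where the soft value is what "ulimit -n" reports. The helper names describe_fd_limit and raised_soft_limit are hypothetical, and the numbers in the comments are taken from this thread, not measured.

```python
import resource


def describe_fd_limit():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raised_soft_limit(soft, hard, wanted):
    """Pick a new soft limit: never lower than now, never above the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    return max(soft, wanted)


if __name__ == "__main__":
    soft, hard = describe_fd_limit()
    print("soft limit:", soft, "hard limit:", hard)
    # Example: a soft limit of 1224 with a request for 4096.
    print("would set soft limit to:", raised_soft_limit(1224, hard, 4096))
```

Applying the result would be a call to resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard)). The kernel may still refuse or clamp the value, which would be consistent with "unlimited" turning into 10240 above.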


Re: cuckoo is hung during regression test

From
Jim Nasby
Date:
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
> Interesting.  So accept() fails because it can't allocate an FD, which
> means that the select condition isn't cleared, so we keep retrying
> forever.  I don't see what else we could do though.  Having the
> postmaster abort on what might well be a transient condition doesn't
> sound like a hot idea.  We could possibly sleep() a bit before retrying,
> just to not suck 100% CPU, but that doesn't really *fix* anything ...

Well, not only that, but the machine is currently writing to the  
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep  
(perhaps growing exponentially to some limit) would be a good idea.
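
A minimal sketch of what a sleep "growing exponentially to some limit" could look like, written in Python purely for illustration. The function name backoff_delays and the default values (100 ms start, doubling, 1000 ms cap) are assumptions made for the example, not anything taken from the postmaster source.

```python
from itertools import islice


def backoff_delays(initial_ms=100, factor=2, cap_ms=1000):
    """Yield sleep durations in milliseconds that grow until they reach cap_ms."""
    delay = initial_ms
    while True:
        yield min(delay, cap_ms)
        delay = min(delay * factor, cap_ms)


if __name__ == "__main__":
    # First six delays with the defaults: [100, 200, 400, 800, 1000, 1000]
    print(list(islice(backoff_delays(), 6)))
```

A caller would take the next value after each failed attempt and start a fresh generator after a success, so that one bad burst does not leave later retries slow.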

> I've been meaning to bug you about increasing cuckoo's FD limit anyway;
> it keeps failing in the regression tests.
>
>> ulimit is set to 1224 open files, though I seem to keep bumping into that
>> (anyone know what the system-level limit is, or how to change it?)
>
> On my OS X machine, "ulimit -n unlimited" seems to set the limit to
> 10240 (or so a subsequent ulimit -a reports).  But you could probably
> fix it using the buildfarm parameter that cuts the number of concurrent
> regression test runs.

Odd... that works on my MBP (sudo bash; ulimit -n unlimited) and I  
get 12288. But the same thing doesn't work on cuckoo, which is a G4;  
the limit stays at 1224 no matter what. Perhaps because I'm setting  
maxfiles in launchd.conf.

In any case, I've upped it to a bit over 2k; we'll see what that  
does. I find it interesting that aubrac isn't affected by this, since  
it's still running with the default of only 256 open files.

I'm thinking we might want to change the default value for  
max_files_per_process on OS X, or have initdb test it like it does  
for other things.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
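
As a rough illustration of the "have initdb test it" idea above, the sketch below probes how many extra descriptors the current process can actually open, up to a small cap. It is Python and entirely hypothetical: count_usable_fds is a made-up name here, the cap of 64 is arbitrary, and nothing in this thread says how (or whether) initdb would do such a probe.

```python
import os


def count_usable_fds(max_to_probe=64):
    """Open descriptors until the OS refuses or max_to_probe is reached.

    Returns how many descriptors could be opened; all of them are closed
    again before returning.
    """
    if max_to_probe <= 0:
        return 0
    opened = []
    try:
        first = os.open(os.devnull, os.O_RDONLY)
        opened.append(first)
        while len(opened) < max_to_probe:
            opened.append(os.dup(first))
    except OSError:
        # Typically EMFILE (per-process limit) or ENFILE (system-wide limit).
        pass
    finally:
        for fd in opened:
            os.close(fd)
    return len(opened)


if __name__ == "__main__":
    print("descriptors available (capped at 64):", count_usable_fds())
```

A result below the cap would suggest the process is close to its limit, which is the situation where a lower max_files_per_process setting would matter.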




Re: cuckoo is hung during regression test

From
Tom Lane
Date:
Jim Nasby <jim@nasby.net> writes:
> On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
>> We could possibly sleep() a bit before retrying,
>> just to not suck 100% CPU, but that doesn't really *fix* anything ...

> Well, not only that, but the machine is currently writing to the  
> postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep  
> (perhaps growing exponentially to some limit) would be a good idea.

Well, since the code has always behaved that way and no one noticed
before, I don't think it's worth anything as complicated as a variable
delay.  I just stuck a fixed 100msec delay into the accept-failed code
path.
        regards, tom lane
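
The fixed-delay approach described above can be sketched as follows. This is a Python illustration and not the actual postmaster change: try_accept is a hypothetical helper, and the only detail taken from the thread is the 100 msec pause on the accept-failed path.

```python
import time

ACCEPT_FAIL_DELAY_S = 0.1  # the 100 msec mentioned in the message above


def try_accept(listener, log=print, sleep=time.sleep):
    """Accept one connection; on failure, log it and pause briefly.

    Returns whatever listener.accept() returns, or None if it failed.
    """
    try:
        return listener.accept()
    except OSError as exc:
        log(f"could not accept new connection: {exc.strerror}")
        # The pending connection is still queued, so the next select() would
        # report the listener as readable again right away. Pausing here
        # keeps the retry loop from spinning and flooding the log.
        sleep(ACCEPT_FAIL_DELAY_S)
        return None
```

The log and sleep parameters exist only so the failure path can be exercised without a real socket or a real pause. With a 100 msec pause the loop retries, and logs, at most about ten times per second instead of as fast as the CPU allows.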


Re: cuckoo is hung during regression test

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Jim Nasby <jim@nasby.net> writes:
> > On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
> >> We could possibly sleep() a bit before retrying,
> >> just to not suck 100% CPU, but that doesn't really *fix* anything ...
> 
> > Well, not only that, but the machine is currently writing to the  
> > postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep  
> > (perhaps growing exponentially to some limit) would be a good idea.
> 
> Well, since the code has always behaved that way and no one noticed
> before, I don't think it's worth anything as complicated as a variable
> delay.  I just stuck a fixed 100msec delay into the accept-failed code
> path.

Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
(And so does the autovac code I'm currently looking at).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: cuckoo is hung during regression test

From
Andrew Dunstan
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Jim Nasby <jim@nasby.net> writes:
>>> On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
>>>> We could possibly sleep() a bit before retrying,
>>>> just to not suck 100% CPU, but that doesn't really *fix* anything ...
>>>
>>> Well, not only that, but the machine is currently writing to the
>>> postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
>>> (perhaps growing exponentially to some limit) would be a good idea.
>>
>> Well, since the code has always behaved that way and no one noticed
>> before, I don't think it's worth anything as complicated as a variable
>> delay.  I just stuck a fixed 100msec delay into the accept-failed code
>> path.
>
> Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
> (And so does the autovac code I'm currently looking at).

There is probably a good case for a shorter delay in postmaster, though.

cheers

andrew


Re: cuckoo is hung during regression test

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> delay.  I just stuck a fixed 100msec delay into the accept-failed code
>>> path.
>> 
>> Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
>> (And so does the autovac code I'm currently looking at).

> There is probably a good case for a shorter delay in postmaster, though.

Yeah, that's what I thought.  We don't really care if either bgwriter or
autovac goes AWOL for a little while, but if the postmaster's asleep
then nobody can connect.
        regards, tom lane