Thread: cuckoo is hung during regression test
The 8.1 build for cuckoo is currently hung, with the *postmaster* taking
all the CPU it can get. The build started almost 5 hours ago. The
postmaster is stuck in the following loop, according to ktrace/kdump:

  2023 postgres RET write 59/0x3b
  2023 postgres CALL close(0xffffffff)
  2023 postgres RET close -1 errno 9 Bad file descriptor
  2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
  2023 postgres RET sigprocmask 0
  2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
  2023 postgres RET select 1
  2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
  2023 postgres RET sigprocmask 0
  2023 postgres CALL accept(0x7,0x200148c,0x200150c)
  2023 postgres RET accept -1 errno 24 Too many open files
  2023 postgres CALL write(0x2,0x2003928,0x3b)
  2023 postgres GIO fd 2 wrote 59 bytes
      "LOG: could not accept new connection: Too many open files
      "
  2023 postgres RET write 59/0x3b
  2023 postgres CALL close(0xffffffff)
  2023 postgres RET close -1 errno 9 Bad file descriptor
  2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
  2023 postgres RET sigprocmask 0
  2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
  2023 postgres RET select 1
  2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
  2023 postgres RET sigprocmask 0
  2023 postgres CALL accept(0x7,0x200148c,0x200150c)
  2023 postgres RET accept -1 errno 24 Too many open files
  2023 postgres CALL write(0x2,0x200381c,0x3b)
  2023 postgres GIO fd 2 wrote 59 bytes
      "LOG: could not accept new connection: Too many open files
      "
  2023 postgres RET write 59/0x3b

ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)

Is there other useful info to be had about this process, or should I
just kill it?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

"Jim C. Nasby" <jim@nasby.net> writes: > The postmaster is stuck in the following loop, according to > ktrace/kdump: > 2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c) > 2023 postgres RET select 1 > 2023 postgres CALL sigprocmask(0x3,0x2f0d38,0) > 2023 postgres RET sigprocmask 0 > 2023 postgres CALL accept(0x7,0x200148c,0x200150c) > 2023 postgres RET accept -1 errno 24 Too many open files > 2023 postgres CALL write(0x2,0x2003928,0x3b) > 2023 postgres GIO fd 2 wrote 59 bytes > "LOG: could not accept new connection: Too many open files > " > 2023 postgres RET write 59/0x3b > 2023 postgres CALL close(0xffffffff) > 2023 postgres RET close -1 errno 9 Bad file descriptor > 2023 postgres CALL sigprocmask(0x3,0x2e6400,0) > 2023 postgres RET sigprocmask 0 > 2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c) > 2023 postgres RET select 1 Interesting. So accept() fails because it can't allocate an FD, which means that the select condition isn't cleared, so we keep retrying forever. I don't see what else we could do though. Having the postmaster abort on what might well be a transient condition doesn't sound like a hot idea. We could possibly sleep() a bit before retrying, just to not suck 100% CPU, but that doesn't really *fix* anything ... I've been meaning to bug you about increasing cuckoo's FD limit anyway; it keeps failing in the regression tests. > ulimit is set to 1224 open files, though I seem to keep bumping into that > (anyone know what the system-level limit is, or how to change it?) On my OS X machine, "ulimit -n unlimited" seems to set the limit to 10240 (or so a subsequent ulimit -a reports). But you could probably fix it using the buildfarm parameter that cuts the number of concurrent regression test runs. regards, tom lane
On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
> Interesting. So accept() fails because it can't allocate an FD, which
> means that the select condition isn't cleared, so we keep retrying
> forever. I don't see what else we could do though. Having the
> postmaster abort on what might well be a transient condition doesn't
> sound like a hot idea. We could possibly sleep() a bit before retrying,
> just to not suck 100% CPU, but that doesn't really *fix* anything ...

Well, not only that, but the machine is currently writing to the
postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
(perhaps growing exponentially to some limit) would be a good idea.

> I've been meaning to bug you about increasing cuckoo's FD limit anyway;
> it keeps failing in the regression tests.
>
>> ulimit is set to 1224 open files, though I seem to keep bumping into that
>> (anyone know what the system-level limit is, or how to change it?)
>
> On my OS X machine, "ulimit -n unlimited" seems to set the limit to
> 10240 (or so a subsequent ulimit -a reports). But you could probably
> fix it using the buildfarm parameter that cuts the number of concurrent
> regression test runs.

Odd... that works on my MBP (sudo bash; ulimit -n unlimited) and I get
12288. But the same thing doesn't work on cuckoo, which is a G4; the
limit stays at 1224 no matter what. Perhaps because I'm setting
maxfiles in launchd.conf.

In any case, I've upped it to a bit over 2k; we'll see what that does.
I find it interesting that aubrac isn't affected by this, since it's
still running with the default of only 256 open files.

I'm thinking we might want to change the default value for
max_files_per_process on OS X, or have initdb test it like it does for
other things.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

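For reference, a sleep "growing exponentially to some limit" could look roughly like the Python sketch below. The message gives no numbers, so the initial delay, growth factor, and cap are assumptions, as are the names `backoff_delays` and `first_delays`.

```python
import itertools


def backoff_delays(initial=0.1, factor=2.0, cap=5.0):
    """Yield retry delays in seconds: initial, initial*factor, ...,
    never exceeding cap. All default values are illustrative."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def first_delays(n, **kwargs):
    """Convenience helper: the first n delays as a list."""
    return list(itertools.islice(backoff_delays(**kwargs), n))
```

A caller would take the next delay after each failed accept() and start a fresh generator after a success, so a transient shortage costs little while a persistent one settles at the cap.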
Jim Nasby <jim@nasby.net> writes:
> On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
>> We could possibly sleep() a bit before retrying,
>> just to not suck 100% CPU, but that doesn't really *fix* anything ...

> Well, not only that, but the machine is currently writing to the
> postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
> (perhaps growing exponentially to some limit) would be a good idea.

Well, since the code has always behaved that way and no one noticed
before, I don't think it's worth anything as complicated as a variable
delay. I just stuck a fixed 100msec delay into the accept-failed code
path.

			regards, tom lane

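The shape of that change, a fixed pause on the accept-failed path, can be sketched as follows. This is a Python illustration of the idea only, not the actual postmaster code; `accept_or_delay`, its `log` and `sleep` parameters, and the constant name are hypothetical, and only the 100 msec figure comes from the message above.

```python
import time

ACCEPT_RETRY_DELAY = 0.1  # seconds; the fixed 100 msec from the message above


def accept_or_delay(listener, log=print, sleep=time.sleep):
    """Try one accept(). On failure, log it, pause briefly, and return
    None so the caller goes back to its select() loop."""
    try:
        conn, _addr = listener.accept()
        return conn
    except OSError as exc:
        # e.g. errno 24 (EMFILE), "Too many open files"
        log("could not accept new connection: %s" % (exc.strerror or exc))
        # The pending connection is still queued, so select() will report
        # the listener as ready again at once; the pause caps the retry
        # rate at roughly ten attempts (and ten log lines) per second.
        sleep(ACCEPT_RETRY_DELAY)
        return None
```

The pause does not free any file descriptors; it only bounds the CPU use and log volume while the condition lasts.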
Tom Lane wrote:
> Jim Nasby <jim@nasby.net> writes:
> > On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
> >> We could possibly sleep() a bit before retrying,
> >> just to not suck 100% CPU, but that doesn't really *fix* anything ...
>
> > Well, not only that, but the machine is currently writing to the
> > postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
> > (perhaps growing exponentially to some limit) would be a good idea.
>
> Well, since the code has always behaved that way and no one noticed
> before, I don't think it's worth anything as complicated as a variable
> delay. I just stuck a fixed 100msec delay into the accept-failed code
> path.

Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
(And so does the autovac code I'm currently looking at).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera wrote:
> Tom Lane wrote:
>> Jim Nasby <jim@nasby.net> writes:
>>> On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
>>>> We could possibly sleep() a bit before retrying,
>>>> just to not suck 100% CPU, but that doesn't really *fix* anything ...
>>>
>>> Well, not only that, but the machine is currently writing to the
>>> postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
>>> (perhaps growing exponentially to some limit) would be a good idea.
>>
>> Well, since the code has always behaved that way and no one noticed
>> before, I don't think it's worth anything as complicated as a variable
>> delay. I just stuck a fixed 100msec delay into the accept-failed code
>> path.
>
> Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
> (And so does the autovac code I'm currently looking at).

There is probably a good case for a shorter delay in postmaster, though.

cheers

andrew

Andrew Dunstan <andrew@dunslane.net> writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> delay. I just stuck a fixed 100msec delay into the accept-failed code
>>> path.
>>
>> Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
>> (And so does the autovac code I'm currently looking at).

> There is probably a good case for a shorter delay in postmaster, though.

Yeah, that's what I thought. We don't really care if either bgwriter or
autovac goes AWOL for a little while, but if the postmaster's asleep
then nobody can connect.

			regards, tom lane