Thread: troubleshooting 8.1.2
We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium box, repeatedly dying with the following log messages:

2006-07-11 12:52:27 EDT [21582] LOG: received fast shutdown request
2006-07-11 12:52:27 EDT [21591] LOG: shutting down
2006-07-11 12:52:27 EDT [21591] LOG: database system is shut down
2006-07-11 12:52:27 EDT [21584] LOG: logger shutting down

We can't figure out why they are shutting down. Nobody here is sending the signal. We don't have any cron jobs doing that sort of thing. We also saw out-of-memory errors when this first started happening, though glance had not shown GBL_MEM_UTIL above 90% (with OS buffer cache max/min percents at 10%/3%). The box has 64 GB of RAM, so that would seem to mean there was ~6 GB of RAM available when it got the out-of-memory errors. Just in case, we shut down several clusters and restarted them, and now, even with plentiful memory, they're dying with the same message.

Any ideas?

Ed
On Tuesday July 11 2006 1:17 pm, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium,
> > repeatedly dying with the following log message:
> >
> > 2006-07-11 12:52:27 EDT [21582] LOG: received fast
> > shutdown request
>
> *Something* is sending SIGINT to the postmaster --- it's
> simply not possible to reach that elog call any other way.
>
> How are you launching the postmaster? If from a terminal
> window, are you sure it's entirely disconnected from the
> terminal's process group? If not, typing control-C in that
> window could SIGINT the postmaster.

We use a shell function to start the postmaster:

dbstart()
{
    pg_ctl start -D $PGDATA -m smart -o "-i -p $PGPORT" -p postmaster
}

We are wondering whether our swap space was too small, and whether the OS was sending SIGINT when the swap reservation failed??

Ed
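Editorial sketch: one general way to narrow down where a SIGINT comes from is to catch it with an SA_SIGINFO handler and record the sender's PID. The code below is a minimal standalone illustration of that POSIX technique, not something the postmaster does (the function names watch_sigint, sigint_count and sigint_sender are made up for this example). It assumes the platform fills in si_pid and si_code for signals sent with kill(); whether and how HP-UX 11.23 reports this for a given signal has not been checked here.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of SIGINTs seen so far. */
static volatile sig_atomic_t sigint_seen = 0;
/* PID of the process that sent the most recent SIGINT via kill();
   stays 0 if the signal did not come from kill(), for example
   when it was generated by the terminal driver. */
static volatile sig_atomic_t sigint_from = 0;

/* Handler: only touches sig_atomic_t variables, which is safe in a handler. */
static void record_sender(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    sigint_seen = sigint_seen + 1;
    if (info != NULL && info->si_code == SI_USER)
        sigint_from = (sig_atomic_t)info->si_pid;
    else
        sigint_from = 0;
}

/* Hypothetical helper: install the handler. Returns 0 on success, -1 on error. */
int watch_sigint(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_sender;
    sa.sa_flags = SA_SIGINFO;   /* ask the kernel to pass siginfo_t */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical accessors for the values recorded by the handler. */
int sigint_count(void)    { return (int)sigint_seen; }
pid_t sigint_sender(void) { return (pid_t)sigint_from; }
```

A test process started the same way as the postmaster (same shell function, same terminal) could call watch_sigint() at startup and log sigint_sender() whenever sigint_count() changes. A nonzero PID would point at a process calling kill(); a zero would be consistent with a signal that did not come from kill(), such as a control-C in the controlling terminal.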
"Ed L." <pgsql@bluepolka.net> writes:
> We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium, repeatedly
> dying with the following log message:
> 2006-07-11 12:52:27 EDT [21582] LOG: received fast shutdown request

*Something* is sending SIGINT to the postmaster --- it's simply not possible to reach that elog call any other way.

How are you launching the postmaster? If from a terminal window, are you sure it's entirely disconnected from the terminal's process group? If not, typing control-C in that window could SIGINT the postmaster.

regards, tom lane
"Ed L." <pgsql@bluepolka.net> writes:
> We are wondering if our swap space was too small, and when the
> swap reservation failed, the OS was sending SIGINT??

I've never heard of an OS sending that particular signal for a memory shortage. 'strace' may be your friend here.

-Doug
"Ed L." <pgsql@bluepolka.net> writes:
> We are wondering if our swap space was too small, and when the
> swap reservation failed, the OS was sending SIGINT??

You'd have to check your OS documentation ... I thought HPUX would just return ENOMEM to brk() for such cases. It doesn't do memory overcommit does it?

regards, tom lane
On Tuesday July 11 2006 3:16 pm, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > We are wondering if our swap space was too small, and when
> > the swap reservation failed, the OS was sending SIGINT??
>
> You'd have to check your OS documentation ... I thought HPUX
> would just return ENOMEM to brk() for such cases. It doesn't
> do memory overcommit does it?

ENOMEM is correct for our brk(), too. We're running with pseudoswap, but I guess our swap space was too small, and that appears to be what we ran into. The SIGINT is still a mystery. Our truss output for one of these SIGINTs is at the bottom of this message, for what it's worth.

BTW, here's a conversation of possible interest that conflicts with advice I've heard here of keeping shared_buffers small and letting the OS do all the caching.

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1042336

Their argument appears to be that there are HPUX kernel inefficiencies for OS caches larger than 1.5 GB. You once argued that it would be unreasonable to expect user-space shared memory to be any more efficient than the kernel cache. I don't know one way or the other, and solid benchmarking that simulates our loads appears troublesome. I guess I could write a little C program to measure shared memory random access times as the size of the cache grows...

Anyway, here's the truss output:

( Attached to process 20787 ("postmaster -D /users/...") [64-bit] )
select(7, 0x9fffffffffffe670, NULL, NULL, 0x9fffffffffffe640) [sleeping]
Received signal 2, SIGINT, in select(), [caught], no siginfo
sigprocmask(SIG_SETMASK, 0x60000000000708c0, NULL) = 0
gettimeofday(0x9fffffffffff9460, NULL) = 0
stat("/usr/lib/tztab", 0x9fffffffffff9300) = 0
open("/usr/lib/tztab", O_RDONLY|0x800, 01210) = 9
mmap(NULL, 13197, PROT_READ, MAP_PRIVATE, 9, 0) = 0x9fffffffbb14c000
close(9) = 0
write(2, "2 0 0 6 - 0 7 - 1 1 1 3 : 5 5 ".., 76) = 76
kill(20793, SIGUSR2) = 0
kill(20794, SIGQUIT) = 0
...

Ed
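Editorial sketch: the "little C program" idea mentioned above, measuring shared memory random access times as the segment grows, could start from something like the following. This is only a rough illustration under several assumptions: it uses a System V segment via shmget()/shmat(), times reads with clock_gettime(CLOCK_MONOTONIC) (availability on HP-UX 11.23 has not been checked here), and the function name shm_random_read_ns is made up. It measures a single process reading its own segment, so it says nothing by itself about the kernel buffer cache side of the comparison.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <time.h>
#include <unistd.h>

/* Small xorshift generator so the timed loop is not dominated by rand(). */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Hypothetical helper: average nanoseconds per random one-byte read from a
   System V shared memory segment of `bytes` bytes, or -1.0 on any failure. */
double shm_random_read_ns(size_t bytes, long reads)
{
    if (bytes == 0 || reads <= 0)
        return -1.0;

    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id == -1)
        return -1.0;

    void *raw = shmat(id, NULL, 0);
    /* Mark the segment for removal right away so it cannot leak;
       it stays usable until the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (raw == (void *)-1)
        return -1.0;
    volatile unsigned char *mem = raw;

    /* Touch every page first so page faults are not part of the timing. */
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;
    for (size_t off = 0; off < bytes; off += (size_t)page)
        mem[off] = 1;

    uint64_t state = 88172645463325252ULL;
    unsigned long sum = 0;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < reads; i++)
        sum += mem[next_random(&state) % bytes];
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sum;

    shmdt(raw);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)reads;
}
```

Calling this from a small main() in a loop over increasing sizes (for example doubling from a few megabytes up to the configured shared memory limit) and printing the result per size would give the curve described above. Segment sizes are capped by the kernel's shmmax setting, so large sizes may simply return -1.0.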