Thread: troubleshooting 8.1.2

troubleshooting 8.1.2

From

"Ed L."

Date:

11 July 2006, 18:52:13

We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium box,
repeatedly dying with the following log message:

2006-07-11 12:52:27 EDT [21582]    LOG:  received fast shutdown request
2006-07-11 12:52:27 EDT [21591]    LOG:  shutting down
2006-07-11 12:52:27 EDT [21591]    LOG:  database system is shut down
2006-07-11 12:52:27 EDT [21584]    LOG:  logger shutting down

We can't figure out why it is shutting down.  Nobody here is sending
the signal.  We don't have any cron jobs doing that sort of thing.

We also saw out-of-memory errors when this first started
happening, though glance had not shown GBL_MEM_UTIL above 90% (with
the OS buffer cache max/min percentages at 10%/3%).  The box has 64GB
of RAM, so that would seem to mean there was ~6GB of RAM available
when it got the out-of-memory errors.

Just in case, we shut down several clusters and restarted them, and
now, even with plentiful memory, they're dying with the same message.

Any ideas?

Ed

Re: troubleshooting 8.1.2

From

"Ed L."

Date:

11 July 2006, 20:59:14

On Tuesday July 11 2006 1:17 pm, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium
> > box, repeatedly dying with the following log message:
> >
> > 2006-07-11 12:52:27 EDT [21582]    LOG:  received fast
> > shutdown request
>
> *Something* is sending SIGINT to the postmaster --- it's
> simply not possible to reach that elog call any other way.
>
> How are you launching the postmaster?  If from a terminal
> window, are you sure it's entirely disconnected from the
> terminal's process group? If not, typing control-C in that
> window could SIGINT the postmaster.

We use a shell function to start the postmaster:

dbstart() {
    pg_ctl start -D $PGDATA -m smart -o "-i -p $PGPORT" -p postmaster
}

We are wondering if our swap space was too small, and when the
swap reservation failed, the OS was sending SIGINT??

Ed

Re: troubleshooting 8.1.2

From

Tom Lane

Date:

11 July 2006, 21:57:11

"Ed L." <pgsql@bluepolka.net> writes:
> We have four 8.1.2 clusters running on an HP-UX 11.23 Itanium box,
> repeatedly dying with the following log message:

> 2006-07-11 12:52:27 EDT [21582]    LOG:  received fast shutdown request

*Something* is sending SIGINT to the postmaster --- it's simply not
possible to reach that elog call any other way.

How are you launching the postmaster?  If from a terminal window, are
you sure it's entirely disconnected from the terminal's process group?
If not, typing control-C in that window could SIGINT the postmaster.

            regards, tom lane

Re: troubleshooting 8.1.2

From

Douglas McNaught

Date:

11 July 2006, 23:28:38

"Ed L." <pgsql@bluepolka.net> writes:

> We are wondering if our swap space was too small, and when the
> swap reservation failed, the OS was sending SIGINT??

I've never heard of an OS sending that particular signal for a memory
shortage.  'strace' may be your friend here.

-Doug

Re: troubleshooting 8.1.2

From

Tom Lane

Date:

12 July 2006, 00:09:16

"Ed L." <pgsql@bluepolka.net> writes:
> We are wondering if our swap space was too small, and when the
> swap reservation failed, the OS was sending SIGINT??

You'd have to check your OS documentation ...  I thought HPUX would
just return ENOMEM to brk() for such cases.  It doesn't do memory
overcommit does it?
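
The expectation above, an allocation failure reported through errno
rather than through a signal, can be checked with a small C sketch. It
uses malloc() as a stand-in for the underlying brk() call, and try_alloc
is a hypothetical helper name. How a particular OS behaves under real
memory pressure or with overcommit enabled still has to be confirmed from
its own documentation.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: try to allocate `bytes` bytes.
 * Returns 0 on success, or the errno value describing the failure.
 * POSIX requires malloc() to set errno to ENOMEM on failure; ISO C alone
 * does not, so ENOMEM is assumed if errno was left untouched. */
int try_alloc(size_t bytes)
{
    void *p;

    errno = 0;
    p = malloc(bytes);
    if (p == NULL)
        return errno != 0 ? errno : ENOMEM;

    /* Touch the block through a volatile pointer so the compiler cannot
     * optimize the malloc/free pair away. */
    ((volatile char *) p)[0] = 0;
    free(p);
    return 0;
}
```

Calling try_alloc() with a request far larger than the address space
should come back with ENOMEM as an ordinary return value, with no signal
involved.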

            regards, tom lane

Re: troubleshooting 8.1.2

From

"Ed L."

Date:

15 July 2006, 21:18:38

On Tuesday July 11 2006 3:16 pm, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > We are wondering if our swap space was too small, and when
> > the swap reservation failed, the OS was sending SIGINT??
>
> You'd have to check your OS documentation ...  I thought HPUX
> would just return ENOMEM to brk() for such cases.  It doesn't
> do memory overcommit does it?

ENOMEM is correct for our brk(), too.  We're running with
pseudoswap, but I guess our swap space was too small, and that
appears to be what we ran into.  The SIGINT is still a mystery.  Our
truss output for one of these SIGINTs is at the bottom of this
message, for what it's worth.

BTW, here's a conversation of possible interest that conflicts
with advice I've heard here of keeping shared_buffers small
and letting the OS do all the caching.

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1042336

Their argument appears to be that there are HPUX kernel
inefficiencies for OS caches larger than 1.5GB.  You once argued
that it would be unreasonable to expect user-space shared memory
to be any more efficient than the kernel cache.  I don't know one
way or the other, and solid benchmarking that simulates our loads
looks difficult.  I guess I could write a little C program to
measure shared memory random access times as the size of the cache
grows...
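
A minimal sketch of such a measurement program follows, under loud
assumptions: it times random one-byte reads from ordinary malloc() heap
memory rather than a real shared memory segment, it measures CPU time
with clock(), and it makes no attempt to simulate a database load. The
names random_read_ns and sweep are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average CPU time in nanoseconds per random one-byte read from a buffer
 * of `bytes` bytes. Returns a negative value if allocation fails. */
double random_read_ns(size_t bytes, size_t reads)
{
    assert(bytes > 0 && reads > 0);

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;

    /* Touch every byte so the pages are actually backed by memory. */
    for (size_t i = 0; i < bytes; i++)
        buf[i] = (unsigned char) i;

    uint64_t state = 88172645463325252ULL;  /* arbitrary nonzero seed */
    volatile unsigned sink = 0;             /* keeps the reads alive */

    clock_t start = clock();
    for (size_t i = 0; i < reads; i++) {
        /* xorshift64 step: cheap pseudo-random index generator */
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        sink += buf[state % bytes];
    }
    clock_t stop = clock();

    free(buf);
    return (double) (stop - start) * 1e9 / CLOCKS_PER_SEC / (double) reads;
}

/* Measure buffer sizes min_bytes, 2*min_bytes, ... up to max_bytes and
 * print one line per size. Returns the number of sizes measured. */
int sweep(size_t min_bytes, size_t max_bytes, size_t reads)
{
    assert(min_bytes > 0 && min_bytes <= max_bytes);

    int measured = 0;
    for (size_t b = min_bytes; ; b *= 2) {
        double ns = random_read_ns(b, reads);
        if (ns < 0.0)
            break;                  /* allocation failed; stop the sweep */
        printf("%zu bytes: %.2f ns/read\n", b, ns);
        measured++;
        if (b > max_bytes / 2)
            break;                  /* next doubling would pass max_bytes */
    }
    return measured;
}
```

A main() that calls something like sweep(1 << 20, 1 << 30, 10000000)
would print one timing per buffer size. To get closer to the question
here, the malloc() call would need to be replaced with an attached shared
memory segment of the same size.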

Anyway, here's the truss output:

( Attached to process 20787 ("postmaster -D /users/...") [64-bit] )
select(7, 0x9fffffffffffe670, NULL, NULL, 0x9fffffffffffe640)
                    [sleeping] 
  Received signal 2, SIGINT, in select(), [caught], no siginfo
sigprocmask(SIG_SETMASK, 0x60000000000708c0, NULL)
                    = 0 
gettimeofday(0x9fffffffffff9460, NULL)
                    = 0 
stat("/usr/lib/tztab", 0x9fffffffffff9300)
                    = 0 
open("/usr/lib/tztab", O_RDONLY|0x800, 01210)
                    = 9 
mmap(NULL, 13197, PROT_READ, MAP_PRIVATE, 9, 0)
                    = 0x9fffffffbb14c000
close(9)
                    = 0 
write(2, "2 0 0 6 - 0 7 - 1 1   1 3 : 5 5 ".., 76)
                    = 76 
kill(20793, SIGUSR2)
                    = 0 
kill(20794, SIGQUIT)
                    = 0 
...

Ed