Thread: unexpected shutdown

unexpected shutdown

From
developer@wexwarez.com
Date:
My database has shutdown several times in the last couple days.  I have no
idea why.  I am running centos and I have not rebooted the server or made
any configuration changes.  I am running postgres 8.2 and it has been
stable since I installed it about 5 months ago.  The databases crashes and
so my software application goes down.  When I restart my application
everything seems to work fine.  But then it crashes again, something
appears to be corrupt.  Here are my logs:


LOG:  server process (PID 501) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  the database system is in recovery mode
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-06-17 10:55:32 PDT
LOG:  checkpoint record is at 0/72F41748
LOG:  redo record is at 0/72F41748; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2638157; next OID: 52761
LOG:  next MultiXactId: 4; next MultiXactOffset: 7
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 0/72F41790
LOG:  redo is not required
LOG:  database system is ready


LOG:  server process (PID 13904) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-06-18 10:09:51 PDT
LOG:  checkpoint record is at 0/73609D18
LOG:  redo record is at 0/73609D18; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2645768; next OID: 52761
LOG:  next MultiXactId: 4; next MultiXactOffset: 7
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 0/73609D60
LOG:  redo is not required
LOG:  database system is ready



Re: unexpected shutdown

From
"Chris Hoover"
Date:


On 6/18/07, developer@wexwarez.com <developer@wexwarez.com> wrote:
My database has shutdown several times in the last couple days.  I have no
idea why.  I am running centos and I have not rebooted the server or made
any configuration changes.  I am running postgres 8.2 and it has been
stable since I installed it about 5 months ago.  The databases crashes and
so my software application goes down.  When I restart my application
everything seems to work fine.  But then it crashes again, something
appears to be corrupt.  Here are my logs:


LOG:  server process (PID 501) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  the database system is in recovery mode
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-06-17 10:55:32 PDT
LOG:  checkpoint record is at 0/72F41748
LOG:  redo record is at 0/72F41748; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2638157; next OID: 52761
LOG:  next MultiXactId: 4; next MultiXactOffset: 7
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 0/72F41790
LOG:  redo is not required
LOG:  database system is ready


LOG:  server process (PID 13904) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-06-18 10:09:51 PDT
LOG:  checkpoint record is at 0/73609D18
LOG:  redo record is at 0/73609D18; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2645768; next OID: 52761
LOG:  next MultiXactId: 4; next MultiXactOffset: 7
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 0/73609D60
LOG:  redo is not required
LOG:  database system is ready



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

 

Looks like someone is doing a kill -9 on one of the database processes causing postgres to do a restart. 


LOG:  server process (PID 501) was terminated by signal 9

Re: unexpected shutdown

From
Gregory Stark
Date:
<developer@wexwarez.com> writes:

> My database has shutdown several times in the last couple days.  I have no
> idea why.  I am running centos and I have not rebooted the server or made
> any configuration changes.  I am running postgres 8.2 and it has been
> stable since I installed it about 5 months ago.  The databases crashes and
> so my software application goes down.  When I restart my application
> everything seems to work fine.  But then it crashes again, something
> appears to be corrupt.  Here are my logs:
>
>
> LOG:  server process (PID 501) was terminated by signal 9

Signal 9 is SIGKILL which means something outside Postgres is killing Postgres
processes. Either something is doing kill -9 <pid> of a Postgres pid.

There used to be some OSes that recorded a SIGKILL process was killed because
it had run out of memory, but I'm not sure Linux would report it as a SIGKILL.
What does dmesg say, it doesn't have any OOM messages does it?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: unexpected shutdown

From
developer@wexwarez.com
Date:
> <developer@wexwarez.com> writes:
>
>> My database has shutdown several times in the last couple days.  I have
>> no
>> idea why.  I am running centos and I have not rebooted the server or
>> made
>> any configuration changes.  I am running postgres 8.2 and it has been
>> stable since I installed it about 5 months ago.  The databases crashes
>> and
>> so my software application goes down.  When I restart my application
>> everything seems to work fine.  But then it crashes again, something
>> appears to be corrupt.  Here are my logs:
>>
>>
>> LOG:  server process (PID 501) was terminated by signal 9
>
> Signal 9 is SIGKILL which means something outside Postgres is killing
> Postgres
> processes. Either something is doing kill -9 <pid> of a Postgres pid.
>
> There used to be some OSes that recorded a SIGKILL process was killed
> because
> it had run out of memory, but I'm not sure Linux would report it as a
> SIGKILL.
> What does dmesg say, it doesn't have any OOM messages does it?
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>
>

Thanks for the replies...

The box is very secure and I think I can safely say no one did a kill -9
on the postgres process.  The java application that accesses postgres does
sometimes have memory issues but i am surprised this would affect
postgres.    I am surprised linux allowed one process to affect the other
like that. Should i be increasing postgres memory parameters or do you
think this might just indicate the box is overloaded?  Is there anything i
could do logging wise on the postgres side to get a better indication of
what is happening?

thanks


Re: unexpected shutdown

From
Francisco Reyes
Date:
developer@wexwarez.com writes:

> could do logging wise on the postgres side to get a better indication of
> what is happening?

You can increase the levels of loggin and redirect std_error to a file.
Something along the lines of

log_destination = 'stderr'
log_filename = 'postgresql-%Y-%m-%d.log'
log_error_verbosity = verbose
log_min_error_statement = debug1
log_min_messages = info

Re: unexpected shutdown

From
Tom Lane
Date:
developer@wexwarez.com writes:
> My database has shutdown several times in the last couple days.  I have no
> idea why.  I am running centos and I have not rebooted the server or made
> any configuration changes.

So in particular, you didn't disable memory overcommit?

> LOG:  server process (PID 501) was terminated by signal 9

If you didn't issue a manual kill -9, then this is almost certainly a
trace of the kernel OOM killer at work.  Google for "OOM kill" to learn
more, or see "memory overcommit" in the PG docs.

Memory overcommit is evil on a server.

            regards, tom lane

Re: unexpected shutdown

From
"Alexander Staubo"
Date:
On 6/18/07, developer@wexwarez.com <developer@wexwarez.com> wrote:
> My database has shutdown several times in the last couple days.  I have no
> idea why.
[...]
> LOG:  server process (PID 501) was terminated by signal 9

If this is Linux, check the kernel log (typically /var/log/kern.log,
or run dmesg) and look for lines like these, which indicate that the
kernel forcibly killed the process:

May 22 12:43:24 sultan kernel: [232933.420709] Out of Memory: Killed
process 5345 (postgres).

Alexander.

Re: unexpected shutdown

From
developer@wexwarez.com
Date:
> developer@wexwarez.com writes:
>> My database has shutdown several times in the last couple days.  I have
>> no
>> idea why.  I am running centos and I have not rebooted the server or
>> made
>> any configuration changes.
>
> So in particular, you didn't disable memory overcommit?
>
>> LOG:  server process (PID 501) was terminated by signal 9
>
> If you didn't issue a manual kill -9, then this is almost certainly a
> trace of the kernel OOM killer at work.  Google for "OOM kill" to learn
> more, or see "memory overcommit" in the PG docs.
>
> Memory overcommit is evil on a server.
>
>             regards, tom lane
>


You guys were right
:Jun 17 11:04:57 kernel: Out of Memory: Killed process 24928 (postmaster).

I did not disable memory overcommit.  I guess this is something I will
have to do.  I have actually never seen this before or heard of memory
overcommit.  I am surprised a setting like this comes enabled by default.
I read a bit about it and it seems to make sense to disable it, but from
practical experience do you know of any negative side effects?


Re: unexpected shutdown

From
Marco Colombo
Date:
developer@wexwarez.com wrote:
>> developer@wexwarez.com writes:
>>> My database has shutdown several times in the last couple days.  I have
>>> no
>>> idea why.  I am running centos and I have not rebooted the server or
>>> made
>>> any configuration changes.

Oh, I forgot. You do have plenty of swap space compared to RAM, yes? If
you're running w/o swap, or little swap, the default settings of
overcommit_memory = 2 will cut your available RAM by a factor of 2.

This thread is interesting reading:
http://www.mail-archive.com/pgsql-general@postgresql.org/msg97648.html

Since disk space is usually cheap these days, my rule of thumb is (the
old one):

swap = 2 * ram

read it this way: "it you have 32GB of RAM, you can afford 64GB of disk
storage"

BTW, this is a good idea both with overcommit on and off, IMHO.

.TM.

Re: unexpected shutdown

From
Marco Colombo
Date:
developer@wexwarez.com wrote:
>> developer@wexwarez.com writes:
>>> My database has shutdown several times in the last couple days.  I have
>>> no
>>> idea why.  I am running centos and I have not rebooted the server or
>>> made
>>> any configuration changes.
>> So in particular, you didn't disable memory overcommit?
>>
>>> LOG:  server process (PID 501) was terminated by signal 9
>> If you didn't issue a manual kill -9, then this is almost certainly a
>> trace of the kernel OOM killer at work.  Google for "OOM kill" to learn
>> more, or see "memory overcommit" in the PG docs.
>>
>> Memory overcommit is evil on a server.
>>
>>             regards, tom lane
>>
>
>
> You guys were right
> :Jun 17 11:04:57 kernel: Out of Memory: Killed process 24928 (postmaster).
>
> I did not disable memory overcommit.  I guess this is something I will
> have to do.  I have actually never seen this before or heard of memory
> overcommit.  I am surprised a setting like this comes enabled by default.
> I read a bit about it and it seems to make sense to disable it, but from
> practical experience do you know of any negative side effects?

The consensus on using overcommit_memory = 2 is far from general.

Your problem is a "java application with memory issues", so I think you
should address that directly first. Either run it elsewhere (and turn
the host running PG into a dedicated one) or fix its memory leaks or use
   resource limits provided by the OS to limit the java app.

Linux kernel people aren't totally clueless about VM. If they chose to
keep overcommiting and the OOM killer enabled by default, there're reasons.

With overcommitting on, you save al lot of swap space from being
allocated, leaving it for stuff that is actually used and not just
potentially used. The overall system throughput is thus higher.

When it comes to OOM situation, with overcommitting off things aren't
much better. First, OOM happens much before than with overcommiting on.
This usually isn't perceived as a big advantage, since 95% of the cases
the OOM is caused by one runaway process, so sooner or later it will
cause OOM either way. But in a correctly administered server, with OS
limits configured, a single runaway process doesn't cause OOM. OOM may
still happen for excessive load, and I'd rather see my system handle
some high load spikes than go into OOM situation. So lowering the
threshold of what 'excessive load' is, isn't necessarily a good idea.

And OK, let's say you've hit OOM anyway. There's no win-win solution.
Having PG processes SIGKILL'd is quite bad. But sitting in front of a
keyboard watching your system die w/o being able to login (OOM, so fork
fails) isn't much better. You may be able to do something (sysrq, maybe)
but the chances you manage to run a proper shutdown are quite thin, in
the general case. So you have to choose between the risk of PG being
SIGKILL'd (but the OOM _may_ pick the right process instead) and the
risk of being forced into hitting the 'reset' button. Either way, your
precious data isn't happy at all.

So the bottom line is, avoid OOM by properly configuing OS resource
limits. If you don't, then overcommit_memory = 2 is _definitely_ better.
If you do, it's a hard call. If you think about it, the funny thing is
that the more experienced the sysadm you're talking to is, the less
experience he has about handling OOM situations. By definition. :)

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it