Thread: recovery after segmentation fault

recovery after segmentation fault

From

Ivan Sergio Borgonovo

Date:

08 April 2009, 14:47:19

postgresql suddenly died...

during recovery

2009-04-08 16:35:34 CEST FATAL:  the database system is starting up
^^^ several
2009-04-08 16:35:34 CEST LOG:  incomplete startup packet
2009-04-08 16:36:53 CEST FATAL:  the database system is starting up
2009-04-08 16:36:53 CEST LOG:  startup process (PID 3176) was terminated by signal 11: Segmentation fault
2009-04-08 16:36:53 CEST LOG:  aborting startup due to startup process failure

It could be something wrong with the recovery process in an aborted
transaction that is causing the segfault...

How can I resurrect the server and load a backup?
It was serving more than one DB and I assume that only one is
causing problems. Can I skip recovery for just that one and restore
it from backup?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: recovery after segmentation fault

From

Tom Lane

Date:

08 April 2009, 14:59:57

Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> 2009-04-08 16:36:53 CEST LOG:  startup process (PID 3176) was
> terminated by signal 11: Segmentation fault 2009-04-08 16:36:53 CEST
> LOG:  aborting startup due to startup process failure

Hmm, what Postgres version is this?  Can you get a stack trace from
the startup process crash?

The only simple way out of this is to delete the presumably-corrupt
WAL log by running pg_resetxlog.  That will destroy the evidence
about what went wrong, though, so if you'd like to contribute to
preventing such problems in future you need to save a copy of everything
beforehand (eg, tar up all of $PGDATA).  Also you might have a corrupt
database afterwards :-(
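
The save-first step described above could be sketched roughly as
follows. The function name backup_pgdata and the example paths are
assumptions for illustration (the data directory shown is the usual
Debian location for an 8.3 cluster named "main" and may differ on your
system). The server should be stopped before the archive is taken.

```shell
#!/bin/sh
# Sketch: archive the whole data directory before resetting the WAL.
# backup_pgdata is a hypothetical helper, not part of PostgreSQL.

backup_pgdata() {
    # $1 = data directory, $2 = destination archive (.tar.gz)
    datadir=$1
    archive=$2
    if [ ! -d "$datadir" ]; then
        echo "no such data directory: $datadir" >&2
        return 1
    fi
    # -C makes the paths inside the archive relative to the parent directory
    tar -czf "$archive" -C "$(dirname "$datadir")" "$(basename "$datadir")"
}

# Example (assumed paths; run only with the server stopped):
#   backup_pgdata /var/lib/postgresql/8.3/main /root/pgdata-before-reset.tar.gz
#   su postgres -c "/usr/lib/postgresql/8.3/bin/pg_resetxlog /var/lib/postgresql/8.3/main"
```

Keeping the archive outside the data directory means the evidence
survives even if the reset leaves the cluster unusable.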

            regards, tom lane

Re: recovery after segmentation fault

From

Ivan Sergio Borgonovo

Date:

08 April 2009, 15:24:34

On Wed, 08 Apr 2009 10:59:54 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > 2009-04-08 16:36:53 CEST LOG:  startup process (PID 3176) was
> > terminated by signal 11: Segmentation fault 2009-04-08 16:36:53
> > CEST LOG:  aborting startup due to startup process failure
>
> Hmm, what Postgres version is this?  Can you get a stack trace from
> the startup process crash?

How on Debian?
Debian does all its automagic stuff in init. I never learned how to
start pg manually.

> The only simple way out of this is to delete the presumably-corrupt
> WAL log by running pg_resetxlog.  That will destroy the evidence

I couldn't find it... mmm what a strange place for an executable:
/usr/lib/postgresql/8.3/bin/pg_resetxlog

> about what went wrong, though, so if you'd like to contribute to
> preventing such problems in future you need to save a copy of
> everything beforehand (eg, tar up all of $PGDATA).  Also you might
> have a corrupt database afterwards :-(

What if I don't care about recovering *one* DB (probably the
culprit), and just want to get the server restarted and then restore
that DB from a VERY recent backup?

Is there a way to just kill recovery for one DB? Just don't start it
at all?

This is the same DB that had problems recreating a GIN index,
BTW... and I have the feeling that the problem is related to that
index once more... I was running VACUUM FULL and aborted it...

I think the DB is trying to recreate the index but, due to some
problem (can I say bug or is it too early?), it segfaults.

I think this could be of some help:

2009-04-08 16:47:13 CEST LOG:  database system was not properly shut down; automatic recovery in progress
2009-04-08 16:47:13 CEST LOG: redo starts at 72/9200EBC8

BTW:
Linux amd64, debian stock kernel
Debian etch/backport: Version: 8.3.4-1~bpo40+1

Now let's learn how to use pg_resetxlog

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: recovery after segmentation fault

From

Martijn van Oosterhout

Date:

08 April 2009, 21:59:55

On Wed, Apr 08, 2009 at 05:24:08PM +0200, Ivan Sergio Borgonovo wrote:
> How on Debian?
> Debian does all its automagic stuff in init. I never learned how to
> start pg manually.

What might be easier is turning on core dumps (ulimit -S -c unlimited)
and then start postgres and see if it drops a core dump, which you can
then feed to gdb.

All the binaries are in /usr/lib/postgresql/8.3/bin/ (Debian supports
parallel installs of multiple versions of postgres).
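
Once a core file exists, pulling the backtrace out of it could look
something like the sketch below. The helper name print_backtrace is
made up for illustration, and the paths in the example are assumptions
for a Debian 8.3 install. The name and location of the core file depend
on the kernel's core pattern; by default it usually lands in the
process's working directory, which for postgres is the data directory.

```shell
#!/bin/sh
# Sketch: print a backtrace from a core file non-interactively.
# print_backtrace is a hypothetical helper name.

print_backtrace() {
    # $1 = path to the postgres binary, $2 = path to the core file
    binary=$1
    core=$2
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi
    # -batch exits after running the commands; -ex bt prints the stack
    gdb -batch -ex bt "$binary" "$core"
}

# Example, run as the postgres user (all paths are assumptions):
#   ulimit -S -c unlimited
#   /usr/lib/postgresql/8.3/bin/postgres -D /var/lib/postgresql/8.3/main \
#       -c config_file=/etc/postgresql/8.3/main/postgresql.conf
#   print_backtrace /usr/lib/postgresql/8.3/bin/postgres \
#       /var/lib/postgresql/8.3/main/core
```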

> What if I just don't care about recovery of *one* DB (that is maybe
> the culprit) and just see the server restart then just do a restore
> from a VERY recent backup?
>
> Is there a way to just kill recovery for one DB? Just don't start it
> at all?

Unfortunately, the XLOG is shared between all databases in one cluster.

> This is the same DB having problem with recreation of gin index
> BTW... and I've the feeling that the problem is related to that
> index once more... I was vacuuming full, I aborted...
>
> I think the DB is trying to recreate the index but due to some
> problem (can I say bug or is it too early?) it segfaults.

Interesting, hope you can get a good backtrace.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: recovery after segmentation fault

From

Ivan Sergio Borgonovo

Date:

08 April 2009, 23:16:09

On Wed, 8 Apr 2009 23:59:43 +0200
Martijn van Oosterhout <kleptog@svana.org> wrote:

> What might be easier is turning on core dumps (ulimit -S -c
> unlimited) and then start postgres and see if it drops a core

thanks.

> > Is there a way to just kill recovery for one DB? Just don't
> > start it at all?
>
> Unfortunately, the XLOG is shared between all databases in one
> cluster.

bwaaa. That's a bit of a pain.

I'm trying to understand this a bit better...
I think nothing terrible really happened since:
a) the DB that has the higher write load was actually the one that
caused the problem and I restored from a backup.
b) the other DBs have some writes too... but the software using them
has no notion of transactions, so it is built with atomic statements
in mind... No operation I can think of was writing to more than one
table, and I'd think most (all?) of the operations were atomic at the
statement level.

So if I lost some writes in the logs for the other DBs... that
shouldn't be a problem, right? I just lost some data... not
consistency, right?

> > This is the same DB having problem with recreation of gin index
> > BTW... and I've the feeling that the problem is related to that
> > index once more... I was vacuuming full, I aborted...

> > I think the DB is trying to recreate the index but due to some
> > problem (can I say bug or is it too early?) it segfaults.

> Interesting, hope you can get a good backtrace.

I backed up the whole data dir.
I'm currently transferring it to my dev box.
I already have the same DB there... but it is on lenny.
And it never gave me a problem.
The versions are slightly different anyway:

Version: 8.3.6-1 (working)
Version: 8.3.4-1~bpo40+1 (sometimes problematic[1])

8.4 is at the door... and the only choices I have to fix the problem
on that box are:
- upgrade to lenny
- build postgresql from source, which is going to be a maintenance
  pain.

Could anything related to vacuum and/or GIN indexes have been fixed
between 8.3.4 and 8.3.6?

I think that if I stick to some rituals I can live with it:
avoid VACUUM FULL when there is load, and restart the server before
doing it.

[1] slow vacuum full and gin index update

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: recovery after segmentation fault

From

Craig Ringer

Date:

08 April 2009, 23:38:26

Martijn van Oosterhout wrote:
> On Wed, Apr 08, 2009 at 05:24:08PM +0200, Ivan Sergio Borgonovo wrote:
>> How on Debian?
>> Debian does all its automagic stuff in init. I never learned how to
>> start pg manually.
>
> What might be easier is turning on core dumps (ulimit -S -c unlimited)
> and then start postgres and see if it drops a core dump, which you can
> then feed to gdb.

Note that ulimit is inherited by child processes; it doesn't apply
system wide. You'll need to set the ulimit somewhere like the postgresql
init script, where the postmaster is a child of the shell in which the
ulimit command is run.
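
A small sketch of that inheritance, using only the shell's own ulimit
builtin (the variable names are illustrative). It lowers the soft core
limit to 0 inside a subshell, which is always permitted, and shows that
a child started from that subshell sees the new value while the
original shell keeps its own limit.

```shell
#!/bin/sh
# A limit set inside a shell is inherited by the processes it starts...
child_limit=$( (ulimit -S -c 0; sh -c 'ulimit -S -c') )

# ...but it does not leak back to the parent shell, which keeps its own.
parent_limit=$(ulimit -S -c)

echo "child sees: $child_limit, parent still has: $parent_limit"
```

By the same logic, a line such as "ulimit -S -c unlimited" only helps
if it runs in the shell that actually launches the postmaster, and only
if the hard limit allows raising the soft one.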

Also, because Debian strips its binaries by default, you might need to
rebuild the postgresql packages with debugging enabled and without
stripping to get a useful backtrace. Worth a try anyway, though.

Does Debian have a repository full of debug symbol packages like Ubuntu
does?

--
Craig Ringer