Thread: postmaster fails to start

postmaster fails to start

From
"Dweck Nir"
Date:

Hi,

I need urgent help.

I am using PostgreSQL version 8.0.1.

The postmaster fails to start, and the log file looks as follows:

LOG:  database system was shut down at 2005-05-24 15:50:46 MSD

LOG:  checkpoint record is at 1/8D117BE4

LOG:  redo record is at 1/8D117BE4; undo record is at 0/0; shutdown TRUE

LOG:  next transaction ID: 3859443; next OID: 1904360

LOG:  database system is ready

LOG:  could not send data to client: Broken pipe

LOG:  received smart shutdown request

LOG:  checkpoints are occurring too frequently (18 seconds apart)

HINT:  Consider increasing the configuration parameter "checkpoint_segments".

LOG:  database system was interrupted at 2005-05-24 16:07:50 MSD

LOG:  checkpoint record is at 1/A50109AC

LOG:  redo record is at 1/A500075C; undo record is at 0/0; shutdown FALSE

LOG:  next transaction ID: 3859613; next OID: 1904360

LOG:  database system was not properly shut down; automatic recovery in progress

LOG:  redo starts at 1/A500075C

PANIC:  btree_delete_page_redo: lost target page

LOG:  startup process (PID 4409) was terminated by signal 6

LOG:  aborting startup due to startup process failure

LOG:  logger shutting down

LOG:  database system was interrupted while in recovery at 2005-05-24 16:11:14 MSD

HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.

LOG:  checkpoint record is at 1/A50109AC

LOG:  redo record is at 1/A500075C; undo record is at 0/0; shutdown FALSE

LOG:  next transaction ID: 3859613; next OID: 1904360

LOG:  database system was not properly shut down; automatic recovery in progress

LOG:  redo starts at 1/A500075C

PANIC:  btree_delete_page_redo: lost target page

LOG:  startup process (PID 4417) was terminated by signal 6

LOG:  aborting startup due to startup process failure

LOG:  logger shutting down

The sequence of events was as follows:

1) The computer was shut down without stopping the postmaster.

2) The postmaster was started, but it failed with an error saying that another postmaster might be running, so it was started again.

3) Since then, each time I try to start the postmaster I get the same error.

To start the postmaster I use "pg_ctl start".

I am running Red Hat Linux 9, kernel 2.4.20-8.

Regards,

Nir Dweck

Computer engineer

Tadiran Telecom

18 Hasivim Street,

P.O.Box 7607

Petach-Tikva 49170 Israel

Tel:  972-3-9262807

Fax: 972-3-9262755

<mailto:Nir.dweck@tadirantele.com>

Re: postmaster fails to start

From
Richard Huxton
Date:
I've taken the liberty of rearranging your email slightly.

Dweck Nir wrote:
> The sequence of events was as follow: 1) computer was shut down
> without stopping postmaster.

OK - not good. Some crucial questions:
1. Do you have fsync enabled or disabled in the postgresql.conf file?
2. Do you know whether your drives are flushing write-cache properly?
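
One way to answer the first question is to look for an uncommented fsync line in postgresql.conf; a line that starts with # is inactive, so the built-in default applies. The following is a minimal sketch of that check. The helper name effective_setting is made up for illustration, the parsing is deliberately naive (it does not handle a # inside a quoted value), and the default of "true" is taken from the commented-out line in the stock 8.0 configuration file.

```python
import re

def effective_setting(conf_text, name, default):
    """Return the last uncommented value of `name`, or `default` if it is never set."""
    value = default
    # Match "name = value" or "name value" at the start of a cleaned-up line.
    pattern = re.compile(r"^" + re.escape(name) + r"(?:\s*=\s*|\s+)(.*)$")
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (naive: ignores quoting)
        if not line:
            continue
        match = pattern.match(line)
        if match:
            value = match.group(1).strip().strip("'")  # later lines override earlier ones
    return value

sample = """
#fsync = true            # turns forced synchronization on or off
#wal_buffers = 8        # min 4, 8KB each
"""
print(effective_setting(sample, "fsync", "true"))
```

This only reports what the file says; it cannot tell you whether the drives honour the flush requests, which is the second question.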

> 2) postmaster was started, but because of an error that there might
> be another postmaster running, the postmaster was started again.

Was this just a matter of deleting the .pid file and did you check there
wasn't another postmaster running?
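
For reference, the check being asked about can be sketched as follows, assuming a Linux host and that the first line of postmaster.pid in the data directory holds the postmaster's PID. The function name is hypothetical. Signal 0 only tests whether a process with that PID exists; it does not prove that the process is a postmaster, since after a reboot the PID may have been reused by something unrelated.

```python
import os

def postmaster_pid_alive(pid_file):
    """Return True if the PID on the first line of pid_file names a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.readline().strip())
    except (OSError, ValueError):
        return False  # no pid file, or no usable PID in it
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process, so the pid file is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

If this returns True, it is worth confirming what the process is (for example with ps) before removing the file.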

> 3) since then each time I try to start the postmaster I get the same
> error.


> LOG:  redo starts at 1/A500075C
> PANIC:  btree_delete_page_redo: lost target page
> LOG:  startup process (PID 4409) was terminated by signal 6

OK - well, this error message is in backend/access/nbtree/nbtxlog.c
where it is replaying the write-ahead-log files for btrees (I'm no
hacker, I just searched the source for the error message and read the
comments).

So - it looks like you might have a corrupted WAL. That shouldn't be
possible if you were running with fsync enabled and drives that flushed
cache like they should, so I'm guessing that wasn't the case.

It might be possible to recover to a state before this point, but that's
not something I'm going to be able to advise on. There are two steps you
should take immediately though.

1. Take a file-backup of your entire data directory and keep it safe.
You might well be making repeated attempts to recover this.
2. Check your most recent database backup and restore it to another
machine - it may be quicker to restore that than fix your file corruption.
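
Step 1 can be as simple as a recursive copy taken while no postmaster is running. The sketch below is one way to do it; the function name and the timestamped directory naming are illustrative choices, not anything PostgreSQL prescribes. It follows symbolic links, so a pg_xlog directory that has been symlinked to another disk is copied by content rather than as a dangling link.

```python
import shutil
import time
from pathlib import Path

def snapshot_data_dir(data_dir, backup_root):
    """Copy data_dir into backup_root under a timestamped name and return the new path."""
    src = Path(data_dir)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree refuses to overwrite an existing destination, so an earlier
    # snapshot is never clobbered by a later one.
    shutil.copytree(src, dest)
    return dest
```

Every recovery attempt should then start from a fresh copy of this snapshot, never from the snapshot itself.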

--
   Richard Huxton
   Archonet Ltd

Re: postmaster fails to start

From
Richard Huxton
Date:
Dweck Nir wrote:
> Hi,
> I need urgent help.
> I am using PostgreSQL version 8.0.1.
> postmaster fails to start and the log file looks as follow:
>
> LOG:  database system was shut down at 2005-05-24 15:50:46 MSD
> LOG:  checkpoint record is at 1/8D117BE4
> LOG:  redo record is at 1/8D117BE4; undo record is at 0/0; shutdown TRUE
> LOG:  next transaction ID: 3859443; next OID: 1904360

Actually, it might be possible to use the PITR system to restore up to
just before the error (the transaction-id above might be a good start
point).

You'll want to move your WAL files to a different directory so it looks
like they've been copied from another machine. See this section of the
manuals for details of how to set up the recovery. Take your time
reading it thoroughly.
   http://www.postgresql.org/docs/8.0/static/backup-online.html

IMPORTANT - make sure you have a backup copy of the entire data
directory before trying this.

A warning - I've not tried this particular idea out, but as long as you
can partially replay the first WAL file, I don't see why it shouldn't work.
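
To make the idea concrete, here is a sketch of the recovery.conf such an attempt might use, written as a small generator. The parameter names restore_command, recovery_target_xid and recovery_target_inclusive come from the 8.0 manual page linked above. The function name, the archive directory and the choice of XID are placeholders, and the approach is as untested as the suggestion itself; the documented procedure also assumes a base backup to start from.

```python
def render_recovery_conf(archive_dir, target_xid=None, inclusive=False):
    """Build the text of a recovery.conf that restores WAL segments from archive_dir."""
    # %f and %p are expanded by the server: segment file name and destination path.
    lines = [f"restore_command = 'cp {archive_dir}/%f %p'"]
    if target_xid is not None:
        lines.append(f"recovery_target_xid = '{target_xid}'")
        # 'false' asks recovery to stop just before the target transaction.
        flag = "true" if inclusive else "false"
        lines.append(f"recovery_target_inclusive = '{flag}'")
    return "\n".join(lines) + "\n"

# Placeholder values: a hypothetical directory holding the moved WAL files, and
# the "next transaction ID" from the quoted log, used purely as an illustration.
print(render_recovery_conf("/mnt/wal_copy", target_xid=3859443))
```

The generated file would go in the data directory of a scratch copy, not the original.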

--
   Richard Huxton
   Archonet Ltd

Re: postmaster fails to start

From
"Dweck Nir"
Date:
Hi,
1) When the postmaster was started the first time, it was just a matter of the .pid file not having been removed, since the machine was restarted. There was no other postmaster running.
2) All the WAL settings are at their defaults:
#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = true            # turns forced synchronization on or off
#wal_sync_method = fsync    # the default varies across platforms:
                # fsync, fdatasync, open_sync, or open_datasync
#wal_buffers = 8        # min 4, 8KB each
#commit_delay = 0        # range 0-100000, in microseconds
#commit_siblings = 5        # range 1-1000

# - Checkpoints -

#checkpoint_segments = 3    # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300    # range 30-3600, in seconds
#checkpoint_warning = 30    # 0 is off, in seconds

# - Archiving -

#archive_command = ''        # command to use to archive a logfile segment

3) I have the data backed up in other databases (not as a file backup), so I am not really concerned about losing the data in this specific case. The problem is that the postmaster isn't starting, so I can't even restore the data. Most importantly, I would like to learn from this case what to do the next time this problem happens to me in the field.

Regards,
Nir.

-----Original Message-----
From: Richard Huxton [mailto:dev@archonet.com]
Sent: Wednesday, May 25, 2005 11:51 AM
To: Dweck Nir
Cc: postgreSQL mailing list (E-mail)
Subject: Re: [GENERAL] postmaster fails to start



Re: postmaster fails to start

From
Tom Lane
Date:
"Dweck Nir" <Nir.Dweck@tadirantele.com> writes:
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  redo starts at 1/A500075C
> PANIC:  btree_delete_page_redo: lost target page

This seems closely related to the problem discussed in this recent
thread:
http://archives.postgresql.org/pgsql-admin/2005-04/msg00008.php

            regards, tom lane