Thread: Power outage borked things (8.1.10)...

Power outage borked things (8.1.10)...

From
Darren Reed
Date:
Starting up postgres, I get the log contents below.

Is it really as bad as it suggests, namely that I
need to recover from backup?

The pg_clog directory has the following:
# ls -al pg_clog/
total 1216
drwx------   2 postgres  wheel     512 Jan 17 09:09 ./
drwx------  10 postgres  wheel     512 Feb 20 00:44 ../
-rw-------   1 postgres  wheel  253952 Dec  8 11:09 0000
-rw-------   1 postgres  wheel  253952 Dec 19 17:29 0001
-rw-------   1 postgres  wheel  253952 Jan  2 17:49 0002
-rw-------   1 postgres  wheel  253952 Jan 16 10:19 0003
-rw-------   1 postgres  wheel   90112 Jan 22 15:34 0004

The error seems to suggest that it is at EOF...

Darren

LOG:  startup process (PID 765) was terminated by signal 6
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted while in recovery at 2008-02-20
00:44:08 PST
HINT:  This probably means that some data is corrupted and you will have
to use the last backup for recovery.
LOG:  checkpoint record is at 2/64507F8C
LOG:  redo record is at 2/64507F8C; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 4570963; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 2/64507FD0
PANIC:  could not access status of transaction 4570965
DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
Undefined error: 0
LOG:  startup process (PID 306) was terminated by signal 6
LOG:  aborting startup due to startup process failure


Re: Power outage borked things (8.1.10)...

From
Tom Lane
Date:
Darren Reed <darrenr+postgres@fastmail.net> writes:
> Starting up postgres, I get the log contents below.
> Is it really as bad as it suggests, namely that I
> need to recover from backup?

Probably :-(

pg_resetxlog would let you into the database, but I do not have high
hopes about the consistency/correctness of what you'll find.  The best
advice would be:

1. pg_resetxlog

2. pg_dumpall

3. initdb, restore from backup

4. compare dump from step 2 to backup dump, apply any changes that seem
sane

And don't forget

5. Figure out why a simple power failure was able to do this to you,
and fix it.  The most likely bet is that your disk drives are lying
about write completion ... see the PG archives for discussion.

            regards, tom lane

Re: Power outage borked things (8.1.10)...

From
Darren Reed
Date:
Tom Lane wrote:
> Darren Reed <darrenr+postgres@fastmail.net> writes:
> > Starting up postgres, I get the log contents below.
> > Is it really as bad as it suggests, namely that I
> > need to recover from backup?
>
> Probably :-(

I've started a new db in parallel with the old data and I'm rebuilding,
and if I can rebuild quicker than I can recover old data, I'll do that.

> pg_resetxlog would let you into the database, but I do not have high
> hopes about the consistency/correctness of what you'll find.  The best
> advice would be:
>
> 1. pg_resetxlog

# su postgres -c "/usr/pkg/bin/pg_resetxlog -n /data/db"
pg_control values:

pg_control version number:            812
Catalog version number:               200510211
Database system identifier:           5138205682483264479
Current log file ID:                  2
Next log file segment:                103
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          4570963
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C
# su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
Transaction log reset
And a start is greated with:
LOG:  database system was shut down at 2008-02-20 11:04:51 PST
LOG:  checkpoint record is at 2/6A00001C
LOG:  redo record is at 2/6A00001C; undo record is at 2/6A00001C;
shutdown TRUE
LOG:  next transaction ID: 4570963; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
PANIC:  could not access status of transaction 4570963
DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
Undefined error: 0
LOG:  startup process (PID 29662) was terminated by signal 6
LOG:  aborting startup due to startup process failure


>
> 2. pg_dumpall
>
> 3. initdb, restore from backup
>
> 4. compare dump from step 2 to backup dump, apply any changes that seem
> sane
>
> And don't forget
>
> 5. Figure out why a simple power failure was able to do this to you,
> and fix it.  The most likely bet is that your disk drives are lying
> about write completion ... see the PG archives for discussion.

Yes, that I can believe.

Darren


Re: Power outage borked things (8.1.10)...

From
Tom Lane
Date:
Darren Reed <darrenr+postgres@fastmail.net> writes:
> # su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
> Transaction log reset

> PANIC:  could not access status of transaction 4570963
> DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
> Undefined error: 0

You may need to append a few pages of zeroes to pg_clog/0004 ...

            regards, tom lane

Re: Power outage borked things (8.1.10)...

From
Darren Reed
Date:
Tom Lane wrote:
> Darren Reed <darrenr+postgres@fastmail.net> writes:
> > # su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
> > Transaction log reset
>
> > PANIC:  could not access status of transaction 4570963
> > DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
> > Undefined error: 0
>
> You may need to append a few pages of zeroes to pg_clog/0004 ...
>

So doing this:
# dd if=/dev/zero bs=8192 count=1 | cat >> 0004

allowed the database to start, but tables are missing.

I'll keep restoring.

Thanks,
Darren


Re: Power outage borked things (8.1.10)...

From
Michael Monnerie
Date:
On Mittwoch, 20. Februar 2008 Tom Lane wrote:
> 5. Figure out why a simple power failure was able to do this to you,
> and fix it.  The most likely bet is that your disk drives are lying
> about write completion ... see the PG archives for discussion.

It could be a caching RAID controller without battery backup unit.
Nothing help you in a power failure then, except a good backup.
Had this with a customer - no more filesystem after reboot...

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net                   Key-ID: 1C1209B4

Attachment

Re: Power outage borked things (8.1.10)...

From
Hannes Dorbath
Date:
Michael Monnerie wrote:
> On Mittwoch, 20. Februar 2008 Tom Lane wrote:
>> 5. Figure out why a simple power failure was able to do this to you,
>> and fix it.  The most likely bet is that your disk drives are lying
>> about write completion ... see the PG archives for discussion.
>
> It could be a caching RAID controller without battery backup unit.
> Nothing help you in a power failure then, except a good backup.
> Had this with a customer - no more filesystem after reboot...

I hardly understand why such things still happen today. It's one of the
most important admin tasks to make sure fsync / fua behave as they should.

Wouldn't it be a good idea to add a chapter to the installation docs?

They cover SHM settings and vm.overcommit -- why not a short
introduction to fsync, write caches, write barriers and a few examples
of sane and problematic setups as well as links to tools to verify a setup?


--
Best regards,
Hannes Dorbath

Re: Power outage borked things (8.1.10)...

From
Steve Holdoway
Date:
On Wed, 20 Feb 2008 12:22:45 -0800
Darren Reed <darrenr+postgres@fastmail.net> wrote:

> # dd if=/dev/zero bs=8192 count=1 | cat >> 0004
dd if=/dev/zero bs=8192 count=1 >> 0004

saves loads of electrons (:

Attachment

Re: Power outage borked things (8.1.10)...

From
Bruce Momjian
Date:
Hannes Dorbath wrote:
> Michael Monnerie wrote:
> > On Mittwoch, 20. Februar 2008 Tom Lane wrote:
> >> 5. Figure out why a simple power failure was able to do this to you,
> >> and fix it.  The most likely bet is that your disk drives are lying
> >> about write completion ... see the PG archives for discussion.
> >
> > It could be a caching RAID controller without battery backup unit.
> > Nothing help you in a power failure then, except a good backup.
> > Had this with a customer - no more filesystem after reboot...
>
> I hardly understand why such things still happen today. It's one of the
> most important admin tasks to make sure fsync / fua behave as they should.
>
> Wouldn't it be a good idea to add a chapter to the installation docs?
>
> They cover SHM settings and vm.overcommit -- why not a short
> introduction to fsync, write caches, write barriers and a few examples
> of sane and problematic setups as well as links to tools to verify a setup?

We have a new section added in 8.3 for exactly that:

    http://www.postgresql.org/docs/8.3/static/wal-reliability.html

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Power outage borked things (8.1.10)...

From
Hannes Dorbath
Date:
Bruce Momjian wrote:
> We have a new section added in 8.3 for exactly that:

That's great, I missed it. May I suggest to add a warning about LVM,
EVMS, DRBD and Linux MD RAID 5? All do currently not support write
barriers and are therefor not safe with volatile write caches below.


--
Best regards,
Hannes Dorbath

Re: Power outage borked things (8.1.10)...

From
Michael Monnerie
Date:
On Sonntag, 2. März 2008 Hannes Dorbath wrote:
> That's great, I missed it. May I suggest to add a warning about LVM,
> EVMS, DRBD and Linux MD RAID 5? All do currently not support write
> barriers and are therefor not safe with volatile write caches below.

And what about VMware, XEN, etc., used on a Linux host with hardware
RAID with write-back cache? Not sure if this is 100% safe.

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net                   Key-ID: 1C1209B4

Attachment