Thread: Power outage borked things (8.1.10)...
Starting up postgres, I get the log contents below.
Is it really as bad as it suggests, namely that I
need to recover from backup?

The pg_clog directory has the following:

# ls -al pg_clog/
total 1216
drwx------   2 postgres  wheel     512 Jan 17 09:09 ./
drwx------  10 postgres  wheel     512 Feb 20 00:44 ../
-rw-------   1 postgres  wheel  253952 Dec  8 11:09 0000
-rw-------   1 postgres  wheel  253952 Dec 19 17:29 0001
-rw-------   1 postgres  wheel  253952 Jan  2 17:49 0002
-rw-------   1 postgres  wheel  253952 Jan 16 10:19 0003
-rw-------   1 postgres  wheel   90112 Jan 22 15:34 0004

The error seems to suggest that it is at EOF...

Darren

LOG:  startup process (PID 765) was terminated by signal 6
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted while in recovery at 2008-02-20 00:44:08 PST
HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG:  checkpoint record is at 2/64507F8C
LOG:  redo record is at 2/64507F8C; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 4570963; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 2/64507FD0
PANIC:  could not access status of transaction 4570965
DETAIL:  could not read from file "pg_clog/0004" at offset 90112: Undefined error: 0
LOG:  startup process (PID 306) was terminated by signal 6
LOG:  aborting startup due to startup process failure
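[A quick sketch of why the failing offset lands exactly at EOF, assuming 8.1's clog layout (2 status bits per transaction, so 32768 transactions per 8 kB page, and 32 pages of 256 kB per segment file); the constants here come from that assumed layout, not from the thread itself:]

```shell
# Map a transaction ID to its pg_clog segment file and byte offset,
# assuming 8 kB pages holding 32768 xact statuses, 32 pages per segment.
xid=4570965                                # transaction from the PANIC
page=$(( xid / 32768 ))                    # clog page holding this xid
printf 'segment %04X\n' $(( page / 32 ))   # which pg_clog file
echo "offset $(( (page % 32) * 8192 ))"    # byte offset inside that file
```

Under those assumptions this maps transaction 4570965 to segment 0004 at offset 90112 -- and the listing above shows 0004 is exactly 90112 bytes long, so the status read falls right at end-of-file.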
Darren Reed <darrenr+postgres@fastmail.net> writes:
> Starting up postgres, I get the log contents below.
> Is it really as bad as it suggests, namely that I
> need to recover from backup?

Probably :-(

pg_resetxlog would let you into the database, but I do not have high
hopes about the consistency/correctness of what you'll find. The best
advice would be:

1. pg_resetxlog
2. pg_dumpall
3. initdb, restore from backup
4. compare dump from step 2 to backup dump, apply any changes that seem sane

And don't forget

5. Figure out why a simple power failure was able to do this to you,
and fix it. The most likely bet is that your disk drives are lying
about write completion ... see the PG archives for discussion.

			regards, tom lane
Tom Lane wrote:
> Darren Reed <darrenr+postgres@fastmail.net> writes:
>> Starting up postgres, I get the log contents below.
>> Is it really as bad as it suggests, namely that I
>> need to recover from backup?
>
> Probably :-(

I've started a new db in parallel with the old data and I'm rebuilding,
and if I can rebuild quicker than I can recover old data, I'll do that.

> pg_resetxlog would let you into the database, but I do not have high
> hopes about the consistency/correctness of what you'll find. The best
> advice would be:
>
> 1. pg_resetxlog

# su postgres -c "/usr/pkg/bin/pg_resetxlog -n /data/db"
pg_control values:

pg_control version number:            812
Catalog version number:               200510211
Database system identifier:           5138205682483264479
Current log file ID:                  2
Next log file segment:                103
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          4570963
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

# su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
Transaction log reset

And a start is greeted with:

LOG:  database system was shut down at 2008-02-20 11:04:51 PST
LOG:  checkpoint record is at 2/6A00001C
LOG:  redo record is at 2/6A00001C; undo record is at 2/6A00001C; shutdown TRUE
LOG:  next transaction ID: 4570963; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
PANIC:  could not access status of transaction 4570963
DETAIL:  could not read from file "pg_clog/0004" at offset 90112: Undefined error: 0
LOG:  startup process (PID 29662) was terminated by signal 6
LOG:  aborting startup due to startup process failure

> 2. pg_dumpall
>
> 3. initdb, restore from backup
>
> 4. compare dump from step 2 to backup dump, apply any changes that seem
> sane
>
> And don't forget
>
> 5. Figure out why a simple power failure was able to do this to you,
> and fix it. The most likely bet is that your disk drives are lying
> about write completion ... see the PG archives for discussion.

Yes, that I can believe.

Darren
Darren Reed <darrenr+postgres@fastmail.net> writes:
> # su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
> Transaction log reset

> PANIC:  could not access status of transaction 4570963
> DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
> Undefined error: 0

You may need to append a few pages of zeroes to pg_clog/0004 ...

			regards, tom lane
Tom Lane wrote:
> Darren Reed <darrenr+postgres@fastmail.net> writes:
>> # su postgres -c "/usr/pkg/bin/pg_resetxlog -f /data/db"
>> Transaction log reset
>>
>> PANIC:  could not access status of transaction 4570963
>> DETAIL:  could not read from file "pg_clog/0004" at offset 90112:
>> Undefined error: 0
>
> You may need to append a few pages of zeroes to pg_clog/0004 ...

So doing this:

# dd if=/dev/zero bs=8192 count=1 | cat >> 0004

allowed the database to start, but tables are missing.

I'll keep restoring.

Thanks,
Darren
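[What the append does, illustrated on a throwaway file rather than a live clog segment -- the temp file here is a stand-in for the 90112-byte pg_clog/0004, and the server must be stopped before touching the real one:]

```shell
# Demo on a temp file standing in for the truncated pg_clog/0004.
# (Hypothetical stand-in; never modify a real clog while postgres runs.)
clog=$(mktemp)
dd if=/dev/zero bs=8192 count=11 of="$clog" 2>/dev/null  # 11 pages = 90112 bytes
dd if=/dev/zero bs=8192 count=1 2>/dev/null >> "$clog"   # append one zero page
wc -c < "$clog" | tr -d ' '                              # prints 98304
rm -f "$clog"
```

The file grows from 90112 to 98304 bytes, so the status read at offset 90112 now succeeds instead of hitting EOF. Note this only lets startup proceed; the real status bits for those transactions are gone, which is consistent with Darren finding tables missing afterwards.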
On Wednesday, 20 February 2008, Tom Lane wrote:
> 5. Figure out why a simple power failure was able to do this to you,
> and fix it. The most likely bet is that your disk drives are lying
> about write completion ... see the PG archives for discussion.

It could be a caching RAID controller without a battery backup unit.
Nothing helps you in a power failure then, except a good backup.
Had this with a customer - no more filesystem after reboot...

regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net                   Key-ID: 1C1209B4
Michael Monnerie wrote: > On Mittwoch, 20. Februar 2008 Tom Lane wrote: >> 5. Figure out why a simple power failure was able to do this to you, >> and fix it. The most likely bet is that your disk drives are lying >> about write completion ... see the PG archives for discussion. > > It could be a caching RAID controller without battery backup unit. > Nothing help you in a power failure then, except a good backup. > Had this with a customer - no more filesystem after reboot... I hardly understand why such things still happen today. It's one of the most important admin tasks to make sure fsync / fua behave as they should. Wouldn't it be a good idea to add a chapter to the installation docs? They cover SHM settings and vm.overcommit -- why not a short introduction to fsync, write caches, write barriers and a few examples of sane and problematic setups as well as links to tools to verify a setup? -- Best regards, Hannes Dorbath
On Wed, 20 Feb 2008 12:22:45 -0800
Darren Reed <darrenr+postgres@fastmail.net> wrote:
> # dd if=/dev/zero bs=8192 count=1 | cat >> 0004

dd if=/dev/zero bs=8192 count=1 >> 0004

saves loads of electrons (:
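[The two spellings are indeed byte-for-byte equivalent; a quick check on mktemp files (the file names are generated, nothing here touches a real clog):]

```shell
# Show that piping through cat and redirecting directly append the same bytes.
a=$(mktemp); b=$(mktemp)
dd if=/dev/zero bs=8192 count=1 2>/dev/null | cat >> "$a"
dd if=/dev/zero bs=8192 count=1 2>/dev/null >> "$b"
cmp -s "$a" "$b" && echo identical   # prints "identical"
rm -f "$a" "$b"
```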
Hannes Dorbath wrote:
> Michael Monnerie wrote:
>> On Wednesday, 20 February 2008, Tom Lane wrote:
>>> 5. Figure out why a simple power failure was able to do this to you,
>>> and fix it. The most likely bet is that your disk drives are lying
>>> about write completion ... see the PG archives for discussion.
>>
>> It could be a caching RAID controller without a battery backup unit.
>> Nothing helps you in a power failure then, except a good backup.
>> Had this with a customer - no more filesystem after reboot...
>
> I hardly understand why such things still happen today. It's one of the
> most important admin tasks to make sure fsync / FUA behave as they should.
>
> Wouldn't it be a good idea to add a chapter to the installation docs?
>
> They cover SHM settings and vm.overcommit -- why not a short
> introduction to fsync, write caches and write barriers, with a few
> examples of sane and problematic setups, as well as links to tools to
> verify a setup?

We have a new section added in 8.3 for exactly that:

	http://www.postgresql.org/docs/8.3/static/wal-reliability.html

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
> We have a new section added in 8.3 for exactly that:

That's great, I missed it. May I suggest adding a warning about LVM,
EVMS, DRBD and Linux MD RAID 5? None of them currently supports write
barriers, so they are not safe with volatile write caches below them.

-- 
Best regards,
Hannes Dorbath
On Sunday, 2 March 2008, Hannes Dorbath wrote:
> That's great, I missed it. May I suggest adding a warning about LVM,
> EVMS, DRBD and Linux MD RAID 5? None of them currently supports write
> barriers, so they are not safe with volatile write caches below them.

And what about VMware, Xen, etc., used on a Linux host with a hardware
RAID controller with write-back cache? Not sure if this is 100% safe.

regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net                   Key-ID: 1C1209B4