Database corruption? - Mailing list pgsql-general

From Alvaro Herrera
Subject Database corruption?
Date
Msg-id Pine.LNX.4.33L2.0110220136050.13022-100000@aguila.protecne.cl
Whole thread Raw
Responses Re: Database corruption?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Hello:

I didn't think I would live to see a corrupted database, but I now have.
I don't know how it happened; it looks like some table got corrupted
during VACUUM and now the database won't start. The serverlog shows:

invoking IpcMemoryCreate(size=3203072)
FindExec: found "/usr/local/pgsql/bin/postmaster" using argv[0]
DEBUG:  database system shutdown was interrupted at 2001-10-22 01:03:37 CLST
DEBUG:  CheckPoint record at (13, 3399750448)
DEBUG:  Redo record at (13, 3399750448); Undo record at (0, 0); Shutdown TRUE
DEBUG:  NextTransactionId: 20960076; NextOid: 41447617
DEBUG:  database system was not properly shut down; automatic recovery in progress...
DEBUG:  redo starts at (13, 3399750512)
REDO @ 13/3399750512; LSN 13/3399750548: prev 13/3399750448; xprev 0/0; xid 20960086: XLOG - nextOid: 41455809
REDO @ 13/3399750548; LSN 13/3399758820: prev 13/3399750512; xprev 0/0; xid 20960086; bkpb 1: Heap - insert: node
16283895/16287107;tid 333/97 
REDO @ 13/3399758820; LSN 13/3399767092: prev 13/3399750548; xprev 13/3399750548; xid 20960086; bkpb 1: Btree - insert:
node16283895/23651833; tid 195/257 
REDO @ 13/3399767092; LSN 13/3399767164: prev 13/3399758820; xprev 13/3399758820; xid 20960086: Heap - insert: node
16283895/16287107;tid 333/98 
REDO @ 13/3399767164; LSN 13/3399767228: prev 13/3399767092; xprev 13/3399767092; xid 20960086: Btree - insert: node
16283895/23651833;tid 195/258 
REDO @ 13/3399767228; LSN 13/3399767300: prev 13/3399767164; xprev 13/3399767164; xid 20960086: Heap - insert: node
16283895/16287107;tid 333/99 
REDO @ 13/3399767300; LSN 13/3399767364: prev 13/3399767228; xprev 13/3399767228; xid 20960086: Btree - insert: node
16283895/23651833;tid 195/259 
REDO @ 13/3399767364; LSN 13/3399767448: prev 13/3399767300; xprev 13/3399767300; xid 20960086: Heap - update: node
16283895/16287620;tid 428/79; new 428/129 
REDO @ 13/3399767448; LSN 13/3399775720: prev 13/3399767364; xprev 13/3399767364; xid 20960086; bkpb 1: Btree - insert:
node16283895/23651923; tid 4/2 
REDO @ 13/3399775720; LSN 13/3399775780: prev 13/3399767448; xprev 13/3399767448; xid 20960086: Btree - insert: node
16283895/23651926;tid 233/6 
DEBUG:  ReadRecord: record with zero len at (13, 3399775780)
DEBUG:  redo done at (13, 3399775720)
XLogFlush: rqst 13/3399767300; wrt 13/3399775780; flsh 13/3399775780
XLogFlush: rqst 13/3399767364; wrt 13/3399775780; flsh 13/3399775780
XLogFlush: rqst 13/3400103600; wrt 13/3399775780; flsh 13/3399775780
FATAL 2:  XLogFlush: request is not satisfied
DEBUG:  proc_exit(2)
DEBUG:  shmem_exit(2)
DEBUG:  exit(2)
/usr/local/pgsql/bin/postmaster: reaping dead processes...
/usr/local/pgsql/bin/postmaster: Startup proc 3855 exited with status 512 - abort
/usr/local/pgsql/bin/postmaster: PostmasterMain: initial environ dump:
[blah]


And here is a backtrace taken from a core file I found laying around,
which has a timestamp makes me think it has something to say:

(gdb) bt
#0  0x4018cbf4 in memmove () from /lib/libc.so.6
#1  0x08100f85 in PageRepairFragmentation ()
#2  0x080ae9a7 in scan_heap ()
#3  0x080adfb4 in vacuum_rel ()
#4  0x080adbee in vac_vacuum ()
#5  0x080adb68 in vacuum ()
#6  0x08105c72 in ProcessUtility ()
#7  0x081039d9 in pg_exec_query_string ()
#8  0x08104adb in PostgresMain ()
#9  0x080ee7e4 in DoBackend ()
#10 0x080ee3c5 in BackendStartup ()
#11 0x080ed599 in ServerLoop ()
#12 0x080ecfa6 in PostmasterMain ()
#13 0x080ccb8f in main ()
#14 0x401231f0 in __libc_start_main () from /lib/libc.so.6

The database has been running for months without trouble. I'm now trying
desperate measures, but I fear I will have to restore from backup (a week
old). I have taken a tarball of the complete location (pg_xlog included and
all that stuff) if anyone wants to see it (but it's 2 GB).

I think I know what table is dead, but I don't know what to do with that
information :-( In the serverlog, I see

DEBUG:  --Relation delay_171--
NOTICE:  Rel delay_171: TID 15502/4279: OID IS INVALID. TUPGONE 0.
NOTICE:  Rel delay_171: TID 15502/4291: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel delay_171: TID 15502/4315: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel delay_171: TID 15502/4375: OID IS INVALID. TUPGONE 0.
NOTICE:  Rel delay_171: TID 15502/4723: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel delay_171: TID 15502/4771: OID IS INVALID. TUPGONE 0.
NOTICE:  Rel delay_171: TID 15502/4783: OID IS INVALID. TUPGONE 0.
NOTICE:  Rel delay_171: TID 15502/4831: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel delay_171: TID 15502/4843: OID IS INVALID. TUPGONE 0.
NOTICE:  Rel delay_171: TID 15502/4867: InsertTransactionInProgress 0 - can't shrink relation
NOTICE:  Rel delay_171: TID 15502/4867: OID IS INVALID. TUPGONE 0.
[a lot similarly looking lines]
NOTICE:  Rel delay_171: TID 15502/6067: OID IS INVALID. TUPGONE 0.
Server process (pid 22773) exited with status 139 at Sun Oct 21 02:30:27 2001
Terminating any active server processes...
NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend  died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am       going to terminate your database system connection and
exit.
        Please reconnect to the database system and repeat your query.

(this is way before the lines from the serverlog I showed earlier)

I really don't know what to do from here.

--
Alvaro Herrera (<alvherre[@]atentus.com>)
"La rebeldia es la virtud original del hombre" (Arthur Schopenhauer)


pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Corrupted end of Record
Next
From: Barry Lind
Date:
Subject: Re: Problems setting up PostgreSQL and Tomcat JDBC connection