Re: DB/clog corruption - Mailing list pgsql-general
From: Reid Thompson
Subject: Re: DB/clog corruption
Date:
Msg-id: F71C0DC6B4FD3648815AAA7F969E352901530B20@sr002-2kexc.ateb.com
In response to: DB/clog corruption ("Reid Thompson" <Reid.Thompson@ateb.com>)
List: pgsql-general
Tom Lane wrote:
> "Reid Thompson" writes:
>> $ pg_dump -f table_dump.sql -t table_with_error dbname
>> FATAL 2: read of clog file 1559, offset 0 failed: Success
>
> You've got a clobbered tuple header in that file (ridiculous
> xmin or xmax value). Alternatively, given the evident age of
> this server, it could be a clobbered page header leading to
> the system following a bogus pointer to a tuple header.
>
> I think the track you want to pursue is identifying which
> page contains the bad data and looking at it with pg_filedump
> to see if there's anything recoverable there at all. If not,
> just zeroing out the page is probably the easiest way of
> getting to a dumpable state. You've probably lost this
> particular tuple in any case, you might or might not have lost the
> whole page.
>
> Once you get out of this, you might want to think about
> updating to something newer than 7.2 ...
>
> regards, tom lane

DB corruption resolution.

PG_VERSION: 7.2.3

First, I would like to very much thank Tom Lane for guiding me in
resolving this.

Issues:
A procedural error resulted in a full filesystem (out of disk space),
which in turn made it impossible to create pg_clog file(s). There was
also a corrupted page in a single table.

Symptoms:
pg_dump failure, "select * from" query failure, vacuum failure, and an
all-around "Oh crap".

Disclaimer:
If you are having a DB issue, do NOT assume that the steps outlined
below will resolve it for you. The mailing list archives note that
there is no single procedure for recovery; most incidents are unique
in some way. Many incidents may require some of the same steps, but
there is no cookie-cutter resolution. Search the archives, read the
threads regarding incidents that are similar to or match yours, and
request help from the mailing list if you feel it is warranted. This
synopsis is meant to add to the understanding one gets when searching
the archives for the issues it addresses.

Initial symptom noted:

pg_dump: FATAL 2: open of /postgres/data/pg_clog/0202 failed: No such file or directory
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "table_name_here" failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL 1: The database system is starting up
pg_dump: The command was: COPY "table_name_here" TO stdout;
ERROR: pg_dump "dbname"
ERROR: backup failed

Before performing any work, I used pg_dump to individually dump all of
the other tables in the database.

The first step was to address the pg_clog file issue. Note: PG versions
7.3 and above should not require manual intervention for pg_clog files
after an out-of-space condition. Procedures outlined in the mailing
list archives provided for this; querying the archives for
"dd bs= pg_clog", etc. should provide ample reading material on
extending the file in the proper block-size increments for your
situation. My pg instance used the default 8k block size, so:

    shutdown the db
    dd bs=8k count=1 < /dev/zero >> /postgres/data/pg_clog/0202
    startup the db

Note that the final size of your pg_clog file may need to be something
other than 8k; see the various threads in the archives related to this
issue.
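Pulled together, the clog-extension step looked roughly like the sketch
below. It assumes the default 8k block size; the data directory, the
segment name 0202, and the use of pg_ctl for shutdown/startup are
specific to my setup (use whatever init scripts you normally use), and
your file may need to be extended by more than one block:

    # stop the database before touching pg_clog
    pg_ctl -D /postgres/data stop
    # append one zeroed 8k block to the missing/short clog segment
    dd bs=8k count=1 < /dev/zero >> /postgres/data/pg_clog/0202
    # start the database back up
    pg_ctl -D /postgres/data start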
Getting past this yielded a different error:

$ pg_dump -f table_dump.sql -t table_with_error dbname
FATAL 2: read of clog file 1559, offset 0 failed: Success
pg_dump: FATAL 2: read of clog file 1559, offset 0 failed: Success
pg_dump: lost synchronization with server, resetting connection
DEBUG: server process (pid 1672) exited with exit code 2
DEBUG: terminating any other active server processes
DEBUG: all server processes terminated; reinitializing shared memory and semaphores
DEBUG: database system was interrupted at 2005-07-11 08:41:39 EDT
DEBUG: checkpoint record is at 1/669E0528
DEBUG: redo record is at 1/669E0528; undo record is at 0/0; shutdown TRUE
DEBUG: next transaction id: 1237894; next oid: 2850421
DEBUG: database system was not properly shut down; automatic recovery in progress
FATAL 1: The database system is starting up
pg_dump: SQL command to dump the contents of table "table_with_error" failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL 1: The database system is starting up
pg_dump: The command was: COPY "table_with_error" TO stdout;
$
DEBUG: ReadRecord: record with zero length at 1/669E0568
DEBUG: redo is not required
DEBUG: database system is ready

Research in the archives, followed by a post to the mailing list and a
response from it, led to using oid2name and pg_filedump to work toward
a resolution. oid2name came from the contrib source for my pg version.
oid2name was used to get the pg filename for the affected table
(162239), and then find was used to determine where the table file
resided (find /postgres/data -name 162239). After oid2name and find
provided /postgres/data/base/1153088/162239 for the affected table,
pg_filedump was used to find the cause of the error.

As an alternative to having to find/install oid2name and use it as
above, see http://www.postgresql.org/docs/8.0/static/storage.html for
a description of the database file layout.

Note that pg_filedump has some version specificity depending on your
pg version. An archive search should provide links for download (I
used http://sources.redhat.com/rhdb/utilities.html to get the source).
The same search should provide some insight into how to use pg_filedump
for error determination (the included README gives example usage as
well -- read it).

I shut down postgresql, copied the table file to a work area, and used
pg_filedump -i -f to generate output.
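Consolidated, the locate-and-inspect step looked roughly like the
sketch below. The oid2name invocation is approximate (its flags have
varied between contrib versions), and the work-area path and output
file name are just placeholders; the find and pg_filedump commands are
the ones actually used:

    # look up the on-disk file name (filenode) of the affected table;
    # oid2name flags are approximate -- check your contrib version's usage
    oid2name -d dbname -t table_with_error

    # locate that file under the data directory (162239 in my case)
    find /postgres/data -name 162239

    # with postgresql shut down, inspect a copy of the file
    pg_ctl -D /postgres/data stop
    cp /postgres/data/base/1153088/162239 /some/work/area/
    pg_filedump -i -f /some/work/area/162239 > filedump.out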
Investigating through the output from pg_filedump on my affected table
yielded the following:

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 162239
* Options used: -i -f
*
* Dump created on: Tue Jul 12 14:13:44 2005
*******************************************************************

Block    0 ********************************************************
.... info & data

Block    1 ********************************************************
.... info & data

Block    N ********************************************************
....

Block  347 ********************************************************
<Header> -----
 Block Offset: 0x002b6000              Offsets: Lower   62121 (0xf2a9)
 Block Size: 16732                              Upper   16732 (0x415c)
 LSN:  logid 1720768 recoff 0x00001000          Special 62121 (0xf2a9)
 Items: 15525                          Free Space: 4294921907
 Length (including item array): 8192

 Error: Invalid header information.
 Error: End of block encountered within the header. Bytes read: 8192.

  0000: c0411a00 00100000 040f5d41 a9f25c41  .A........]A..\A
  0010: a9f25c41 00000000 1a000200 08000000  ..\A............
  0020: 00000000 00000000 fe010100 00000000  ................
  0030: 00000000 00000000 00000000 00000000  ................
  0040: 00000000 00000000 00000000 00000000  ................
  0050: 00000000 00000000 00000000 00000000  ................
  0060: 00000000 f3cff538 00000000 00000000  .......8........
  0070: 00000000 00000000 00000000 00000000  ................
  0080: 80811a00 00e00300 a1f25c41 598f823f  ..........\AY..?
  0090: 598f823f 00000000 1a000100 f8010000  Y..?............

... lots more 'junk' like the above, followed by what appeared to be
several valid records, the end of the Block, and the start of the next
Block.

Another search in the archives and a response from Tom later, I knew
that I had a corrupt page in the table. The next step was to zero out
the page. The archives describe examples of doing this, and Tom's
response stated pretty explicitly what I needed to do:

    dd conv=notrunc bs=8k seek=347 count=1 if=/dev/zero of=datafile

Given that my datafile is /postgres/data/base/1153088/162239 (provided
by the earlier use of oid2name and find):

    cp /postgres/data/base/1153088/162239 /to/a/safe/place/in/case/i/screw/up/so/i/can/recover
    dd conv=notrunc bs=8k seek=347 count=1 if=/dev/zero of=/postgres/data/base/1153088/162239

I then ran "pg_filedump -i -f /postgres/data/base/1153088/162239" and
captured the output again. Block 347 now looks like this:

Block  347 ********************************************************
<Header> -----
 Block Offset: 0x002b6000              Offsets: Lower       0 (0x0000)
 Block Size: 0                                  Upper       0 (0x0000)
 LSN:  logid 0 recoff 0x00000000                Special     0 (0x0000)
 Items: -5                             Free Space: 0
 Length (including item array): 24

 Error: Invalid header information.

  0000: 00000000 00000000 00000000 00000000  ................
  0010: 00000000 00000000                    ........

<Data> ------
 Error: Item index corrupt on block. Offset: <-5>.

<Special Section> -----
 Error: Invalid special section encountered.
 Error: Special section points off page. Unable to dump contents.

I then restarted postgresql and successfully pg_dump'd the affected
table. I then pg_dump'd the entire db and reloaded the db from the
dump.
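A rough sketch of those final steps, for reference. Reloading into a
freshly created database is just one reasonable way to restore a plain
SQL dump; the dump file names and the dbname_restored database are
illustrative, not from the original incident:

    # restart postgresql after zeroing the page
    pg_ctl -D /postgres/data start

    # confirm the previously failing table now dumps cleanly
    pg_dump -f table_dump.sql -t table_with_error dbname

    # dump the whole database, then reload it
    pg_dump -f full_dump.sql dbname
    createdb dbname_restored          # illustrative target database
    psql -d dbname_restored -f full_dump.sql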