Thread: DB failure?
PostgreSQL: 7.4.1

Last week, I had a corrupt index on one table with 2 million rows. On a specific search, the database would SEGV. I dropped and recreated the index involved in the search, and did a REINDEX on the primary key. That problem went away.

Now I'm seeing:

db=> select count(*) from messages;
ERROR: could not access status of transaction 859000513
DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or directory

db=> select count(*) from message_recipients;
ERROR: invalid page header in block 1238604 of relation "message_recipients"

The above commands were successful on 8/21. There are 240 million rows. Dump/reload is not something that would be an attractive option. Based on previous timings, it could take 2-3 days (this is with a dual hyper-threaded 2.4 GHz with 2GB memory and an 8 drive RAID 5) on a production system.

So far, this database has been all INSERTs - no deletes or updates.

Is there a way to recover this without a dump/reload? We do nightly backups, but since we don't know when the problem really started, it would be rather difficult to restore and reapply several million updates.

Wes
On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegra.com> wrote:

> db=> select count(*) from messages;
> ERROR: could not access status of transaction 859000513
> DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> directory
>
> db=> select count(*) from message_recipients;
> ERROR: invalid page header in block 1238604 of relation
> "message_recipients"

Uh, oh.. This would appear to be a big problem... I just tried to do a pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:

LOG: server process (PID 12541) was terminated by signal 11
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
LOG: checkpoint record is at 6E/13A916C
LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 173895; next OID: 243689524
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 6E/13A91AC
LOG: redo is not required
LOG: recycled transaction log file "0000006E00000000"
LOG: database system is ready
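A side note on the two error messages quoted above, as a rough sketch rather than anything taken from the thread itself: the numbers in them can be mapped to on-disk locations with a little arithmetic. The constants below are assumptions about a default PostgreSQL build (8 kB blocks, 2 status bits per transaction, 32 pages per pg_clog segment, 1 GB relation segment files), and the function names are made up for illustration. Under those assumptions, transaction 859000513 does fall in clog segment 0333, which matches the file name in the error. That ID is also far above the "next transaction ID: 173895" shown in the recovery log, which may point to a damaged row header rather than a clog file that was ever supposed to exist.

```python
# Assumed defaults for a stock PostgreSQL build; not stated anywhere in the thread.
BLOCK_SIZE = 8192                  # bytes per page
XACTS_PER_BYTE = 4                 # 2 status bits per transaction
CLOG_PAGES_PER_SEGMENT = 32        # pages per pg_clog segment file
RELATION_SEGMENT_BYTES = 1024 ** 3 # table files are split at 1 GB


def clog_segment_name(xid):
    """Return the pg_clog file name that would hold the status of xid."""
    xacts_per_segment = BLOCK_SIZE * XACTS_PER_BYTE * CLOG_PAGES_PER_SEGMENT
    return format(xid // xacts_per_segment, "04X")


def heap_block_location(block):
    """Return (segment number, block within segment, byte offset in that file)."""
    blocks_per_segment = RELATION_SEGMENT_BYTES // BLOCK_SIZE
    segment, block_in_segment = divmod(block, blocks_per_segment)
    return segment, block_in_segment, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    print(clog_segment_name(859000513))   # 0333
    print(heap_block_location(1238604))   # (9, 58956, 482967552)
```

If the assumptions hold, block 1238604 of "message_recipients" would sit in the relation's tenth segment file (suffix ".9"), which gives a starting point for inspecting the bad page directly.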
Time to test your memory and harddisk....

On Tue, Aug 31, 2004 at 12:24:53AM -0500, Wes wrote:
> On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegra.com> wrote:
>
> > db=> select count(*) from messages;
> > ERROR: could not access status of transaction 859000513
> > DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> > directory
> >
> > db=> select count(*) from message_recipients;
> > ERROR: invalid page header in block 1238604 of relation
> > "message_recipients"
>
> Uh, oh.. This would appear to be a big problem... I just tried to do a
> pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:
>
> LOG: server process (PID 12541) was terminated by signal 11
> LOG: terminating any other active server processes
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> LOG: all server processes terminated; reinitializing
> LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
> LOG: checkpoint record is at 6E/13A916C
> LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
> LOG: next transaction ID: 173895; next OID: 243689524
> LOG: database system was not properly shut down; automatic recovery in
> progress
> LOG: record with zero length at 6E/13A91AC
> LOG: redo is not required
> LOG: recycled transaction log file "0000006E00000000"
> LOG: database system is ready

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Mon, 30 Aug 2004, Wes Palmer wrote:

> PostgreSQL: 7.4.1
>
> Last week, I had a corrupt index on one table with 2 million rows. On a
> specific search, the database would SEGV. I dropped and recreated the index
> involved in the search, and did a REINDEX on the primary key. That problem
> went away.
>
> Now I'm seeing:
>
> db=> select count(*) from messages;
> ERROR: could not access status of transaction 859000513
> DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> directory

I saw both of the above types of errors (transaction status, SEGV) yesterday on a box that turned out to have issues writing files correctly to disk. I wrote a tool that writes a large file to disk and then reopens it to read and verify the contents; it would fail every so often (about 1-5% of the time). What puzzles me is that the machine would work at all with an issue like that. Turning the battery-backed cache on or off had no effect. The machine has ECC memory; I tested that as well, but the test turned up nothing.

I am using PostgreSQL 7.4.1.
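The write-then-verify check described above could look roughly like the following. This is only a sketch of the general idea, not the poster's actual tool: the function names, the chunk size, and the use of SHA-256 are all assumptions made for illustration. Note that a read-back of a small file may be served from the OS page cache instead of the disk, so a real test would likely need a file larger than RAM (or a remount or reboot between the write and the read).

```python
import hashlib
import os

CHUNK = 1024 * 1024  # write and read in 1 MiB pieces (arbitrary choice)


def write_and_verify(path, size_mib=64):
    """Write size_mib MiB of random data to path, read it back, compare hashes.

    Returns True if the data read back matches what was written.
    """
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            chunk = os.urandom(CHUNK)
            written.update(chunk)
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            read_back.update(chunk)
    return written.digest() == read_back.digest()


def run_trials(directory, trials=20, size_mib=64):
    """Repeat the check several times; return how many trials failed."""
    failures = 0
    path = os.path.join(directory, "disk_verify.tmp")  # hypothetical file name
    for _ in range(trials):
        if not write_and_verify(path, size_mib):
            failures += 1
        os.remove(path)
    return failures


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        print("failed trials:", run_trials(d, trials=5, size_mib=8))
```

On healthy hardware the failure count should stay at zero no matter how many trials run; an intermittent failure rate like the 1-5% mentioned above would show up as a nonzero count over enough trials.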