Thread: DB failure?

DB failure?

From
Wes Palmer
Date:
PostgreSQL: 7.4.1

Last week, I had a corrupt index on one table with 2 million rows.  On a
specific search, the database would SEGV.  I dropped and recreated the index
involved in the search, and did a REINDEX on the primary key.  That problem
went away.

Now I'm seeing:

db=> select count(*) from messages;
ERROR:  could not access status of transaction 859000513
DETAIL:  could not open file "/db/pgsql/data/pg_clog/0333": No such file or
directory

db=> select count(*) from message_recipients;
ERROR:  invalid page header in block 1238604 of relation
"message_recipients"


The above commands were successful on 8/21.

There are 240 million rows.  Dump/reload is not something that would be an
attractive option.  Based on previous timings, it could take 2-3 days (this
is with a dual hyper-threaded 2.4 Ghz with 2GB memory and an 8 drive RAID 5)
on a production system.

So far, this database has been all INSERTs - no deletes or updates.

Is there a way to recover this without a dump/reload?  We do nightly
backups, but since we don't know when the problem really started, it would
be rather difficult to restore and reapply  several million updates.

Wes


Re: DB failure?

From
Wes
Date:
On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegra.com> wrote:

> db=> select count(*) from messages;
> ERROR:  could not access status of transaction 859000513
> DETAIL:  could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> directory
>
> db=> select count(*) from message_recipients;
> ERROR:  invalid page header in block 1238604 of relation
> "message_recipients"

Uh, oh..  This would appear to be a big problem...  I just tried to do a
pg_dumpall.  The server SEGV'd around a Gig into the pg_dumpall:

LOG:  server process (PID 12541) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2004-08-31 07:13:31 MEST
LOG:  checkpoint record is at 6E/13A916C
LOG:  redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 173895; next OID: 243689524
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 6E/13A91AC
LOG:  redo is not required
LOG:  recycled transaction log file "0000006E00000000"
LOG:  database system is ready



Re: DB failure?

From
Martijn van Oosterhout
Date:
Time to test your memory and harddisk....

On Tue, Aug 31, 2004 at 12:24:53AM -0500, Wes wrote:
> On 8/30/04 11:07 PM, "Wes Palmer" <wespvp@syntegra.com> wrote:
>
> > db=> select count(*) from messages;
> > ERROR:  could not access status of transaction 859000513
> > DETAIL:  could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> > directory
> >
> > db=> select count(*) from message_recipients;
> > ERROR:  invalid page header in block 1238604 of relation
> > "message_recipients"
>
> Uh, oh..  This would appear to be a big problem...  I just tried to do a
> pg_dumpall.  The server SEGV'd around a Gig into the pg_dumpall:
>
> LOG:  server process (PID 12541) was terminated by signal 11
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> LOG:  all server processes terminated; reinitializing
> LOG:  database system was interrupted at 2004-08-31 07:13:31 MEST
> LOG:  checkpoint record is at 6E/13A916C
> LOG:  redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
> LOG:  next transaction ID: 173895; next OID: 243689524
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> LOG:  record with zero length at 6E/13A91AC
> LOG:  redo is not required
> LOG:  recycled transaction log file "0000006E00000000"
> LOG:  database system is ready
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: DB failure?

From
Chester Kustarz
Date:
On Mon, 30 Aug 2004, Wes Palmer wrote:
> PostgreSQL: 7.4.1
>
> Last week, I had a corrupt index on one table with 2 million rows.  On a
> specific search, the database would SEGV.  I dropped and recreated the index
> involved in the search, and did a REINDEX on the primary key.  That problem
> went away.
>
> Now I'm seeing:
>
> db=> select count(*) from messages;
> ERROR:  could not access status of transaction 859000513
> DETAIL:  could not open file "/db/pgsql/data/pg_clog/0333": No such file or
> directory

I saw the above 2 types of errors (transaction status, segv) yesterday on
a box that turned out to have issues writing files correctly to disk. I
wrote a tool to write a large file to disk and then reopened the file
to read/verify the contents and it would fail every so often (like 1-5% of
the time.) What puzzles me is that the machine would work at all with an
issue like that. Turning on/off the battery backed cache had no effect.

The machine has ECC memory, but I tested that as well but it turned up
nothing. I am using PostgreSQL 7.4.1.