Thread: Do I have a corrupted database?

Do I have a corrupted database?

From
William Garrison
Date:
I fear I have a corrupted database, and I'm not sure what to do.

Environment:
    Windows Server 2003
    8GB RAM
    Dual processor, quad core 2.6Ghz
    Postgres 8.2.3 (The IT dept wants to upgrade to 8.2.9, but they are
asking me what to do about this corrupt database before they proceed)
    The database files and logs are stored on a SAN drive

2008-08-23 06:57:06 FATAL:  could not create sigchld waiter thread:
error code 1816
*** ack!  13 hour hole!  What the...?
2008-08-23 20:00:27 ERROR:  xlog flush request E0/293CF278 is not
satisfied --- flushed only to E0/21B1B7F0
2008-08-23 20:00:27 CONTEXT:  writing block 94218 of relation
16712/16713/16725
2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
permanent.
2008-08-23 20:04:36 ERROR:  xlog flush request E0/4FC5BEB8 is not
satisfied --- flushed only to E0/21B9E270
2008-08-23 20:04:36 CONTEXT:  writing block 81033 of relation
16712/16713/16725
2008-08-23 20:04:36 STATEMENT:  BEGIN TRANSACTION; ... just a normal SQL
stored proc...
2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
permanent.
2008-08-23 20:04:36 ERROR:  xlog flush request E0/314D8248 is not
satisfied --- flushed only to E0/21B9E358
2008-08-23 20:04:36 CONTEXT:  writing block 371418 of relation
16712/16713/16719
2008-08-23 20:04:36 STATEMENT:  BEGIN TRANSACTION;... just a normal SQL
stored proc...
repeats for quite a while.

A few days later, after a restart, we are seeing these showing up quite
often:

2008-08-26 11:59:42 FATAL:  the database system is starting up
2008-08-26 11:59:42 FATAL:  the database system is starting up
2008-08-26 11:59:43 FATAL:  the database system is starting up
2008-08-26 11:59:43 FATAL:  the database system is starting up
2008-08-26 11:59:43 FATAL:  the database system is starting up
2008-08-26 11:59:43 LOG:  database system is ready
2008-08-26 11:59:55 PANIC:  right sibling's left-link doesn't match
2008-08-26 11:59:55 STATEMENT:  BEGIN TRANSACTION;INSERT INTO ...SQL
scrubbed...
This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.
2008-08-26 11:59:55 LOG:  server process (PID 2228) exited with exit code 3
2008-08-26 11:59:55 LOG:  terminating any other active server processes
2008-08-26 11:59:55 LOG:  all server processes terminated; reinitializing
2008-08-26 11:59:55 LOG:  database system was interrupted at 2008-08-26
11:59:43 Pacific Daylight Time
2008-08-26 11:59:55 LOG:  checkpoint record is at E2/F88B6C0
2008-08-26 11:59:55 LOG:  redo record is at E2/F88B6C0; undo record is
at 0/0; shutdown TRUE
2008-08-26 11:59:55 LOG:  next transaction ID: 0/396816257; next OID: 58100
2008-08-26 11:59:55 LOG:  next MultiXactId: 3; next MultiXactOffset: 5
2008-08-26 11:59:55 LOG:  database system was not properly shut down;
automatic recovery in progress
2008-08-26 11:59:55 LOG:  redo starts at E2/F88B710
2008-08-26 11:59:55 LOG:  record with zero length at E2/F984928
2008-08-26 11:59:55 LOG:  redo done at E2/F9848F8
2008-08-26 11:59:55 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 FATAL:  the database system is starting up
2008-08-26 11:59:56 LOG:  database system is ready

That section is repeating over and over.  Oddly enough, the system
actually seems to be running mostly.  I need to do some diagnostics of
our app to see what is going on at that layer and what is and isn't working.

I found an article online with a similar problem, but no resolution:
http://www.mydatabasesupport.com/forums/postgresql/399079-general-failing-recover-after-panic-shutdown.html


Re: Do I have a corrupted database?

From
Craig Ringer
Date:
William Garrison wrote:
> I fear I have a corrupted database, and I'm not sure what to do.

First, make sure you have a recent backup. If your backups rotate, stop
the rotation so that all currently available historical copies of the
database are preserved from now on - just in case you need them.

Now, if possible dump your database with pg_dump. Restore the dump to a
test database instance and make sure that it all goes OK.

Once that's done, so you know you have a decent recovery point to work
from in case you make a mistake during your recovery efforts.

After that I don't have all that much to offer, especially as you're
using an operating system I don't have much experience with Pg on and
you're using an (unspecified) SAN.

Normally I'd ask if you'd verified your RAID array / tested your disks.
In this case, I'm wondering if there's any chance there was a service
interruption on the SAN that might've caused some sort of intermittent
or partial writes.

> 2008-08-23 20:00:27 ERROR:  xlog flush request E0/293CF278 is not
> satisfied --- flushed only to E0/21B1B7F0
> 2008-08-23 20:00:27 CONTEXT:  writing block 94218 of relation
> 16712/16713/16725
> 2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
> permanent.

Yeah, I'm really wondering about the SAN and SAN connection. What sort
of SAN is it? How is the host connected? Does it have any sort of
logging and monitoring that might let you see if there was a problem
around the time Pg was complaining?

Have you checked the Windows error logs?

--
Craig Ringer

Re: Do I have a corrupted database?

From
William Garrison
Date:
Craig Ringer wrote:
> William Garrison wrote:
>
>> I fear I have a corrupted database, and I'm not sure what to do.
>>
>
> First, make sure you have a recent backup. If your backups rotate, stop
> the rotation so that all currently available historical copies of the
> database are preserved from now on - just in case you need them.
>

Since I made my post, we found that we can't do a pg_dump. :(  Every
time this error appears in the logs, postgres forcably closes any
connections (including any running instances of pgadmin or pg_dump) when
it runs this little recovery process.  We have backups from some days
ago plus transaction logs.  We also have a snapshot of the file system,
and I'm hoping to find a way to attach that onto another system.  I've
had trouble with that in the past.

As for the SAN and the Windows event log: Our IT guy says the SAN
reported no failures at the time.  I don't know much about the SAN
itself, I just know it uses dual fiber-channels and all the drives are
in some RAID array.  I think it also is hardened against nuclear strikes
and has a built-in laser defense system.  At the time of the problem,
the Windows event log indicates no problems writing to the drives, or
any other failures of any kind really.  No other apps crashed, no
unusual memory usage, plenty of disk space.  So the cause is a complete
mystery.  :(  So for now, I'm focused on repair.

We tried to REINDEX each table, and we are getting duplicate key errors
so the reindex fails.  I can fix those records manually, but I was
hoping to dump the database, find the duplicates using another system,
then delete/repair the bad records and restore onto the production
machine.  But since the backup/restore isn't working, that isn't looking
like a viable option.

Are there any kind of repair tools for a postgres database?  Any sort of
routine where I can take it offline and run like pg_fsck --all and it
will come back with a report or a repair procedure?
> Now, if possible dump your database with pg_dump. Restore the dump to a
> test database instance and make sure that it all goes OK.
>
> Once that's done, so you know you have a decent recovery point to work
> from in case you make a mistake during your recovery efforts.
>
> After that I don't have all that much to offer, especially as you're
> using an operating system I don't have much experience with Pg on and
> you're using an (unspecified) SAN.
>
> Normally I'd ask if you'd verified your RAID array / tested your disks.
> In this case, I'm wondering if there's any chance there was a service
> interruption on the SAN that might've caused some sort of intermittent
> or partial writes.
>
>
>> 2008-08-23 20:00:27 ERROR:  xlog flush request E0/293CF278 is not
>> satisfied --- flushed only to E0/21B1B7F0
>> 2008-08-23 20:00:27 CONTEXT:  writing block 94218 of relation
>> 16712/16713/16725
>> 2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
>> permanent.
>>
>
> Yeah, I'm really wondering about the SAN and SAN connection. What sort
> of SAN is it? How is the host connected? Does it have any sort of
> logging and monitoring that might let you see if there was a problem
> around the time Pg was complaining?
>
> Have you checked the Windows error logs?
>
> --
> Craig Ringer
>
>


Re: Do I have a corrupted database?

From
Martijn van Oosterhout
Date:
On Wed, Aug 27, 2008 at 01:45:43PM -0400, William Garrison wrote:
> Since I made my post, we found that we can't do a pg_dump. :(  Every
> time this error appears in the logs, postgres forcably closes any
> connections (including any running instances of pgadmin or pg_dump) when
> it runs this little recovery process.  We have backups from some days
> ago plus transaction logs.  We also have a snapshot of the file system,
> and I'm hoping to find a way to attach that onto another system.  I've
> had trouble with that in the past.

You're going to have to be more specific. What do you mean by "this
error"? It is possible to startup postgresql such that it will not use
any system indexes.

> Are there any kind of repair tools for a postgres database?  Any sort of
> routine where I can take it offline and run like pg_fsck --all and it
> will come back with a report or a repair procedure?

There is no tools that do fixing, only the DB server itself. If you
can't get it to work within postgresql, then pgfsck can attempt to do a
raw data dump. It doesn't guarentee the integrity of the data but it
may be able to get your data out.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: Do I have a corrupted database?

From
William Garrison
Date:
Wait... there really is a pgfsck...?  I just made that up as an example of something I wanted.  Great!  And... how would I tell postgres to start without using any indexes?

Martijn van Oosterhout wrote:
On Wed, Aug 27, 2008 at 01:45:43PM -0400, William Garrison wrote: 
Are there any kind of repair tools for a postgres database?  Any sort of 
routine where I can take it offline and run like pg_fsck --all and it 
will come back with a report or a repair procedure?   
You're going to have to be more specific. What do you mean by "this
error"? It is possible to startup postgresql such that it will not use
any system indexes.

There is no tools that do fixing, only the DB server itself. If you
can't get it to work within postgresql, then pgfsck can attempt to do a
raw data dump. It doesn't guarentee the integrity of the data but it
may be able to get your data out.

Have a nice day, 

Re: Do I have a corrupted database?

From
Craig Ringer
Date:
William Garrison wrote:
> Wait... there really is a pgfsck...?  I just made that up as an example
> of something I wanted.

I had no idea either, but it does look like it:

http://svana.org/kleptog/pgsql/pgfsck.html

It's a perl script, so you may have a decent chance of getting it going
on win32. It doesn't appear to officially support 8.3, but that doesn't
look like it'll be an issue for you.

Interesting that it doesn't appear to be on pgfoundry:

http://pgfoundry.org/search/?type_of_search=soft&words=fsck&Search=Search

http://pgfoundry.org/search/?type_of_search=soft&words=pgfsck&Search=Search

>  Great!  And... how would I tell postgres to
> start without using any indexes?

http://www.postgresql.org/docs/8.2/static/app-postgres.html

On Windows, you'd probably stop the PostgreSQL service then invoke the
PostgreSQL server (postgres.exe) manually from a cmd.exe shell, using
runas.exe to run it under the postgres user ID.

If that's right, it'd be nice if you could reply with the exact command
line you land up using so the documentation can be updated to show the
appropriate one-liner for Windows users to put the server in recovery mode.

--
Craig Ringer