Re: Do I have a corrupted database? - Mailing list pgsql-general

From William Garrison
Subject Re: Do I have a corrupted database?
Date
Msg-id 48B592C7.5020900@mobydisk.com
In response to Re: Do I have a corrupted database?  (Craig Ringer <craig@postnewspapers.com.au>)
Responses Re: Do I have a corrupted database?  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-general
Craig Ringer wrote:
> William Garrison wrote:
>
>> I fear I have a corrupted database, and I'm not sure what to do.
>>
>
> First, make sure you have a recent backup. If your backups rotate, stop
> the rotation so that all currently available historical copies of the
> database are preserved from now on - just in case you need them.
>

Since I made my post, we found that we can't do a pg_dump. :(  Every
time this error appears in the logs, postgres forcibly closes any
connections (including any running instances of pgadmin or pg_dump) when
it runs this little recovery process.  We have backups from some days
ago plus transaction logs.  We also have a snapshot of the file system,
and I'm hoping to find a way to attach that onto another system.  I've
had trouble with that in the past.

As for the SAN and the Windows event log: Our IT guy says the SAN
reported no failures at the time.  I don't know much about the SAN
itself; I just know it uses dual Fibre Channel links and all the drives
are in some RAID array.  I think it also is hardened against nuclear strikes
and has a built-in laser defense system.  At the time of the problem,
the Windows event log indicates no problems writing to the drives, or
any other failures of any kind really.  No other apps crashed, no
unusual memory usage, plenty of disk space.  So the cause is a complete
mystery.  :(  So for now, I'm focused on repair.

We tried to REINDEX each table, and we are getting duplicate key errors
so the reindex fails.  I can fix those records manually, but I was
hoping to dump the database, find the duplicates using another system,
then delete/repair the bad records and restore onto the production
machine.  But since the backup/restore isn't working, that isn't looking
like a viable option.
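
For the "find the duplicates on another system" step, here is a minimal
sketch of the kind of check I have in mind.  It assumes the rows can be
copied into some scratch database first; the table name "customers", the
key column "id", and the use of an in-memory SQLite database as the scratch
copy are all hypothetical stand-ins, not our real schema.  The same
GROUP BY / HAVING query shape should work in postgres as well, though on
the damaged server it would presumably have to avoid the broken index.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_column):
    """Return (key, count) pairs for key values that occur more than once."""
    # Identifiers cannot be bound as query parameters, so they are quoted
    # by hand here; only pass trusted table and column names.
    query = (
        f'SELECT "{key_column}", COUNT(*) AS n '
        f'FROM "{table}" '
        f'GROUP BY "{key_column}" '
        f'HAVING COUNT(*) > 1 '
        f'ORDER BY "{key_column}"'
    )
    return conn.execute(query).fetchall()


# Hypothetical scratch copy: no primary key on purpose, so that duplicate
# key values can be loaded the way a damaged table might contain them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "alice"),
        (2, "bob"),
        (2, "bob (stale copy)"),
        (3, "carol"),
        (3, "carol"),
        (3, "carol"),
    ],
)

duplicates = find_duplicate_keys(conn, "customers", "id")
print(duplicates)  # [(2, 2), (3, 3)]
```

Each pair is a key value and how many rows carry it, which gives a list of
records to inspect by hand before deleting or repairing anything.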

Is there any kind of repair tool for a postgres database?  Any sort of
routine where I can take it offline, run something like pg_fsck --all,
and have it come back with a report or a repair procedure?

> Now, if possible dump your database with pg_dump. Restore the dump to a
> test database instance and make sure that it all goes OK.
>
> Once that's done, so you know you have a decent recovery point to work
> from in case you make a mistake during your recovery efforts.
>
> After that I don't have all that much to offer, especially as you're
> using an operating system I don't have much experience with Pg on and
> you're using an (unspecified) SAN.
>
> Normally I'd ask if you'd verified your RAID array / tested your disks.
> In this case, I'm wondering if there's any chance there was a service
> interruption on the SAN that might've caused some sort of intermittent
> or partial writes.
>
>
>> 2008-08-23 20:00:27 ERROR:  xlog flush request E0/293CF278 is not
>> satisfied --- flushed only to E0/21B1B7F0
>> 2008-08-23 20:00:27 CONTEXT:  writing block 94218 of relation
>> 16712/16713/16725
>> 2008-08-23 20:04:36 DETAIL:  Multiple failures --- write error may be
>> permanent.
>>
>
> Yeah, I'm really wondering about the SAN and SAN connection. What sort
> of SAN is it? How is the host connected? Does it have any sort of
> logging and monitoring that might let you see if there was a problem
> around the time Pg was complaining?
>
> Have you checked the Windows error logs?
>
> --
> Craig Ringer
>
>

