Thread: bad block problem
It appears that I have a bad block/disk sector, which is preventing me from retrieving the rows from this table. All other tables within this database are fine.

In preparation for zeroing out the bad block, I tried to do a cold backup/copy:

    cp -r * ../data2/

and received the following from cp:

    cp: base/9221176/9221183: I/O error

So I set the parameter zero_damaged_pages=true and tried running vacuum on the table:

    vacuum full docs;
    ERROR: could not read block 67680 of relation base/9221176/9221183: I/O error

Next I looked at the directories in /data/base/9221176:

    -bash-3.00$ ls 9221183*
    9221183  9221183.1  9221183.2  9221183_fsm  9221183_vm

I am trying to find the file that contains the bad block. Calculating the chunk number gives me 9221183/131072 = 70.352043152, but this .xx doesn't exist. Also, I'm not sure what the _fsm or _vm file is, or how to proceed.

I do not have a recent backup of this database/table. It contains a bytea field which stores documents and is quite large compared to the rest of the database. Any help would be appreciated. Thanks
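[Editor's note: the segment calculation above divides the relfilenode (9221183) by the segment size; it should use the failing block number from the error message (67680) instead. With PostgreSQL's default 8 kB block size, each 1 GB segment file holds 131072 blocks, so segment 0 is the bare file `9221183`, segment 1 is `9221183.1`, and so on. The `_fsm` and `_vm` files are the free space map and visibility map forks, not table data. A sketch of the arithmetic, assuming the default block and segment sizes:]

```python
BLOCK_SIZE = 8192            # default PostgreSQL block size (BLCKSZ)
BLOCKS_PER_SEGMENT = 131072  # 1 GB segment / 8 kB block

bad_block = 67680            # block number from the VACUUM error message

segment = bad_block // BLOCKS_PER_SEGMENT               # which segment file
offset_in_segment = (bad_block % BLOCKS_PER_SEGMENT) * BLOCK_SIZE

# Segment 0 is the bare file "9221183"; segment N is "9221183.N"
filename = "9221183" if segment == 0 else "9221183.%d" % segment

print(filename, offset_in_segment)
```

Since 67680 < 131072, the bad block is in the main file `9221183`, at byte offset 67680 * 8192 = 554434560.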
On Wed, 07 Dec 2011 21:22:05 +0000, jkells wrote:

> I do not have a recent backup of this database/table
>
> Any help would be appreciated.

Here's some help: Next time you establish a database, set up and test the backup regime.

We hear this tale of woe time and time again. I have *no* sympathy.
On Wed, 07 Dec 2011 22:09:23 +0000, Walter Hurry wrote:

> On Wed, 07 Dec 2011 21:22:05 +0000, jkells wrote:
>
>> I do not have a recent backup of this database/table
>>
>> Any help would be appreciated.
>
> Here's some help: Next time you establish a database, set up and test
> the backup regime.
>
> We hear this tale of woe time and time again. I have *no* sympathy.

I'm not asking for sympathy, just stating the fact that there isn't a current backup and that I am relying on identifying and correcting a bad block. The question to the group is for help in understanding how to identify a bad block from the information I was able to obtain.
jkells <jtkells@verizon.net> wrote:

> I tried to do a cold backup/copy
>
> cp -r * ../data2/
>
> and received the following from cp
>
> cp: base/9221176/9221183: I/O error

That sounds like your storage system is failing, quite independently from PostgreSQL. Copy the entire data directory tree to some other medium immediately, and preserve this copy. If you hit bad blocks, retry if possible. If you just can't read some portions of it, you need to get what you can, and make notes of any garbage or missed portions of files. Use the copy as a source to copy onto a reliable storage system.

Without knowing more about what sort of storage system you're talking about, it is hard to give advice or predict whether it might be fixable somehow. If you try to run a database on failing hardware, it will not be a pleasant experience.

-Kevin
On Wed, 07 Dec 2011 22:20:30 +0000, jkells wrote:

> I am relying on identifying and correcting a bad block.

Well, good luck with that. Most of the time you can't. Just check your disk, replace it if necessary, restore from your backup and roll forward.

Oh, you can't do that, since you didn't bother to back up. Never mind.
On 12/08/2011 08:20 AM, Walter Hurry wrote:

> On Wed, 07 Dec 2011 22:20:30 +0000, jkells wrote:
>
>> I am relying on identifying and correcting a bad block.
>
> Well, good luck with that. Most of the time you can't. Just check your
> disk, replace it if necessary, restore from your backup and roll forward.
>
> Oh, you can't do that, since you didn't bother to back up. Never mind.

Unless you're using synchronous replication to clone *every* transaction on commit to a spare machine, you'll still lose transactions on a failure no matter how good your backups are. Even if the OP was doing nightly dumps, they'd be entirely justified in wanting to try to get a more recent dump on failure.

If they're not backing up at all, yes, that was dumb, but they know that now. Asking for help isn't unreasonable, and this isn't a stupid "just google it" question. They've made an effort, posted useful info and log output, etc. Please don't be too hard on them.

--
Craig Ringer
On 12/08/2011 07:41 AM, Kevin Grittner wrote:

> That sounds like your storage system is failing, quite independently
> from PostgreSQL. Copy the entire data directory tree to some other
> medium immediately, and preserve this copy. If you hit bad blocks,
> retry if possible.

If you find files you can't copy in their entirety, try using dd_rescue to copy them with holes where the bad blocks are. dd_rescue is an _incredibly_ useful tool for this, as it'll do bad-block-tolerant copies quickly and efficiently.

Once you have a complete copy of your datadir, stop working on the faulty machine. Make your first copy read-only. Duplicate the copy and work on the duplicate when trying to restore.

I'd start with enabling zero_damaged_pages to see if you can get a dump that way. Do **NOT** enable zero_damaged_pages on the original. Do it on the duplicate of the copied data.

--
Craig Ringer
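[Editor's note: the bad-block-tolerant copy dd_rescue performs amounts to reading the source in fixed-size chunks and writing zero-filled "holes" wherever a read fails, instead of aborting the way cp does. A rough illustrative sketch of that idea in Python; on real failing media use dd_rescue itself, which also handles retries, direct I/O, and device-level quirks:]

```python
import os

CHUNK = 8192  # copy in PostgreSQL-page-sized chunks

def rescue_copy(src_path, dst_path):
    """Copy src to dst, zero-filling any chunk that cannot be read.

    Returns the list of chunk indexes that had to be zero-filled,
    so you can note which parts of the copy are holes.
    """
    bad_chunks = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            n = min(CHUNK, size - offset)
            try:
                src.seek(offset)
                data = src.read(n)
            except OSError:          # I/O error on a bad sector
                data = b"\x00" * n   # leave a zeroed hole and move on
                bad_chunks.append(offset // CHUNK)
            dst.write(data)
            offset += n
    return bad_chunks
```

On a healthy file this returns an empty list and produces an identical copy; on a failing one it records exactly which chunks were lost.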
On Thu, 08 Dec 2011 09:02:15 +0800, Craig Ringer wrote:

> If they're not backing up at all, yes, that was dumb, but they know that
> now. Asking for help isn't unreasonable, and this isn't a stupid "just
> google it" question. They've made an effort, posted useful info and log
> output, etc. Please don't be too hard on them.
>
> --
> Craig Ringer

For those who replied with suggestions, I appreciate your time and effort. For the others: the reasons backups were not done can be many, from just not doing it, to a faulty backup process, lack of space, PIT windows, not being allowed to back up the data, and the list can go on and on. Under a no-backup or insufficient-backup policy, we are all aware of the implications and understand the risk. I have no problem working under this policy and other fully functional operational standard policies. As long as your management is aware, then issues like this become a trade-off and you have done your due diligence.

I simply wanted to understand the methods of zeroing out a block in a database file, since I was not sure how to interpret the results from following some procedures and write-ups. If successful, great; if not, then we move to the next step of recovery.
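[Editor's note: for reference, manually zeroing a damaged page in a relation segment file comes down to seeking to block * 8192 and overwriting one block with zeroes; PostgreSQL treats an all-zero page as empty, which is also what zero_damaged_pages does in memory when it hits a corrupt page. A hedged sketch, assuming the default 8 kB block size; only ever run something like this against a duplicate of the copied data, never the original:]

```python
BLOCK_SIZE = 8192  # default PostgreSQL page size

def zero_page(path, block):
    """Overwrite one page of a relation segment file with zeroes, in place."""
    with open(path, "r+b") as f:
        f.seek(block * BLOCK_SIZE)
        f.write(b"\x00" * BLOCK_SIZE)

# e.g., on a duplicate of the datadir copy (hypothetical path):
# zero_page("/rescue/work/base/9221176/9221183", 67680)
```

Note that `block` here is the block number *within that segment file*; for blocks beyond 131071 you would first subtract 131072 per `.N` suffix, per the segment arithmetic discussed earlier in the thread.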