Thread: bad block problem

From: jkells
It appears that I have a bad block or disk sector, which is preventing
me from retrieving the rows from this table.  All other tables within
this database are fine.
In preparation for zeroing out the bad block I tried to do a cold backup/copy
cp  -r * ../data2/

and received the following from cp
cp: base/9221176/9221183: I/O error

So I set the parameter zero_damaged_pages=true
and tried running vacuum on the table
vacuum full docs;
ERROR:  could not read block 67680 of relation base/9221176/9221183: I/O error
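
A sketch of the per-session form, since zero_damaged_pages is a
superuser-only setting (the database name here is illustrative; the
table name is from above):

PGOPTIONS="-c zero_damaged_pages=on" psql -d mydb -c "VACUUM FULL docs;"

Note that zero_damaged_pages only zeroes pages PostgreSQL can read but
finds corrupt; a read that fails at the kernel level, as here, raises
the same I/O error regardless.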


Next I looked at the files in /data/base/9221176
-bash-3.00$ ls 9221183*
9221183      9221183.1    9221183.2    9221183_fsm  9221183_vm

I am trying to find the file that contains the bad block.  Calculating
the segment number gives me 9221183 / 131072 = 70.352043152, but no
segment file with that suffix exists.
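
For reference, the segment arithmetic uses the block number from the
error message rather than the relfilenode; a sketch, assuming the
default 8 kB block size and 1 GB (131072-block) segments:

echo $(( 67680 / 131072 ))           # segment 0, i.e. the file 9221183 itself
echo $(( (67680 % 131072) * 8192 ))  # byte offset 554434560 within that file
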
Also, I'm not sure what the _fsm or _vm files are, or how to proceed.
I do not have a recent backup of this database/table.  It contains a
bytea field which stores documents and is quite large compared to the
rest of the database.  Any help would be appreciated.


Thanks

Re: bad block problem

From: Walter Hurry
On Wed, 07 Dec 2011 21:22:05 +0000, jkells wrote:

> I do not have a recent backup of this database/table
>
> Any help would be appreciated.

Here's some help: Next time you establish a database, set up and test the
backup regime.

We hear this tale of woe time and time again. I have *no* sympathy.


Re: bad block problem

From: jkells
On Wed, 07 Dec 2011 22:09:23 +0000, Walter Hurry wrote:

> On Wed, 07 Dec 2011 21:22:05 +0000, jkells wrote:
>
>> I do not have a recent backup of this database/table
>>
>> Any help would be appreciated.
>
> Here's some help: Next time you establish a database, set up and test
> the backup regime.
>
> We hear this tale of woe time and time again. I have *no* sympathy.

I am not asking for sympathy, just stating the fact that there isn't a
current backup and that I am relying on identifying and correcting a
bad block.  My question to the group is for help in understanding how
to identify the bad block from the information I was able to obtain.

Re: bad block problem

From: "Kevin Grittner"
jkells <jtkells@verizon.net> wrote:

> I tried to do a cold backup/copy
> cp  -r * ../data2/
>
> and received the following from cp
> cp: base/9221176/9221183: I/O error

That sounds like your storage system is failing, quite independently
from PostgreSQL.  Copy the entire data directory tree to some other
medium immediately, and preserve this copy.  If you hit bad blocks,
retry if possible.  If you just can't read some portions of it, you
need to get what you can, and make notes of any garbage or missed
portions of files.  Use the copy as a source to copy onto a reliable
storage system.
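
A minimal sketch of that rescue copy, assuming GNU cp and an
illustrative destination on known-good media:

pg_ctl -D /data stop -m fast
cp -a /data /mnt/rescue/data-copy 2> /mnt/rescue/copy-errors.log
# cp logs each file it cannot read in full and moves on; those
# files are the ones needing block-level rescue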

Without knowing more about what sort of storage system you're
talking about, it is hard to give advice or to predict whether it
might be fixable somehow.  If you try to run a database on failing
hardware, it will not be a pleasant experience.

-Kevin

Re: bad block problem

From: Walter Hurry
On Wed, 07 Dec 2011 22:20:30 +0000, jkells wrote:

> I am relying on identifying and correcting a bad block.

Well, good luck with that. Most of the time you can't. Just check your
disk, replace it if necessary, restore from your backup and roll forward.

Oh, you can't do that, since you didn't bother to back up. Never mind.


Re: bad block problem

From: Craig Ringer
On 12/08/2011 08:20 AM, Walter Hurry wrote:
> On Wed, 07 Dec 2011 22:20:30 +0000, jkells wrote:
>
>> I am relying on identifying and correcting a bad block.
>
> Well, good luck with that. Most of the time you can't. Just check your
> disk, replace it if necessary, restore from your backup and roll forward.
>
> Oh, you can't do that, since you didn't bother to back up. Never mind.

Unless you're using synchronous replication to clone *every* transaction
on commit to a spare machine, you'll still lose transactions on a
failure no matter how good your backups are.

Even if the OP was doing nightly dumps, they'd be entirely justified in
wanting to try to get a more recent dump on failure.

If they're not backing up at all, yes, that was dumb, but they know that
now. Asking for help isn't unreasonable, and this isn't a stupid "just
google it" question. They've made an effort, posted useful info and log
output, etc. Please don't be too hard on them.

--
Craig Ringer

Re: bad block problem

From: Craig Ringer
On 12/08/2011 07:41 AM, Kevin Grittner wrote:

> That sounds like your storage system is failing, quite independently
> from PostgreSQL.  Copy the entire data directory tree to some other
> medium immediately, and preserve this copy.  If you hit bad blocks,
> retry if possible.

If you find files you can't copy in their entirety, try using dd_rescue
to copy them, leaving a hole where each bad block was. dd_rescue is an
_incredibly_ useful tool for this, as it does bad-block-tolerant copies
quickly and efficiently.
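
A sketch with an illustrative destination, using the file named in the
error message earlier in the thread:

cd /data
mkdir -p /mnt/rescue/base/9221176
dd_rescue base/9221176/9221183 /mnt/rescue/base/9221176/9221183
# unlike cp, dd_rescue carries on past read errors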

Once you have a complete copy of your datadir, stop working on the
faulty machine. Make your first copy read-only. Duplicate the copy and
work on the duplicate when trying to restore. I'd start with enabling
zero_damaged_pages to see if you can get a dump that way.

Do **NOT** enable zero_damaged_pages on the original. Do it on the
duplicate of the copied data.
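
A sketch of that dump attempt, run against the duplicate only; the
duplicate's path, the port, and the database name are illustrative:

pg_ctl -D /mnt/work/data-dup -o "-p 5433" start
PGOPTIONS="-c zero_damaged_pages=on" pg_dump -p 5433 mydb > mydb.dump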

--
Craig Ringer

Re: bad block problem

From: jkells
On Thu, 08 Dec 2011 09:02:15 +0800, Craig Ringer wrote:

> On 12/08/2011 08:20 AM, Walter Hurry wrote:
>> On Wed, 07 Dec 2011 22:20:30 +0000, jkells wrote:
>>
>>> I am relying on identifying and correcting a bad block.
>>
>> Well, good luck with that. Most of the time you can't. Just check your
>> disk, replace it if necessary, restore from your backup and roll
>> forward.
>>
>> Oh, you can't do that, since you didn't bother to back up. Never mind.
>
> Unless you're using synchronous replication to clone *every* transaction
> on commit to a spare machine, you'll still lose transactions on a
> failure no matter how good your backups are.
>
> Even if the OP was doing nightly dumps, they'd be entirely justified in
> wanting to try to get a more recent dump on failure.
>
> If they're not backing up at all, yes, that was dumb, but they know that
> now. Asking for help isn't unreasonable, and this isn't a stupid "just
> google it" question. They've made an effort, posted useful info and log
> output, etc. Please don't be too hard on them.
>
> --
> Craig Ringer

For those who replied with suggestions, I appreciate your time and
effort.  For the others: the reasons backups were not done can be many,
from simply not doing them, to a faulty backup process, lack of space,
point-in-time (PIT) windows, or not being allowed to back up the data,
and the list goes on.  Under a no-backup or insufficient-backup policy,
we are all aware of the implications and understand the risk.  I have
no problem working under this policy and other fully functional
operational standard policies.  As long as your management is aware,
issues like this become a trade-off and you have done your due
diligence.  I simply wanted to understand the methods of zeroing out a
block in a database file, since I was not sure how to interpret the
results from following some procedures and write-ups.  If successful,
great; if not, we move on to the next step of recovery.