Thread: files under pg_clog directories are missing

files under pg_clog directories are missing

From
Dinesh Bhandary
Date:
  Hello Everybody -

Due to hardware crash we ran into issues where some blocks were
corrupted and some files were missing.
I was able to get over the corrupted blocks ( errmsg - "invalid page
header in block 12345  of realtion x") by setting
zero_damaged_pages=0 and running vacuum afterwards. Now I am running
into situations where pg_clog files are missing (errmsg - "could not
open pg_clog/0D0D). I have a backup which is quite old ( considering
this as a last resort). Is there any other way to fix this problem?

I also created empty blocks to fool postgres, but there are so many of
these file missing I was wondering if there a better/faster way to fix
this problem.

Thanks.

Dinesh

Re: files under pg_clog directories are missing

From
Bruce Momjian
Date:
Dinesh Bhandary wrote:
>   Hello Everybody -
>
> Due to hardware crash we ran into issues where some blocks were
> corrupted and some files were missing.
> I was able to get over the corrupted blocks ( errmsg - "invalid page
> header in block 12345  of realtion x") by setting
> zero_damaged_pages=0 and running vacuum afterwards. Now I am running
> into situations where pg_clog files are missing (errmsg - "could not
> open pg_clog/0D0D). I have a backup which is quite old ( considering
> this as a last resort). Is there any other way to fix this problem?

If those clog numbers are not in your usual range, it usually means the
row is corrupt and is referring to an corrupt xid number, so the row is
bad, not the clog files are missing.

>
> I also created empty blocks to fool postgres, but there are so many of
> these file missing I was wondering if there a better/faster way to fix
> this problem.

Seems you will have to clear those pages too somehow.  I have seen it
done with dd.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: files under pg_clog directories are missing

From
Greg Smith
Date:
Dinesh Bhandary wrote:
> Due to hardware crash we ran into issues where some blocks were
> corrupted and some files were missing.
> I was able to get over the corrupted blocks ( errmsg - "invalid page
> header in block 12345  of realtion x") by setting
> zero_damaged_pages=0 and running vacuum afterwards. Now I am running
> into situations where pg_clog files are missing (errmsg - "could not
> open pg_clog/0D0D). I have a backup which is quite old ( considering
> this as a last resort). Is there any other way to fix this problem?
>
> I also created empty blocks to fool postgres, but there are so many of
> these file missing I was wondering if there a better/faster way to fix
> this problem.

I hope you made a whole backup of the files before you started trying to
fix the damage too.  It's possible to try and fix this using tricks like
zero_damaged_pages and dummy clog files, only to make things worse.  We
do data recovery services here, and I've had to roll back to the
original copy of the data multiple times before in order to try
different things before getting a get good copy of someone's data back
again.  If you don't have a copy of the database yet, do that before you
do any more experimenting with the clog files.

I wrote a summary of links to past work like this you may find useful,
and a little program to create missing pg_clog files that all say "the
transaction you're asking about committed", available at:
http://archives.postgresql.org/pgsql-general/2009-07/msg00985.php

You might further script that to speed up how fast you can fix these as
they pop up, which makes the test/correct cycle time go down.  You might
even write a script that loops over starting the database, looks at the
end of the log file, and if it's yet another one of these missing just
extract its number, recreate it, and start again.

Unfortunately, doing better than that is tricky.  We had to modify the
PostgreSQL source code to automatically create them in order to handle
this safely last time I ran into one of these that was badly corrupted
and missing a whole lot of real clog files, not just ones that were
unlikely to exist.  You should be staring at the numbers of each one of
these as they're requested.  If the range is way outside of the active
clog files you have, that's probably one you can create safely because
it's garbage data anyway.  But if it starts asking for clog files that
are in the middle or near the ends of the set you've got, you may have a
bigger problem on your hands.

P.S. Make sure you dump a whole copy of the database the minute you get
it started again and reload that before you start using it.  You have no
idea what state all of the tables are really in after a crash like this
without such an exercise.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: files under pg_clog directories are missing

From
Scott Marlowe
Date:
As a followup to all this, the OP also needs to figure out why a
crashed server could cause so much corruption.  This is usually the
result of hardware lying about fsync status, i.e. consumer grade
hardware pressed into service as a server.

Re: files under pg_clog directories are missing

From
Greg Smith
Date:
Scott Marlowe wrote:
> As a followup to all this, the OP also needs to figure out why a
> crashed server could cause so much corruption.  This is usually the
> result of hardware lying about fsync status, i.e. consumer grade
> hardware pressed into service as a server.
>

See http://www.postgresql.org/docs/9.0/static/wal-reliability.html for
more information.  Nothing there has changed in 9.0 relative to the
version being run here, but the docs for that version got a big upgrade
over the earlier ones.

Bad memory is always a potential source for issues in this area too.
Mostly if you don't know why the crash happened, either.  If you know
for sure how it crashed, but only then found corruption, Scott's right
that it's more likely to be a lying disk drive instead.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: files under pg_clog directories are missing

From
Dinesh Bhandary
Date:
  Good reminder to backup before you start. Yes, I've made the backup of
the db in its current state.

After reading your posts and doing a slew of research I am inclining
more towards restoring from the last successful dump and reload data
from applications (it seems like it is going to be partial recovery).  I
am concerned that even if we are able to clear all those messages by
whatever means the state of the database will be corrupted. Luckily this
happens to be ready only db and we can live with it.

As for the nature of the corruption I still do know know what kind of
hardware problems led to this; it happened at one of our clients site
and we are still waiting to find out what caused it.  One piece of info
we got was postgres data directory turned into read only partition.

Thanks everyone.

Dinesh

ps@ the ids of the missing clog_files are out of range.



On 9/17/2010 8:32 PM, Greg Smith wrote:
> Dinesh Bhandary wrote:
>> Due to hardware crash we ran into issues where some blocks were
>> corrupted and some files were missing.
>> I was able to get over the corrupted blocks ( errmsg - "invalid page
>> header in block 12345  of realtion x") by setting
>> zero_damaged_pages=0 and running vacuum afterwards. Now I am running
>> into situations where pg_clog files are missing (errmsg - "could not
>> open pg_clog/0D0D). I have a backup which is quite old ( considering
>> this as a last resort). Is there any other way to fix this problem?
>>
>> I also created empty blocks to fool postgres, but there are so many
>> of these file missing I was wondering if there a better/faster way to
>> fix this problem.
>
> I hope you made a whole backup of the files before you started trying
> to fix the damage too.  It's possible to try and fix this using tricks
> like zero_damaged_pages and dummy clog files, only to make things
> worse.  We do data recovery services here, and I've had to roll back
> to the original copy of the data multiple times before in order to try
> different things before getting a get good copy of someone's data back
> again.  If you don't have a copy of the database yet, do that before
> you do any more experimenting with the clog files.
>
> I wrote a summary of links to past work like this you may find useful,
> and a little program to create missing pg_clog files that all say "the
> transaction you're asking about committed", available at:
> http://archives.postgresql.org/pgsql-general/2009-07/msg00985.php
>
> You might further script that to speed up how fast you can fix these
> as they pop up, which makes the test/correct cycle time go down.  You
> might even write a script that loops over starting the database, looks
> at the end of the log file, and if it's yet another one of these
> missing just extract its number, recreate it, and start again.
>
> Unfortunately, doing better than that is tricky.  We had to modify the
> PostgreSQL source code to automatically create them in order to handle
> this safely last time I ran into one of these that was badly corrupted
> and missing a whole lot of real clog files, not just ones that were
> unlikely to exist.  You should be staring at the numbers of each one
> of these as they're requested.  If the range is way outside of the
> active clog files you have, that's probably one you can create safely
> because it's garbage data anyway.  But if it starts asking for clog
> files that are in the middle or near the ends of the set you've got,
> you may have a bigger problem on your hands.
>
> P.S. Make sure you dump a whole copy of the database the minute you
> get it started again and reload that before you start using it.  You
> have no idea what state all of the tables are really in after a crash
> like this without such an exercise.
>


Re: files under pg_clog directories are missing

From
Bruce Momjian
Date:
Dinesh Bhandary wrote:
>   Good reminder to backup before you start. Yes, I've made the backup of
> the db in its current state.
>
> After reading your posts and doing a slew of research I am inclining
> more towards restoring from the last successful dump and reload data
> from applications (it seems like it is going to be partial recovery).  I
> am concerned that even if we are able to clear all those messages by
> whatever means the state of the database will be corrupted. Luckily this

True.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: files under pg_clog directories are missing

From
Greg Smith
Date:
Dinesh Bhandary wrote:
> As for the nature of the corruption I still do know know what kind of
> hardware problems led to this; it happened at one of our clients site
> and we are still waiting to find out what caused it.  One piece of
> info we got was postgres data directory turned into read only partition.

That can happen when a major disk-level problem occurs; system remounts
as read-only because it doesn't think it can safely write to it
anymore.  I'd check into the system kernel logs as soon as possible, to
try and find a disk error that kicked the whole thing off.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book