Thread: Database corruption.

From: Michael Guerin
Hi,

   Our database filled up and now I'm getting this error on one of the
tables.  Is there any way to recover from this?  Please let me know if
more information is needed.

pg_version                                   version
--------------------------------------------------------------------------------
 PostgreSQL 8.1RC1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.0.1
(1 row)

I've restarted the database and dropped the index on the table;
recreating it gives me the same error:

db1=# create index idx_timeseries on tbltimeseries(timeseriesid)
tablespace tblspc_index;
ERROR:  could not access status of transaction 3376354368
DETAIL:  could not open file "pg_clog/0324": No such file or directory
db1=#

-Michael


Re: Database corruption.

From: Michael Guerin
Also, all files in pg_clog are sequential with the last file being 0135.



Re: Database corruption.

From: Tom Lane
Michael Guerin <guerin@rentec.com> writes:
> Also, all files in pg_clog are sequential with the last file being 0135.

Hmm, that makes it sound like a plain old data-corruption problem, ie,
trashed xmin or xmax in some tuple header.  Can you do a "select
count(*)" from this table without getting the error?

>> PostgreSQL 8.1RC1 on x86_64-unknown-linux-gnu, compiled by GCC gcc
>> (GCC) 4.0.1

[ raised eyebrow... ]  You're keeping important data in an RC version
that you've not updated for more than a year?  Your faith in Postgres'
code quality is touching, but I'd counsel getting onto a more recent
release.

            regards, tom lane

Re: Database corruption.

From: Michael Guerin
> Hmm, that makes it sound like a plain old data-corruption problem, ie,
> trashed xmin or xmax in some tuple header.  Can you do a "select
> count(*)" from this table without getting the error?

No, select count(*) fails at around 25 million rows.


Re: Database corruption.

From: Tom Lane
Michael Guerin <guerin@rentec.com> writes:
>> Hmm, that makes it sound like a plain old data-corruption problem, ie,
>> trashed xmin or xmax in some tuple header.  Can you do a "select
>> count(*)" from this table without getting the error?
>>
> No, select count(*) fails at around 25 million rows.

OK, so you should be able to narrow down the corrupted row(s) and zero
them out, which'll at least let you get back the rest of the table.
See past archives for the standard divide-and-conquer approach to this.

            regards, tom lane
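
[The divide-and-conquer search Tom refers to can be sketched as a
binary search over ctid page ranges.  This is a sketch, not a command
from the thread: `find_bad_page` and `probe` are hypothetical names,
and against this database the probe would wrap a psql call such as
`psql db1 -Atc "SELECT count(*) FROM tbltimeseries WHERE ctid >=
'($1,0)' AND ctid < '($2,0)'"`.]

```shell
# Binary-search for the first unreadable page by probing ctid ranges.
# find_bad_page PROBE LO HI assumes a probe of the full range [LO, HI)
# fails; PROBE takes a half-open page range and succeeds iff it is clean.
find_bad_page() {
  probe=$1; lo=$2; hi=$3
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if "$probe" "$lo" "$mid"; then
      lo=$mid                # lower half reads fine; bad page is at mid or above
    else
      hi=$mid                # error reproduced; first bad page is below mid
    fi
  done
  echo "$lo"                 # number of the first bad page
}
```

Each probe halves the search interval, so even a 25-million-row table
needs only a few dozen queries to isolate the page.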

Re: Database corruption.

From: Michael Guerin
Tom Lane wrote:
> OK, so you should be able to narrow down the corrupted row(s) and zero
> them out, which'll at least let you get back the rest of the table.
> See past archives for the standard divide-and-conquer approach to this.
Ok, so I'm trying to track down the rows now (big table, slow queries
:( ).  How does one zero out a corrupt row?  A plain delete?  I see
references for creating the missing pg_clog file, but I don't believe
that's what you're suggesting.

-michael


Re: Database corruption.

From: Tom Lane
Michael Guerin <guerin@rentec.com> writes:
> Ok, so I'm trying to track down the rows now (big table, slow queries
> :( ).  How does one zero out a corrupt row?  A plain delete?  I see
> references for creating the missing pg_clog file, but I don't believe
> that's what you're suggesting.

Zeroing out the whole block containing it is the usual recipe.  I forget
the exact command but if you trawl the archives for mention of "dd" and
"/dev/zero" you'll probably find it.  Keep in mind you want to stop the
postmaster first, to ensure it doesn't have a copy of the bad block
cached in memory.

            regards, tom lane
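
[The half-remembered dd recipe looks roughly like the following.  A
sketch under stated assumptions: the real target would be the table's
data file under $PGDATA/base/<db-oid>/<relfilenode>, with the
postmaster stopped first; here FILE is a scratch stand-in and BAD a
hypothetical page number, so the example can run safely.]

```shell
# Overwrite one page in place with zeros, without truncating the file.
BLKSZ=8192                     # confirm with pg_controldata ("Database block size")
BAD=3                          # hypothetical bad page, found by the binary search
FILE=/tmp/scratch_relfile      # stand-in for $PGDATA/base/<db-oid>/<relfilenode>
dd if=/dev/urandom of="$FILE" bs=$BLKSZ count=8 2>/dev/null   # fake 8-page relation
dd if=/dev/zero of="$FILE" bs=$BLKSZ seek=$BAD count=1 conv=notrunc 2>/dev/null
```

conv=notrunc is the important part: without it dd would truncate the
relation right after the zeroed page.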

Re: Database corruption.

From: Michael Guerin
> Zeroing out the whole block containing it is the usual recipe.  I forget
> the exact command but if you trawl the archives for mention of "dd" and
> "/dev/zero" you'll probably find it.  Keep in mind you want to stop the
> postmaster first, to ensure it doesn't have a copy of the bad block
> cached in memory.

You're suggesting zeroing out the block in the underlying table files,
or creating the missing pg_clog file and filling it with zeros?  I just
want to clarify, because all the references I've read so far dealing
with this error talk about zapping the pg_clog file.

Would it be safer to just find the row(s) that are having problems,
create the missing clog file to get past the error, then delete these
rows from the db?

Re: Database corruption.

From: Tom Lane
Michael Guerin <guerin@rentec.com> writes:
> You're suggesting zeroing out the block in the underlying table files,
> or creating the missing pg_clog file and filling it with zeros?

The former.  Making up clog data is unlikely to help --- the bad xmin is
just the first symptom of what's probably more widespread corruption in
the table page.  If you want, once you've identified the page containing
the problem, use "pg_filedump -i -f" to dump out that page and we can
have a look to see just how bad it is.

            regards, tom lane
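
[To locate the page on disk for pg_filedump: a table is stored as 1 GB
segment files (relfilenode, relfilenode.1, ...), so a global block
number maps to a segment file and a byte offset.  A sketch assuming the
default 8k block size; BAD is a hypothetical page number.]

```shell
BLKSZ=8192                                 # from pg_controldata
SEGPAGES=$((1024 * 1024 * 1024 / BLKSZ))   # 131072 pages per 1GB segment file
BAD=1234                                   # hypothetical bad page number
seg=$((BAD / SEGPAGES))                    # which segment: relfilenode(.seg)
off=$((BAD % SEGPAGES))                    # page offset within that segment
echo "segment $seg, page $off, byte offset $((off * BLKSZ))"
```

With the segment file in hand, "pg_filedump -i -f" on it dumps the
pages, tuple headers included, for inspection.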

Re: Database corruption.

From: "Brusser, Michael"
>> Zeroing out the whole block containing it is the usual recipe.

Something like this worked for me in the past:

% dd bs=8k count=X < /dev/zero >> clog-file

I had to calculate X, because I usually had a truncated clog file and a
failed attempt to read it from offset XYZ.  And I used pg_controldata
to check the block size.

Mike.
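
[The calculation behind count=X can be sketched as follows.  The
constants are the stock clog layout (2 status bits per transaction, so
BLCKSZ*4 transactions per 8k page, and 32 pages per segment file), not
numbers taken from the thread; with a garbage xid like the one in the
error, the result can land far past the last real segment, which is
why Tom advises zeroing the table page rather than manufacturing clog
data.]

```shell
BLKSZ=8192                         # from pg_controldata
XID=3376354368                     # the xid from the error message
XACTS_PER_PAGE=$((BLKSZ * 4))      # 2 status bits per xact -> 4 xacts per byte
PAGES_PER_SEG=32                   # each clog segment file holds 32 pages
page=$((XID / XACTS_PER_PAGE))     # global clog page holding this xid
seg=$((page / PAGES_PER_SEG))      # segment file number (hex part of the name)
off=$((page % PAGES_PER_SEG))      # page offset inside that segment
X=$((off + 1))                     # pages the file must hold to cover the read
echo "segment $seg: need X=$X pages of $((BLKSZ / 1024))k"
```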