Thread: Corrupted Data?

Corrupted Data?

From

"Matthew T. O'Connor"

Date:

14 September 2004, 04:12:52

Hello, pg_dump started failing for one of my databases, so I looked in
to it and it appears that I have some corrupted data or something.  I
assume this is related to a failed hard disk that was part of the linux
software raid mirror.

I backed up the entire data directory, and did a pg_resetxlog, but that
didn't help.  I found the specific row that seems to be the problem, but
I can't delete it.

Anyway, I don't know how to fix this, so if you could please help, I
would appreciate it.

Details are as follows:

[dbmail2@dezeut dbmail2]$ psql
Welcome to psql 7.4.2, the PostgreSQL interactive terminal.
dbmail2=# SELECT version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
20030222 (Red Hat Linux 3.2.2-5)

dbmail2=# SELECT oid,messageblk_idnr, physmessage_id, blocksize from
messageblks where messageblk_idnr =7718;
   oid   | messageblk_idnr | physmessage_id | blocksize
---------+-----------------+----------------+-----------
 2916427 |            7718 |           3842 |    524288
(1 row)

dbmail2=# SELECT oid,messageblk_idnr, physmessage_id, blocksize,
messageblk from messageblks where messageblk_idnr =7718;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
[dbmail2@dezeut dbmail2]$ psql
Welcome to psql 7.4.2, the PostgreSQL interactive terminal.
dbmail2=# delete from messageblks where oid = 2916427;
ERROR:  could not access status of transaction 3822646358
DETAIL:  could not open file "/var/lib/pgsql/data/pg_clog/0E3D": No such
file or directory

Re: Corrupted Data?

From

Chester Kustarz

Date:

14 September 2004, 05:46:32

On Mon, 13 Sep 2004, Matthew T. O'Connor wrote:
> I backed up the entire data directory, and did a pg_resetxlog, but that
> didn't help.  I found the specific row that seems to be the problem, but
> I can't delete it.

I have used TRUNCATE on the table in this situation to recover. Another
option might be to DROP the table. Or perhaps restore from backups.

Re: Corrupted Data?

From

"Matthew T. O'Connor"

Date:

14 September 2004, 13:13:41

On Tue, 2004-09-14 at 00:46, Chester Kustarz wrote:
> On Mon, 13 Sep 2004, Matthew T. O'Connor wrote:
> > I backed up the entire data directory, and did a pg_resetxlog, but that
> > didn't help.  I found the specific row that seems to be the problem, but
> > I can't delete it.
>
> I have used TRUNCATE on the table in this situation to recover. Another
> option might be to DROP the table. Or perhaps restore from backups.

I would really prefer not to do that as pg_dump has apparently been
failing for a while so I would lose a fair amount of data.

Re: Corrupted Data?

From

Alvaro Herrera

Date:

14 September 2004, 15:13:28

On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote:
> On Tue, 2004-09-14 at 00:46, Chester Kustarz wrote:
> > On Mon, 13 Sep 2004, Matthew T. O'Connor wrote:
> > > I backed up the entire data directory, and did a pg_resetxlog, but that
> > > didn't help.  I found the specific row that seems to be the problem, but
> > > I can't delete it.
> >
> > I have used TRUNCATE on the table in this situation to recover. Another
> > option might be to DROP the table. Or perhaps restore from backups.
>
> I would really prefer not to do that as pg_dump has apparently been
> failing for a while so I would lose a fair amount of data.

You can create a pg_clog file (the one it's complaining about) filled
with zeros, using
dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D

and then you should be able to pg_dump the table (or at least find out
if there is another corrupted tuple.)  Beware that the corrupted tuple
may be in there if it's supposed not to be, or it may not be if it's
supposed to be.

After you get your data back, I'd suggest running the usual hardware
checking tools, and restore from the backup.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Use it up, wear it out, make it do, or do without"

Re: Corrupted Data?

From

"Matthew T. O'Connor"

Date:

15 September 2004, 02:07:53

Alvaro Herrera wrote:
> On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote:
> You can create a pg_clog file (the one it's complaining about) filled
> with zeros, using
> dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D

Ok, I tried this, and it changed the error but hasn't fixed the problem
now I get this:

[dbmail2@dezeut dbmail2]$ psql
Welcome to psql 7.4.2, the PostgreSQL interactive terminal.
dbmail2=# delete from messageblks where messageblk_idnr = 7718;
ERROR:  could not access status of transaction 3822646358
DETAIL:  could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at
offset 139264: Success

And in the log file I get this:

ERROR:  XX000: could not access status of transaction 3822646358
DETAIL:  could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at
offset 139264: Success
LOCATION:  SlruReportIOError, slru.c:634

Any more thoughts?

Thanks again,

Matthew

Re: Corrupted Data?

From

Alvaro Herrera

Date:

15 September 2004, 02:38:54

On Tue, Sep 14, 2004 at 08:01:13PM -0400, Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote:
> >You can create a pg_clog file (the one it's complaining about) filled
> >with zeros, using
> >dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D
>
> Ok, I tried this, and it changed the error but hasn't fixed the problem
> now I get this:
>
> [dbmail2@dezeut dbmail2]$ psql
> Welcome to psql 7.4.2, the PostgreSQL interactive terminal.
> dbmail2=# delete from messageblks where messageblk_idnr = 7718;
> ERROR:  could not access status of transaction 3822646358
> DETAIL:  could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at
> offset 139264: Success

Huh, sorry, the directions only created the first block of the file, but
you needed the 17th ...

dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D

I may be subject of a fencepost problem here, so if it doesn't work try
with 18.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Limítate a mirar... y algun día veras"

Re: Corrupted Data?

From

"Matthew T. O'Connor"

Date:

15 September 2004, 03:01:44

Alvaro Herrera wrote:
> Huh, sorry, the directions only created the first block of the file, but
> you needed the 17th ...
>
> dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D
>
> I may be subject of a fencepost problem here, so if it doesn't work try
> with 18.

I don't know if we are making progress but I am getting a different
error now :-)

I did the dd command again this time with count=18.  Now when I try to
delete the tuple I get this:

dbmail2=# delete from messageblks where oid = 2916427;
ERROR:  attempted to delete invisible tuple

The Postmaster log has this to say:

ERROR:  XX000: attempted to delete invisible tuple
LOCATION:  heap_delete, heapam.c:1258


Thanks again for the help.

Matthew

Re: Corrupted Data?

From

Alvaro Herrera

Date:

15 September 2004, 05:38:52

On Tue, Sep 14, 2004 at 10:01:21PM -0400, Matthew T. O'Connor wrote:
> Alvaro Herrera wrote:
> >Huh, sorry, the directions only created the first block of the file, but
> >you needed the 17th ...
> >
> >dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D
> >
> >I may be subject of a fencepost problem here, so if it doesn't work try
> >with 18.
>
> I don't know if we are making progress but I am getting a different
> error now :-)
>
> I did the dd command again this time with count=18.  Now when I try to
> delete the tuple I get this:
>
> dbmail2=# delete from messageblks where oid = 2916427;
> ERROR:  attempted to delete invisible tuple

I think I know what is going on, but I'm not sure how to solve the
problem.  If I were in your situation I'd edit the data file and stash
FrozenTransactionId in the Xmin and InvalidTransactionId in Xmax for
that tuple.  Short of using an hex editor, I'm not sure how to do that,
however, and before doing anything that foolish I'd backup the file two
or three times just to be sure.

You may try using pgfsck (http://svana.org/kleptog/pgsql/pgfsck.html) or
pg_filedump (http://sources.redhat.com/rhdb) and see how lucky you get
with the hex editor ...

(memories of cheating in VGA Planets by use of said hex editor many
years ago now come to my mind ...)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Major Fambrough: You wish to see the frontier?
John Dunbar: Yes sir, before it's gone.

Re: Corrupted Data?

From

"Matthew T. O'Connor"

Date:

15 September 2004, 19:37:30

Alvaro Herrera wrote:
> On Tue, Sep 14, 2004 at 10:01:21PM -0400, Matthew T. O'Connor wrote:
>>I don't know if we are making progress but I am getting a different
>>error now :-)
>>
>>I did the dd command again this time with count=18.  Now when I try to
>>delete the tuple I get this:
>>
>>dbmail2=# delete from messageblks where oid = 2916427;
>>ERROR:  attempted to delete invisible tuple

Well after using dd to create a few missing pg_clog files, I was finally
able to do a vacuum of the whole database which allowed me to delete the
problematic tuple which allowed me to do a pg_dump of the database! So
finally some progress.

However, I then ran into a new problem while trying to dump another
database.  Now I get this:

dbmail=# SELECT * from messageblks ;
ERROR:  invalid page header in block 85646 of relation "pg_toast_2353340"

Any ideas on this new issue?

> You may try using pgfsck (http://svana.org/kleptog/pgsql/pgfsck.html) or
> pg_filedump (http://sources.redhat.com/rhdb) and see how lucky you get
> with the hex editor ...

I looked at pgfsck and it seems that pgfsck was last updated for 7.3.
I'll take a look at pg_filedump.

Thanks again,

Matthew

Re: Corrupted Data?

From

Alvaro Herrera

Date:

15 September 2004, 19:48:16

On Wed, Sep 15, 2004 at 02:37:13PM -0400, Matthew T. O'Connor wrote:

> However, I then ran into a new problem while trying to dump another
> database.  Now I get this:
>
> dbmail=# SELECT * from messageblks ;
> ERROR:  invalid page header in block 85646 of relation "pg_toast_2353340"
>
> Any ideas on this new issue?

IMO this is FUBAR ... try enabling zero_damaged_pages.  Beware that data
on damaged pages will be lost.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)