Thread: Corrupted Data?
Hello, pg_dump started failing for one of my databases, so I looked in to it and it appears that I have some corrupted data or something. I assume this is related to a failed hard disk that was part of the linux software raid mirror. I backed up the entire data directory, and did a pg_resetxlog, but that didn't help. I found the specific row that seems to be the problem, but I can't delete it. Anyway, I don't know how to fix this, so if you could please help, I would appreciate it. Details are as follows: [dbmail2@dezeut dbmail2]$ psql Welcome to psql 7.4.2, the PostgreSQL interactive terminal. dbmail2=# SELECT version(); version --------------------------------------------------------------------------------------------------------- PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5) dbmail2=# SELECT oid,messageblk_idnr, physmessage_id, blocksize from messageblks where messageblk_idnr =7718; oid | messageblk_idnr | physmessage_id | blocksize ---------+-----------------+----------------+----------- 2916427 | 7718 | 3842 | 524288 (1 row) dbmail2=# SELECT oid,messageblk_idnr, physmessage_id, blocksize, messageblk from messageblks where messageblk_idnr =7718; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> \q [dbmail2@dezeut dbmail2]$ psql Welcome to psql 7.4.2, the PostgreSQL interactive terminal. dbmail2=# delete from messageblks where oid = 2916427; ERROR: could not access status of transaction 3822646358 DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/0E3D": No such file or directory
On Mon, 13 Sep 2004, Matthew T. O'Connor wrote: > I backed up the entire data directory, and did a pg_resetxlog, but that > didn't help. I found the specific row that seems to be the problem, but > I can't delete it. I have used TRUNCATE on the table in this situation to recover. Another option might be to DROP the table. Or perhaps restore from backups.
On Tue, 2004-09-14 at 00:46, Chester Kustarz wrote: > On Mon, 13 Sep 2004, Matthew T. O'Connor wrote: > > I backed up the entire data directory, and did a pg_resetxlog, but that > > didn't help. I found the specific row that seems to be the problem, but > > I can't delete it. > > I have used TRUNCATE on the table in this situation to recover. Another > option might be to DROP the table. Or perhaps restore from backups. I would really prefer not to do that as pg_dump has apparently been failing for a while so I would lose a fair amount of data.
On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote: > On Tue, 2004-09-14 at 00:46, Chester Kustarz wrote: > > On Mon, 13 Sep 2004, Matthew T. O'Connor wrote: > > > I backed up the entire data directory, and did a pg_resetxlog, but that > > > didn't help. I found the specific row that seems to be the problem, but > > > I can't delete it. > > > > I have used TRUNCATE on the table in this situation to recover. Another > > option might be to DROP the table. Or perhaps restore from backups. > > I would really prefer not to do that as pg_dump has apparently been > failing for a while so I would lose a fair amount of data. You can create a pg_clog file (the one it's complaining about) filled with zeros, using dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D and then you should be able to pg_dump the table (or at least find out if there is another corrupted tuple.) Beware that the corrupted tuple may be in there if it's supposed not to be, or it may not be if it's supposed to be. After you get your data back, I'd suggest running the usual hardware checking tools, and restore from the backup. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Use it up, wear it out, make it do, or do without"
Alvaro Herrera wrote: > On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote: > You can create a pg_clog file (the one it's complaining about) filled > with zeros, using > dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D Ok, I tried this, and it changed the error but hasn't fixed the problem now I get this: [dbmail2@dezeut dbmail2]$ psql Welcome to psql 7.4.2, the PostgreSQL interactive terminal. dbmail2=# delete from messageblks where messageblk_idnr = 7718; ERROR: could not access status of transaction 3822646358 DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at offset 139264: Success And in the log file I get this: ERROR: XX000: could not access status of transaction 3822646358 DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at offset 139264: Success LOCATION: SlruReportIOError, slru.c:634 Any more thoughts? Thanks again, Matthew
On Tue, Sep 14, 2004 at 08:01:13PM -0400, Matthew T. O'Connor wrote: > Alvaro Herrera wrote: > >On Tue, Sep 14, 2004 at 08:13:24AM -0400, Matthew T. O'Connor wrote: > >You can create a pg_clog file (the one it's complaining about) filled > >with zeros, using > >dd if=/dev/zero bs=1k count=8 of=/path/to/data/pg_clog/0E3D > > Ok, I tried this, and it changed the error but hasn't fixed the problem > now I get this: > > [dbmail2@dezeut dbmail2]$ psql > Welcome to psql 7.4.2, the PostgreSQL interactive terminal. > dbmail2=# delete from messageblks where messageblk_idnr = 7718; > ERROR: could not access status of transaction 3822646358 > DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0E3D" at > offset 139264: Success Huh, sorry, the directions only created the first block of the file, but you needed the 17th ... dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D I may be subject of a fencepost problem here, so if it doesn't work try with 18. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Limítate a mirar... y algun día veras"
Alvaro Herrera wrote: > Huh, sorry, the directions only created the first block of the file, but > you needed the 17th ... > > dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D > > I may be subject of a fencepost problem here, so if it doesn't work try > with 18. I don't know if we are making progress but I am getting a different error now :-) I did the dd command again this time with count=18. Now when I try to delete the tuple I get this: dbmail2=# delete from messageblks where oid = 2916427; ERROR: attempted to delete invisible tuple The Postmaster log has this to say: ERROR: XX000: attempted to delete invisible tuple LOCATION: heap_delete, heapam.c:1258 Thanks again for the help. Matthew
On Tue, Sep 14, 2004 at 10:01:21PM -0400, Matthew T. O'Connor wrote: > Alvaro Herrera wrote: > >Huh, sorry, the directions only created the first block of the file, but > >you needed the 17th ... > > > >dd if=/dev/zero bs=8k count=17 of=/path/to/data/pg_clog/0E3D > > > >I may be subject of a fencepost problem here, so if it doesn't work try > >with 18. > > I don't know if we are making progress but I am getting a different > error now :-) > > I did the dd command again this time with count=18. Now when I try to > delete the tuple I get this: > > dbmail2=# delete from messageblks where oid = 2916427; > ERROR: attempted to delete invisible tuple I think I know what is going on, but I'm not sure how to solve the problem. If I were in your situation I'd edit the data file and stash FrozenTransactionId in the Xmin and InvalidTransactionId in Xmax for that tuple. Short of using an hex editor, I'm not sure how to do that, however, and before doing anything that foolish I'd backup the file two or three times just to be sure. You may try using pgfsck (http://svana.org/kleptog/pgsql/pgfsck.html) or pg_filedump (http://sources.redhat.com/rhdb) and see how lucky you get with the hex editor ... (memories of cheating in VGA Planets by use of said hex editor many years ago now come to my mind ...) -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) Major Fambrough: You wish to see the frontier? John Dunbar: Yes sir, before it's gone.
Alvaro Herrera wrote: > On Tue, Sep 14, 2004 at 10:01:21PM -0400, Matthew T. O'Connor wrote: >>I don't know if we are making progress but I am getting a different >>error now :-) >> >>I did the dd command again this time with count=18. Now when I try to >>delete the tuple I get this: >> >>dbmail2=# delete from messageblks where oid = 2916427; >>ERROR: attempted to delete invisible tuple Well after using dd to create a few missing pg_clog files, I was finally able to do a vacuum of the whole database which allowed me to delete the problematic tuple which allowed me to do a pg_dump of the database! So finally some progress. However, I then ran into a new problem while trying to dump another database. Now I get this: dbmail=# SELECT * from messageblks ; ERROR: invalid page header in block 85646 of relation "pg_toast_2353340" Any ideas on this new issue? > You may try using pgfsck (http://svana.org/kleptog/pgsql/pgfsck.html) or > pg_filedump (http://sources.redhat.com/rhdb) and see how lucky you get > with the hex editor ... I looked at pgfsck and it seems that pgfsck was last updated for 7.3. I'll take a look at pg_filedump. Thanks again, Matthew
On Wed, Sep 15, 2004 at 02:37:13PM -0400, Matthew T. O'Connor wrote: > However, I then ran into a new problem while trying to dump another > database. Now I get this: > > dbmail=# SELECT * from messageblks ; > ERROR: invalid page header in block 85646 of relation "pg_toast_2353340" > > Any ideas on this new issue? IMO this is FUBAR ... try enabling zero_damaged_pages. Beware that data on damaged pages will be lost. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "La libertad es como el dinero; el que no la sabe emplear la pierde" (Alvarez)