Bug (#3484) - Invalid page header again - Mailing list pgsql-bugs
| From | alex |
|---|---|
| Subject | Bug (#3484) - Invalid page header again |
| Date | |
| Msg-id | 47627593.2020303@clickware.de |
| Responses | Re: Bug (#3484) - Invalid page header again<br>Re: Bug (#3484) - Invalid page header again |
| List | pgsql-bugs |
Hi folks,

we reported various database problems some weeks ago. Since then we have updated the database to release 8.2.4 and the Linux kernel to 2.6.22.6-smp. Now we got an error again.

IN SHORT:

- data is inserted
- the same data is read and exported successfully
- after a nightly "vacuum analyze" the data is corrupted and cannot be read any more (error messages: see below)

CONCLUSION: Apparently the data is corrupted by "vacuum analyze".

IN DETAIL:

We got an error during the nightly dump again:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551610

Some records in table "transaktion" (which contains about 45 million records) are corrupted. But the data was not corrupted at insertion time; it must have been corrupted later. Let us explain the track of the error:

1. 2007/12/07 ~3:30h: The (now corrupted) data was inserted successfully.
2. 2007/12/07 7h: The (now corrupted) data was read and exported successfully! (We run an export every morning at 7h, covering the data retrieved/inserted during the last 24 hours.)
3. 2007/12/07 22h: The database was dumped successfully.
4. 2007/12/07 23:15h: "vacuum analyze" ran successfully.
5. 2007/12/08 22h: The database dump failed with the error described above:
   pg_dump: SQL command failed
   pg_dump: Error message from server: ERROR: invalid memory alloc request size 18446744073709551610
6. 2007/12/08 23h: "vacuum analyze" threw an error:
   INFO: vacuuming "public.transaktion"
   WARNING: relation "transaktion" TID 1240631/12: OID is invalid
   ERROR: invalid page header in block 1240632 of relation "transaktion"
7. 2007/12/10: We restarted the export of the data (which runs every morning) for the last few days. These exports use the same SQL commands as the automatic run. But now we got an error when exporting the data for 2007/12/07.
   ERROR: invalid memory alloc request size 18446744073709551610

   The process exporting the same set of data had run successfully on the morning of 2007/12/07. We are very sure that the data has not been manipulated since the time of insertion, because the error occurs on the testing system, and at the moment no tests other than inserting and exporting the data are being done.

8. 2007/12/14: When we now run a select over the corrupted data, we get the error message:
   ERROR: could not access status of transaction 313765632
   DETAIL: Could not open file "pg_clog/012B": No such file or directory.

We are using Linux version 2.6.22.6-smp. Hardware system: 2 dual-core processors (Intel(R) Xeon(TM) CPU 2.80GHz). PostgreSQL version: 8.2.4.

-------- Original message --------
Subject: Missing pg_clog file / corrupt index / invalid page header
Date: Wed, 05 Sep 2007 08:18:31 +0200
From: alex <an@clickware.de>
Organisation: click:ware GmbH
To: pgsql-bugs@postgresql.org

My colleague Marc Schablewski first reported this bug (#3484) at the end of July. The described problem had already occurred twice on our database, and now it has happened again.

Summary
==========

Various errors like:

- "invalid page header in block 8658 of relation"
- "could not open segment 2 of relation 1663/77142409/266753945 (target block 809775152)"
- "ERROR: could not access status of transaction 2134240 DETAIL: could not open file "pg_clog/0002": File not found"
- "CEST PANIC: right sibling's left-link doesn't match"

on the following system:

- PostgreSQL 8.1.8
- SuSE Linux, kernel 2.6.13-15.8-smp
- 2 Intel Xeon processors with 2 cores each
- ECC RAM
- hardware RAID (mirror set)

Detailed description
=======================

The message was thrown by the nightly pg_dump:

pg_dump: ERROR: invalid page header in block 8658 of relation "import_data_zeilen"
pg_dump: SQL command to dump the contents of table "import_data_zeilen" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: invalid page header in block 8658 of relation "import_data_zeilen"
pg_dump: The command was: COPY public.import_data_zeilen (id, eda_id, zeile, man_id, sta_id) TO stdout;

A manually executed dump of just the affected table succeeded (at daytime!). We were really surprised! Select queries (using indexes) on the table also succeeded; in the past, when this error occurred, select queries had failed. So no repair seemed to be needed for the table.

The following night, the pg_dump succeeded, but the "vacuum analyze" (executed after the pg_dump) threw the same error:

INFO: vacuuming "public.import_data_zeilen"
ERROR: invalid page header in block 8658 of relation "import_data_zeilen"

Any select on this table using indexes now failed (if the result set contained the corrupted data). This behaviour is very confusing. Re-creating the table solved the problem; however, the damaged rows were lost.

We have two systems, one active and one for tests. They are nearly identical: they have similar hardware, use the same software, and run under the same load. The errors always occurred on the active server; the test server did not run into errors after we upgraded both servers from 8.1.3 to 8.1.8. So even though no hardware errors were detected (neither ECC RAM errors nor disk errors), we decided to swap the servers' roles to find out whether it is a hardware or a software problem. That was 12 days ago.

Now we got another error, again on the active system (which now uses the hardware from the other system, except for one of the hard disks in the RAID). It was thrown by an insert statement issued by the software:

org.postgresql.util.PSQLException: ERROR: could not open segment 2 of relation 1663/77142409/266753945 (target block 809775152): No such file or directory.

Obviously we have a problem with the active server.
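An "invalid page header" error like the ones above means the server's PageHeaderIsValid() check rejected the block before using it. A rough offline version of that check can be sketched as follows; the field layout here is an assumption for the pre-8.3 on-disk page format (pd_lsn, pd_tli, pd_lower, pd_upper, pd_special, pd_pagesize_version), and little-endian byte order plus the default BLCKSZ of 8192 are assumed as well:

```python
# Sketch: offline sanity check of a heap page header, loosely mirroring
# PageHeaderIsValid(). The struct layout below is an assumption for the
# pre-8.3 page format; it is illustrative, not authoritative.
import struct

BLCKSZ = 8192
# pd_lsn (xlogid, xrecoff), pd_tli, pd_lower, pd_upper, pd_special,
# pd_pagesize_version -- assumed little-endian
HEADER_FMT = "<IIHHHHH"
HEADER_SIZE = struct.calcsize(HEADER_FMT)   # 18 bytes of header fields

def page_header_ok(page: bytes) -> bool:
    if len(page) != BLCKSZ:
        return False
    _, _, _, lower, upper, special, size_ver = struct.unpack(
        HEADER_FMT, page[:HEADER_SIZE])
    # The free-space pointers must be ordered and lie inside the block,
    # and the page size embedded in pd_pagesize_version must match BLCKSZ.
    return (HEADER_SIZE <= lower <= upper <= special <= BLCKSZ
            and (size_ver & 0xFF00) == BLCKSZ)

# Usage sketch: scan a relation segment file block by block
# (the path below is hypothetical):
# with open("base/<dboid>/<relfilenode>", "rb") as f:
#     for n, page in enumerate(iter(lambda: f.read(BLCKSZ), b"")):
#         if not page_header_ok(page):
#             print("suspect page header in block", n)
```

A scan like this only flags blocks the server would also reject; deciding whether a flagged block is salvageable still needs a tool such as pg_filedump or manual inspection.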
But it is unlikely to be a hardware problem, because we changed the hard disks and the error occurred on the same (software) system. Also, we are using ECC RAM and a RAID system (mirror set) with a hardware RAID controller, which has not reported any errors.

We read the last post/thread concerning this bug. In that thread the problem was connected to a kernel bug in 2.6.11. We are using a higher Linux version: 2.6.13-15.8-smp. Hardware system: 2 dual-core processors (Intel(R) Xeon(TM) CPU 2.80GHz). PostgreSQL version: 8.1.8.

We did a lot of database maintenance 4 days ago, which among other updates dropped about 10 indexes on one big table (35,000,000 records) and created some 10 other indexes (for better performance).

Given that the problem occurred on two different machines, we are very sure that it is *not* a hardware problem.

We would really appreciate any help with our problems. Thanks in advance,

A. Nitzschke
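Two of the numbers reported above can be decoded offline, which supports the data-corruption reading. This is a sketch; the constants assume a default build (BLCKSZ 8192 and the stock CLOG layout of 2 status bits per transaction and 32 pages per segment file):

```python
# 1. "invalid memory alloc request size 18446744073709551610":
#    reinterpreted as a signed 64-bit value this is -6, i.e. a garbage
#    (negative) length word was read from a damaged tuple and passed
#    to the allocator as a huge unsigned size.
alloc_request = 18446744073709551610
assert alloc_request == 2**64 - 6
signed = alloc_request - 2**64       # reinterpret as signed 64-bit
print(signed)                        # -6

# 2. "could not access status of transaction 313765632 ...
#     Could not open file 'pg_clog/012B'":
#    each CLOG byte holds 4 transaction states (2 bits each), each 8 KB
#    page holds 32768 of them, and each segment file holds 32 pages,
#    so one segment covers 1048576 transaction ids.
XACTS_PER_PAGE = 8192 * 4
PAGES_PER_SEGMENT = 32
xid = 313765632
segment = xid // (XACTS_PER_PAGE * PAGES_PER_SEGMENT)
print(f"{segment:04X}")              # 012B -- matches the missing file
```

So the missing pg_clog segment name is exactly what that transaction id maps to; the clog file was not mislaid, rather a corrupted tuple carries a bogus xmin/xmax pointing at a clog range that never existed.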