Bug (#3484) - Invalid page header again - Mailing list pgsql-bugs

From alex
Subject Bug (#3484) - Invalid page header again
Msg-id 47627593.2020303@clickware.de
List pgsql-bugs
Hi folks,

We reported various database problems some weeks ago.
Since then we have updated the database to release 8.2.4 and the
Linux kernel to 2.6.22.6-smp. Now we have got an error again:


IN SHORT:
- data is inserted
- the same data is read and exported successfully
- after a nightly vacuum analyze the data is corrupted and cannot be
read any more. Error message(s): see below

CONCLUSION
Apparently the data is corrupted by "vacuum analyze".


IN DETAIL:

We got an error during the nightly dump again:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid memory alloc request
size 18446744073709551610
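
A side note on the number in this message: the requested size is only slightly below 2^64, so it looks like a small negative value that was stored in an unsigned 64-bit field. The sketch below only does that arithmetic; the helper name `as_signed_64` is made up for illustration, and reading the value as a damaged length field is an assumption, not something the server reported.

```python
import struct


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one (two's complement)."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Size taken verbatim from the pg_dump error message.
reported_size = 18446744073709551610

print(as_signed_64(reported_size))
```

A negative result here would fit a length that was read from a damaged row, rather than a real request for that much memory.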

Some records in table "transaktion" (which contains about 45 million
records) are corrupted.
But the data was not corrupted at insertion time; it must have been
corrupted later.

The timeline of the error is as follows:

1. 2007/12/07 ~3:30h: The (now corrupted) data was inserted successfully
2. 2007/12/07 7h : The (now corrupted) data was read and exported
successfully!
( We run an export of the data every morning at 7h, which exports the
data we retrieved/inserted during the last 24 hours )
3. 2007/12/07 22h: Database was dumped successfully
4. 2007/12/07 23:15h: Database "vacuum analyze" was run successfully
5. 2007/12/08 22h: The database dump got the error described above:
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid memory alloc request
size 18446744073709551610
6. 2007/12/08 23h: "vacuum analyze" threw an error:
INFO:  vacuuming "public.transaktion"
WARNING:  relation "transaktion" TID 1240631/12: OID is invalid
ERROR:  invalid page header in block 1240632 of relation "transaktion"
7. 2007/12/10 : We re-ran the export of the data (which runs every
morning) for the last few days. These exports use the same
SQL commands as the automatic run.
But now we got an error when exporting the data for 2007/12/07:
ERROR: invalid memory alloc request size 18446744073709551610
The process exporting the same set of data ran successfully on the
morning of 2007/12/07.
We are very sure that the data has not been manipulated since the time
of insertion, because the error occurs on the testing system and at the
moment no tests other than inserting and exporting the data are done.
8. 2007/12/14
When we now run a SELECT over the corrupted data, we get the error
message:
ERROR:  could not access status of transaction 313765632
DETAIL:  Could not open file "pg_clog/012B": No such file or
directory.
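
For reference, the pg_clog file name in such a message follows directly from the transaction ID. The sketch below assumes a default build (8 kB pages, 2 status bits per transaction, 32 pages per clog segment file); the function name `clog_segment_file` is made up for illustration.

```python
def clog_segment_file(xid: int, block_size: int = 8192) -> str:
    """Return the pg_clog segment file name expected to hold the status of xid.

    Assumes a default build: 2 status bits per transaction (4 per byte)
    and 32 pages per segment file.
    """
    xacts_per_page = block_size * 4
    xacts_per_segment = xacts_per_page * 32
    return "%04X" % (xid // xacts_per_segment)


# Transaction IDs taken from the error messages in this report.
print(clog_segment_file(313765632))
print(clog_segment_file(2134240))
```

If the computed file lies far from the clog files that actually exist in the data directory, the transaction ID was probably read from a damaged row header and is not a real transaction.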


We are using Linux version: 2.6.22.6-smp.
Hardware system: 2 dual-core processors ( Intel(R) Xeon(TM) CPU 2.80GHz )
PostgreSQL version: 8.2.4


-------- Original Message --------
Subject: Missing pg_clog file / corrupt index / invalid page header
Date: Wed, 05 Sep 2007 08:18:31 +0200
From: alex <an@clickware.de>
Organization: click:ware GmbH
To: pgsql-bugs@postgresql.org

My colleague Marc Schablewski reported this bug (#3484) for the first
time at the end of July.
The described problem occurred twice on our database and now it has
happened again.

Summary
==========

Various errors like:
"invalid page header in block 8658 of relation",
"could not open segment 2 of relation 1663/77142409/266753945 (target
block 809775152)",
"ERROR:  could not access status of transaction 2134240 DETAIL:  could
not open file "pg_clog/0002": File not found",
"CEST PANIC:  right sibling's left-link doesn't match"

on the following system:
PostgreSQL 8.1.8
SUSE Linux Kernel 2.6.13-15.8-smp
2 Intel Xeon processors with 2 cores each
ECC RAM
Hardware RAID (mirror set)

Detailed description
=======================

The message was thrown by the nightly pg_dump:
pg_dump: ERROR:  invalid page header in block 8658 of relation
"import_data_zeilen"
pg_dump: SQL command to dump the contents of table "import_data_zeilen"
failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR:  invalid page header in block
8658 of relation "import_data_zeilen"
pg_dump: The command was: COPY public.import_data_zeilen (id, eda_id,
zeile, man_id, sta_id) TO stdout;

A manually executed dump of just the affected table completed
successfully (in the daytime!).
We were really surprised!
Also, select queries (using indexes) on the table succeeded.
(In the past, when the error occurred, select queries failed.)
So no repair seemed to be needed for the table.

The following night, the pg_dump succeeded, but the "vacuum
analyze" (executed after the pg_dump) threw the same error:

INFO:  vacuuming "public.import_data_zeilen"
ERROR:  invalid page header in block 8658 of relation "import_data_zeilen"

Any select on this table using indexes now failed
(if the result set contained the corrupted data)!

This behaviour is very confusing.

Re-creating the table solved the problem. However, the damaged rows were
lost.

We have two systems, one active, one for tests.
They are nearly identical, having similar hardware, using the same
software and running under the same load.
The errors always occurred on the active server; the test server didn't
run into errors after upgrading both servers from 8.1.3 to 8.1.8.

So even though no hardware errors were detected (neither ECC RAM errors
nor disk errors) we decided to swap the servers' roles, to find out if
it's a hardware or software problem.
This was 12 days ago.

Now we got another error, again on the active system (which now uses the
hardware from the other system except for one of the hard disks in
the RAID), which was thrown by an insert statement done by the software:
org.postgresql.util.PSQLException: ERROR: could not open segment 2 of
relation 1663/77142409/266753945 (target block 809775152): No such file
or directory.
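
To put the target block number into perspective, here is a small sketch that converts a block number into a segment file number and a byte offset. It assumes the default 8 kB block size and 1 GB segment files; the function name `locate_block` is made up for illustration.

```python
BLOCK_SIZE = 8192                 # assumed default block size in bytes
SEGMENT_SIZE = 1024 ** 3          # assumed default segment file size (1 GB)
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_block(block_no: int) -> tuple[int, int]:
    """Return (segment number, block number within that segment)."""
    return divmod(block_no, BLOCKS_PER_SEGMENT)


# Block numbers taken from the error messages in this report.
for block_no in (809775152, 1240632):
    segment, offset = locate_block(block_no)
    size_gib = block_no * BLOCK_SIZE / 1024 ** 3
    print(f"block {block_no}: segment {segment}, "
          f"block {offset} in segment, {size_gib:.1f} GiB into the relation")
```

Under these assumptions the target block would sit several terabytes into the relation, so the block number itself is most likely garbage. The segment 2 named in the error is presumably just the first segment file that does not exist.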

Obviously we have a problem with the active server.
But it's unlikely to be a hardware problem, because we changed the hard
disks and the error occurred on the same (software) system.
Also we are using ECC RAM and a RAID system (mirror set) with a hardware
RAID controller, which hasn't reported any errors.

We read the last post/thread concerning this bug. In this thread the
problem was connected to some kernel bug in 2.6.11.
We are using a newer Linux version: 2.6.13-15.8-smp.
Hardware system: 2 dual-core processors ( Intel(R) Xeon(TM) CPU 2.80GHz )
PostgreSQL version: 8.1.8

We did a lot of database maintenance 4 days ago, which among other
updates dropped about 10 indexes on one big table ( 35'000'000
records ) and created 10 other indexes (for better performance).

Given that the problem occurred on two different machines we are very
sure that it is *not* a hardware problem.

We would really appreciate any help with our problems.

Thanks in advance

     A. Nitzschke
