corruption in system tables (9.1.13) - Mailing list pgsql-general

From Jan-Pieter Cornet
Subject corruption in system tables (9.1.13)
Date
Msg-id 541C4BED.4030104@xs4all.net
List pgsql-general
Hi,

One of our postgres database clusters suddenly developed corruption in the system tables. I tried to debug this as best I could, but haven't found a root cause yet. I'm mainly looking for more pointers on how to attack this.

The error message that suddenly appeared for a lot of queries (but not all queries), is:

ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619
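(For reference: the number in a toast table's name is the OID of the catalog table it belongs to, so it can be resolved in psql like this:

    SELECT 2619::regclass;
    -- returns: pg_statistic

meaning the damaged TOAST data belongs to pg_statistic.)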

The database was running 9.1.13 at the time (it has since been upgraded to 9.1.14, but the release notes for 9.1.14 do not mention this sort of corruption, and we do not use GiST indices). We cannot upgrade to 9.3.x at the moment because of dependencies on other components (it's high on the list of priorities, though).

I googled this error and found that sometimes judicious use of "reindex" and "vacuum full" might solve this, but it
didn't.
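Concretely, since pg_toast_2619 is the TOAST table for pg_statistic, what we tried was roughly:

    REINDEX TABLE pg_statistic;
    VACUUM FULL pg_statistic;

Neither made the errors go away.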

The corrupt database was renamed and put aside for study. We also saved the pg_xlog files from around the time the
corruption occurred. Unfortunately, we do not have pg_basebackup-like backups (yet - also high on the wishlist now). We
restored to a fresh database from a pg_dump made 30 minutes before the corruption occurred, and were able to replay the
missing transactions using detailed logging from the connecting system. We also made a pg_basebackup copy of the system
including the faulty database.

There are no IO errors, so this does not look like disk corruption. Also, this machine uses streaming replication to
replicate to a hot standby slave, and the slave is corrupt in the exact same way (producing the exact same errors). That
to me is more proof that this is not caused by a faulty disk, since database changes are (as far as I know) stored in
the base/ directories, and at the same time streamed to streaming replication slaves (and written to pg_xlog), right?

This system has been in production since July, and has been running fine ever since. It could still be a hardware
problem (memory corruption?), but if it is, it's quite rare.

As an example, this is what it looks like to try to query the broken database:
sim_stuk=# \d
                       List of relations
 Schema |                Name                |   Type   | Owner
--------+------------------------------------+----------+-------
 public | _dbversioning                      | table    | admin
 public | _dbversioning_id_seq               | sequence | admin
 public | sim_instance                       | table    | admin
 public | sim_instance_id_seq                | sequence | admin
[... works fine ...]

sim_stuk=# \d sim_instance
ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619
sim_stuk=# select * from sim_instance limit 1;
ERROR:  missing chunk number 0 for toast value 4132461 in pg_toast_2619
sim_stuk=# select * from pg_attribute limit 1;
ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619

There are some tables we can still query, though; among these is a table containing "status" fields and timestamps, so
we could learn approximately how many changes were missing from the backup that we restored.

Any suggestions on how to proceed? Thanks!

--
Jan-Pieter Cornet <johnpc@xs4all.net>
"Any sufficiently advanced incompetence is indistinguishable from malice."
    - Grey's Law


