corruption in system tables (9.1.13) - Mailing list pgsql-general
From: Jan-Pieter Cornet
Subject: corruption in system tables (9.1.13)
Msg-id: 541C4BED.4030104@xs4all.net
List: pgsql-general
Hi,

One of our postgres database clusters suddenly developed a corruption in the system tables. I tried to debug this as best I could, but haven't found a root cause yet. I'm mainly seeking more pointers to attack this.

The error message that suddenly appeared for a lot of queries (but not all queries) is:

    ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619

The database was running 9.1.13 at the time (upgraded to 9.1.14 now, but the release notes for 9.1.14 do not mention this sort of corruption, and we do not use GiST indices). We cannot upgrade to 9.3.x at the moment because of dependencies on other components (it's high on the list of priorities, though).

I googled this error and found that sometimes judicious use of "reindex" and "vacuum full" might solve this, but it didn't.

The corrupt database was renamed and put aside for study. We also saved the pg_xlog files from around the time the corruption occurred. Unfortunately, we do not have pg_basebackup-like backups (yet - also high on the wishlist now). We restored to a fresh database from a pg_dump made 30 minutes before the corruption occurred, and were able to replay the missing transactions using detailed logging from the connecting system. We also made a pg_basebackup copy of the system including the faulty database.

There are no IO errors, so this does not look like disk corruption. Also, this machine uses streaming replication to replicate to a hot standby slave, and the slave is corrupt in the exact same way (producing the exact same errors). That to me is more proof that this is not caused by a faulty disk, since database changes are (as far as I know) stored in the base/ directories, and at the same time streamed to streaming replication slaves (and written to pg_xlog), right?

This system has been in production since July, and had been running fine until now. It could still be a hardware problem (memory corruption?), but if it is, it's quite rare.
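For readers hitting the same error: the number in a TOAST table's name is the OID of the relation it belongs to, and OID 2619 is pg_statistic, which holds only planner statistics derived from user data. A commonly suggested recovery path is sketched below - run it against the preserved copy first, as superuser; it restores queryability but does not explain the root cause:

```sql
-- The TOAST table name encodes the owning relation's OID:
SELECT 2619::regclass;     -- returns: pg_statistic

-- pg_statistic contains only derived statistics, so the damaged
-- rows can be discarded and rebuilt from the underlying tables:
DELETE FROM pg_statistic;  -- or delete only the rows known to be damaged
ANALYZE;                   -- repopulates pg_statistic from scratch
```

This works only because nothing in pg_statistic is authoritative data; corruption in a user table's TOAST relation would require a very different approach.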
As an example, this is what it looks like to try to query the broken database:

    sim_stuk=# \d
                    List of relations
     Schema |         Name         |   Type   | Owner
    --------+----------------------+----------+-------
     public | _dbversioning        | table    | admin
     public | _dbversioning_id_seq | sequence | admin
     public | sim_instance         | table    | admin
     public | sim_instance_id_seq  | sequence | admin
    [... works fine ...]

    sim_stuk=# \d sim_instance
    ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619
    sim_stuk=# select * from sim_instance limit 1;
    ERROR:  missing chunk number 0 for toast value 4132461 in pg_toast_2619
    sim_stuk=# select * from pg_attribute limit 1;
    ERROR:  missing chunk number 0 for toast value 4132453 in pg_toast_2619

There are some tables we can still query, though; among these is a table containing "status" fields and timestamps, so we could learn approximately how many changes were missing from the backup that we restored.

Any suggestions on how to proceed?

Thanks!

-- 
Jan-Pieter Cornet <johnpc@xs4all.net>
"Any sufficiently advanced incompetence is indistinguishable from malice." - Grey's Law
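One way to pinpoint which pg_statistic rows are affected, before deleting anything wholesale, is to force detoasting row by row and trap the errors. The DO block below is a hypothetical sketch: stavalues1..stavalues4 are the TOASTed anyarray columns present in 9.1, and casting them to text forces their TOAST data to be read.

```sql
-- Sketch: report the ctid of every pg_statistic row whose TOAST
-- data can no longer be read. Run as superuser on the saved copy.
DO $$
DECLARE
  t tid;
BEGIN
  FOR t IN SELECT ctid FROM pg_statistic LOOP
    BEGIN
      PERFORM stavalues1::text, stavalues2::text,
              stavalues3::text, stavalues4::text
        FROM pg_statistic WHERE ctid = t;
    EXCEPTION WHEN OTHERS THEN
      RAISE NOTICE 'damaged row at ctid %: %', t, SQLERRM;
    END;
  END LOOP;
END $$;
```

The reported ctids can then be deleted individually (DELETE FROM pg_statistic WHERE ctid = ...), followed by an ANALYZE of the affected tables to regenerate their statistics.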