Catalog corruption - Mailing list pgsql-hackers

From Mariel Cherkassky
Subject Catalog corruption
Date
Msg-id CA+t6e1k5QpakOvd-KdhcQqmWSj=2Bb+N7chh+EUk9CQ-9QJZvw@mail.gmail.com
Whole thread Raw
Responses Re: Catalog corruption
Re: Catalog corruption
List pgsql-hackers
Hi,
I sent already an email about this topic to pgsql-admins but I think that it might be more relevant to this mailing list. I'm trying to investigate a corruption that happened on a machine of one of our clients. 

A little background : 
-os Centos 6.6
-PostgreSQL v9.2.5
-I was asked to understand why the corruption happened and how can we fix it. I copied the data dir to my machine on my local pc and started investigate it. They realized that there is a problem when they tried to backup the database via pg_dump. The error they got :

pg_dump: query returned 2 rows instead of one: SELECT tableoid, oid, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = datdba) AS dba, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype, datfrozenxid, (SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, shobj_description(oid, 'pg_database') AS description FROM pg_database WHERE datname = 'db1'

So I checked and indeed there were duplicated rows in the pg_roles / pg_database tables. Moreover, there were multiple values in all the system catalogs !
The first think I did as Tom Lane suggested is to upgrade to v9.2.24. Afterwards I vacuumed all the databases but nothing helped. I tried to reindex the databases but I got the next error :

2018-08-28 21:50:03 +08 db2 24360  ERROR:  could not access status of transaction 32212695
2018-08-28 21:50:03 +08 db2 24360  DETAIL:  Could not open file "pg_subtrans/01EB": No such file or directory.

So I tried to analyze all the system catalogs manualy one by one : 

select 'VACUUM ANALYZE pg_catalog.'||table_name from information_schema.tables where table_schema = 'pg_catalog' and table_type <> 'VIEW';
                     ?column?
---------------------------------------------------
 VACUUM ANALYZE pg_catalog.pg_statistic
 VACUUM ANALYZE pg_catalog.pg_type
 VACUUM ANALYZE pg_catalog.pg_authid
 VACUUM ANALYZE pg_catalog.pg_proc
 VACUUM ANALYZE pg_catalog.pg_class
 VACUUM ANALYZE pg_catalog.pg_user_mapping
.....

and I got error only on the next system table : 

afa=#  ANALYZE pg_catalog.pg_shdepend;
ERROR:  could not access status of transaction 32212695
DETAIL:  Could not open file "pg_subtrans/01EB": No such file or directory.


When I try to reindex the table I'm getting the same error.

Solutions I tried : 

-vacuum all the databases
-reindex system db_name - > raise the same error (could not access status...)
-upgrade to v9.2.24 (still got corruption).
-Use zero_damaged_pages=0
-generate an empty subtransaction file with dd
-Trying to reindex system indexes in single mode -> raise same error.

*Tried to delete the duplicated values in some of the system tables by the ctid value, but nothing is deleted.


Any idea how can I continue ?





pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Would it be possible to have parallel archiving?
Next
From: "Jonathan S. Katz"
Date:
Subject: Re: Something's busted in plpgsql composite-variable handling