Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Msg-id CAHyXU0wLgMvD_KVJyfZhACBpkfDbPEawkqbx2EObYxMt2O=kMA@mail.gmail.com
In response to Re: emergency outage requiring database restart  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>
>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>> castaging-#   SELECT ...
>> ERROR:  42809: "pg_cast_oid_index" is an index
>> LINE 11:   FROM ApartmentSample s
>>                 ^
>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Yeah.  Believe me -- I know the drill.  Most or all of the damage seemed
to be to the system catalogs, with at least two critical tables dropped
or inaccessible in some fashion.  A lot of the OIDs seemed to be
pointing at the wrong thing.  A couple more data points:

*) This database is OLTP, doing ~20 tps on average (but very bursty)
*) Another database on the same cluster was not impacted.  However,
it's more OLAP-style and may not have been written to during the
outage

Now, the infrastructure hosting this system runs maybe 100ish
postgres clusters and maybe 1000ish sql server instances, with
approximately zero unexplained data corruption issues in the 5 years
I've been here.  Having said that, this definitely smells and feels
like something on the infrastructure side.  I'll follow up if I have
any useful info.

merlin


