Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Date
Msg-id CAHyXU0zCezq3Zq63GEvDYebW6j8tXoKM4mk54d3jSrQDzyDMNA@mail.gmail.com
Whole thread Raw
In response to Re: emergency outage requiring database restart  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: emergency outage requiring database restart  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-#   SELECT ...
>>> ERROR:  42809: "pg_cast_oid_index" is an index
>>> LINE 11:   FROM ApartmentSample s
>>>                 ^
>>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here.  You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion.  A lot of the OIDs seemed to be
> pointing at the wrong thing.  Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted.  However
> it's more olap style and may not have been written to during the
> outage
>
> Now, this infrastructure running this system is running maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here.  Having said that, this definitely smells and feels
> like something on the infrastructure side.  I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence the source
of the damage did not originate from the database itself.
Specifically, this database is mounted on the same volume as the
operating system (I know, I know) and something non database driven
sucked up disk space very rapidly and exhausted the volume -- fast
enough that sar didn't pick it up.  Oh well :-) -- thanks for the help

merlin



pgsql-hackers by date:

Previous
From: Pavan Deolasee
Date:
Subject: Re: Indirect indexes
Next
From: Robert Haas
Date:
Subject: Re: Indirect indexes