Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From	Merlin Moncure
Subject	Re: emergency outage requiring database restart
Date	October 19, 2016 13:54:52
Msg-id	CAHyXU0zCezq3Zq63GEvDYebW6j8tXoKM4mk54d3jSrQDzyDMNA@mail.gmail.com
In response to	Re: emergency outage requiring database restart (Merlin Moncure <mmoncure@gmail.com>)
Responses	Re: emergency outage requiring database restart
List	pgsql-hackers

On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Merlin Moncure wrote:
>>
>>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>>> castaging-#   SELECT ...
>>> ERROR:  42809: "pg_cast_oid_index" is an index
>>> LINE 11:   FROM ApartmentSample s
>>>                 ^
>>> LOCATION:  heap_openrv_extended, heapam.c:1304
>>>
>>> should I be restoring from backups?
>>
>> It's pretty clear to me that you've got catalog corruption here.  You
>> can try to fix things manually as they emerge, but that sounds like a
>> fool's errand.
>
> Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
> to be to the system catalogs with at least two critical tables dropped
> or inaccessible in some fashion.  A lot of the OIDs seemed to be
> pointing at the wrong thing.  Couple more datapoints here.
>
> *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
> *) Another database on the same cluster was not impacted.  However
> it's more olap style and may not have been written to during the
> outage
>
> Now, this infrastructure running this system is running maybe 100ish
> postgres clusters and maybe 1000ish sql server instances with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here.  Having said that, this definitely smells and feels
> like something on the infrastructure side.  I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence that the
damage did not originate from the database itself.  Specifically, this
database is mounted on the same volume as the operating system (I know,
I know) and something non-database-driven sucked up disk space very
rapidly and exhausted the volume -- fast enough that sar didn't pick it
up.  Oh well :-) -- thanks for the help.

merlin
